CreateOrReplaceTempView Performance: Is There a Difference Between the Apache Spark SQL, DataFrame and Dataset APIs?

Let’s consider two ways to read data from the same Hive table: the DataFrame API and a SQL query. Both produce the same execution plan, because both go through the Catalyst optimiser and the Tungsten engine (Tungsten’s whole-stage code generation has been available since Spark 2.0). In the future I will prepare separate posts about these two buzzwords of the Spark world, Catalyst and Tungsten.

In Apache Spark, the createOrReplaceTempView method is used to create a temporary view based on a DataFrame. A temporary view is a transient view that is created and used within a single Spark session and is not persisted to the external metastore. It allows you to run SQL queries on the data in the DataFrame. Creating the view does not copy or cache any data; it only registers the DataFrame’s query plan under a name.

Introduction

Before we start, we need to be aware of two major parts of Apache Spark: Catalyst and Tungsten.

Apache Spark’s Catalyst and Tungsten projects are both designed to improve the performance of Spark applications. However, they operate at different stages of the query execution pipeline and serve different purposes.

Spark Catalyst

In Apache Spark, Catalyst is the name of the query optimization engine. It is responsible for converting Spark’s high-level queries (SQL statements as well as DataFrame and Dataset operations) into an optimized logical plan, and then into a physical plan that the Spark execution engine can run.

Catalyst consists of a set of optimization rules that are applied to the logical plan of a query to make it more efficient. These rules are applied by the Catalyst optimizer, which is a rule-based optimizer that uses pattern matching to identify opportunities for optimization and applies the appropriate rules to transform the logical plan.
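
To make the idea of rule-based rewriting concrete, here is a toy sketch in plain Python. It is not Catalyst's code: the Lit, Col and Add classes and both rules are made up for illustration. It only shows the general pattern of matching a tree shape, rewriting it, and repeating until nothing changes.

```python
from dataclasses import dataclass


# Toy expression tree. These classes are illustrative, not Catalyst's own.
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rule 1: replace Add(Lit, Lit) with a single literal."""
    if isinstance(expr, Add) and isinstance(expr.left, Lit) and isinstance(expr.right, Lit):
        return Lit(expr.left.value + expr.right.value)
    return expr


def drop_add_zero(expr):
    """Rule 2: rewrite x + 0 to x."""
    if isinstance(expr, Add) and expr.right == Lit(0):
        return expr.left
    return expr


RULES = [fold_constants, drop_add_zero]


def transform_up(expr, rule):
    """Apply one rule bottom-up: rewrite the children first, then the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def optimize(expr, max_iterations=10):
    """Run every rule repeatedly until the tree stops changing."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in RULES:
            new_expr = transform_up(new_expr, rule)
        if new_expr == expr:
            break
        expr = new_expr
    return expr


# (age + (1 + 2)) + 0 is simplified to age + 3
plan = Add(Add(Col("age"), Add(Lit(1), Lit(2))), Lit(0))
optimized = optimize(plan)
print(optimized)
```

Keeping each rule as a small, independent function is what makes this style easy to extend: a new optimization is one more entry in the rule list.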

Catalyst also includes a cost-based optimizer, which uses statistical information about the data to choose the most efficient execution plan for a given query. The cost-based optimizer is disabled by default, but can be enabled by setting the spark.sql.cbo.enabled configuration property to true. It relies on table and column statistics, which have to be collected beforehand with ANALYZE TABLE ... COMPUTE STATISTICS.

Catalyst is an important part of Spark’s query execution pipeline, as it is responsible for ensuring that queries are executed as efficiently as possible. It plays a key role in the performance of Spark applications and is an important factor to consider when optimizing Spark queries.

Spark Tungsten Engine

Apache Spark’s Tungsten project is a set of optimizations that are designed to improve the performance of Spark applications. Tungsten aims to make Spark more efficient by reducing the amount of memory and CPU resources that are required to execute a query, and by improving the speed of data processing.

Tungsten uses a number of techniques to achieve these improvements, including:

  • Explicit memory management: Tungsten keeps data in a compact binary format that Spark manages itself, instead of as ordinary Java objects. This reduces the overhead of garbage collection, and the data can optionally be placed off-heap by setting spark.memory.offHeap.enabled (disabled by default).
  • Code generation: Tungsten generates optimized code for data processing tasks at runtime (whole-stage code generation), which can improve the speed of query execution.
  • Columnar data storage: where data is held in a columnar format, for example in cached DataFrames or when reading Parquet files, Spark can process only the columns that are needed for a query, rather than reading the entire row. This can significantly reduce the amount of memory and CPU resources that are required to execute a query.

Tungsten is an important part of Spark’s execution engine and is responsible for many of the performance improvements that have been made to Spark in recent years.

CreateOrReplaceTempView Performance

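import org.apache.spark.sql.functions.when
import spark.implicits._ // enables the $"column" syntax

// Method 1: DataFrame API (column equality is ===, not =)
val testDataSetDF = spark.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))
// Method 2: SQL query
val testDataSetSqlDF = spark.sql("SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END AS age_label FROM bigdata_etl.some_dataset")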

The only difference is the syntax, so choose the one you are more comfortable with. Personally, I recommend the first version. The second option, writing an SQL query, does have one advantage: the query can refer to temporary tables (temporary views), as in the following example.

// Refer to temporary tables (needs org.apache.spark.sql.functions.when and spark.implicits._ in scope)
val testDataSetDF = spark.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))
testDataSetDF.createOrReplaceTempView("someDataSetTempView")
val only30thPersons = spark.sql("SELECT * FROM someDataSetTempView WHERE age_label = 'thirty-year-old person'")

CreateOrReplaceTempView In PySpark

Because the Spark API is very similar in Scala and Python, the createOrReplaceTempView call in PySpark looks essentially the same, as the example below shows.

df.createOrReplaceTempView("someDataSetTempView")

CreateOrReplaceTempView PySpark Example

Here is an example of how to use createOrReplaceTempView to create a temporary view in Spark:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)], ["id", "name", "age"])

# Create a temporary view based on the DataFrame
df.createOrReplaceTempView("people")

# Run a SQL query on the temporary view
result = spark.sql("SELECT * FROM people WHERE age > 15")

# Print the results
result.show()

This will create a temporary view named “people” based on the DataFrame df, and then run a SQL query that selects all rows from the “people” view where the “age” column is greater than 15. The results of the query will be displayed using the show method.

Summary

As described above, there is no performance difference between the two approaches, because both go through the same optimisations.

To sum up:

Catalyst is the name of Spark’s query optimization engine. It converts Spark’s high-level queries into an optimized logical plan, and then into a physical plan that the Spark execution engine can run. Catalyst consists of a set of optimization rules that are applied to the logical plan of a query to make it more efficient. It also includes a cost-based optimizer, which uses statistical information about the data to choose the most efficient execution plan for a given query.

Tungsten, on the other hand, is a set of optimizations that are designed to improve the performance of the Spark execution engine. It uses techniques such as explicit (optionally off-heap) memory management, code generation, and columnar data storage to reduce the amount of memory and CPU resources that are required to execute a query, and to improve the speed of data processing.

In short, Catalyst is responsible for optimizing the plan of a query, while Tungsten is responsible for improving the performance of the execution engine. Both are important for the overall performance of Spark applications.
