There is no performance difference between these two methods.
Let’s consider two methods to read the data from the same Hive table. For both the execution plan will be the same, because for both the Catalyst optimilizator and Tangsten engine will be used, which were available since Spark 2.0 (Tangsten). In the future I will prepare posts about these two buzzwords in Spark world (Catalyst and Tangsten).
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other")) val testDataSetDF = spark.sqlContext.sql("SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END age_label FROM bigdata_etl.some_dataset")
The difference is only in the syntax. Choose the one that’s closer to you. Personally, I think that you should use the first version. The second option with writing an SQL query has the advantage that we can refer to temporary tables in the query.
// Refer to temporary tables val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other")) testDataSetDF.createOrReplaceTempView("someDataSetTempView") val only30thPersons = spark.sqlContext.sql("SELECT * FROM someDataSetTempView where age_label = 'thirty-year-old person'")
If you enjoyed this post please add the comment below or share this post on your Facebook, Twitter, LinkedIn or another social media webpage.
Thanks in advanced!