PySpark / Spark DataFrame Cache And Persist StorageLevel – Learn The Most Powerful Spark Feature In 5 Min!


In this post I will dive into the topic of Spark DataFrame cache and persist storage levels. I will present the main differences, use cases, advantages and disadvantages.

Introduction

Spark cache() and persist() are optimization techniques used to improve task performance in iterative and interactive Spark applications that work with DataFrames and Datasets. This article will teach you how to use the cache() and persist() methods, the distinction between caching and persistence, and how to apply these techniques, with PySpark examples.

Although Spark delivers faster computation than traditional MapReduce jobs, performance may suffer when dealing with massive volumes of data if jobs are not designed to reuse repeated computations. Applying these optimization techniques is therefore critical.

Spark’s cache() and persist() methods provide an optimization mechanism for storing intermediate computations of a Spark DataFrame so that they can be reused in later operations. When a dataset is persisted, each node keeps its partitioned data in memory and reuses it in subsequent operations on that dataset. Furthermore, Spark’s persisted data on nodes is fault-tolerant, which means that if any partition of a dataset is lost, it will be recomputed automatically using the original transformations that formed it.

Spark StorageLevel

In Apache Spark, the StorageLevel is an enumeration of different storage options for caching DataFrames and RDDs (Resilient Distributed Datasets) in memory. The different storage levels available include:

When caching a DataFrame or RDD, you can choose the storage level by passing a StorageLevel to the Spark persist() method; cache() takes no arguments and always uses the default level. For example, you can use df.persist(StorageLevel.MEMORY_ONLY) to keep a DataFrame in memory only.

Spark 2.4 Vs Spark 3.1

Here is a list of the storage levels available in Spark 2.4 (Scala/JVM API):

  • NONE
  • MEMORY_ONLY / MEMORY_ONLY_2
  • MEMORY_ONLY_SER / MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK / MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER / MEMORY_AND_DISK_SER_2
  • DISK_ONLY / DISK_ONLY_2
  • OFF_HEAP

Spark 3.1

Spark 3.1 offers all of the above plus:

  • DISK_ONLY_3 (three disk replicas, added in Spark 3.0)
  • MEMORY_AND_DISK_DESER (kept deserialized in memory, added in Spark 3.1)

The _2 suffix means each partition is replicated on two cluster nodes. Note that in PySpark the data is always serialized on the Python side, so pyspark.StorageLevel does not expose the _SER variants – it provides MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP and their replicated versions.

Advantages Of Caching DataFrame

There are several advantages to caching and persisting DataFrames in Apache Spark:

  1. Improved performance: Spark may avoid the overhead of recomputing the DataFrame for each subsequent operation by caching and storing DataFrames in memory, which can dramatically increase the speed of iterative and interactive Spark applications.
  2. Reduced data loading time: When a DataFrame is cached and/or persisted, it remains in memory, which means that if the DataFrame is needed again, Spark may access it rapidly without having to reload the data.
  3. Efficient use of resources: Caching and storing DataFrames in memory enables Spark to make better use of resources by avoiding recalculating the same DataFrame many times.
  4. Better fault-tolerance: When a DataFrame is persisted, each node maintains its partitioned data in memory, and Spark’s persistent data on nodes is fault-tolerant, which means that if any partition of a Dataset is lost, it will be recomputed automatically using the original transformations that formed it.
  5. Optimization of the Spark’s DAG (Directed Acyclic Graph) execution: When Spark computes a DataFrame, it constructs a DAG of stages, and caching and persisting can assist Spark in optimising the execution of these stages.

Spark DataFrame Cache And Persist StorageLevel

To cache a DataFrame, you can use the Spark cache() or Spark persist() method. The PySpark cache() method stores the data with the default storage level (MEMORY_AND_DISK for a DataFrame, MEMORY_ONLY for an RDD), while the PySpark persist() method allows you to specify the storage level. For example, you can use StorageLevel.MEMORY_ONLY to store the data in memory only, or StorageLevel.MEMORY_AND_DISK to store the data in memory and spill to disk.

Here are some examples of caching and persist operations on a Spark DataFrame:

  1. Caching a DataFrame in memory:
df.cache()
  2. Persist a DataFrame with the default storage level (MEMORY_AND_DISK for DataFrames):
df.persist()
  3. Persist a DataFrame in memory and on disk:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
  4. Persist a DataFrame on disk only (PySpark always serializes data on the Python side, so the _SER levels are not available in PySpark):
df.persist(StorageLevel.DISK_ONLY)
  5. Unpersist a DataFrame:
df.unpersist()

Note: cache() and persist() mark the DataFrame for storage in memory (and/or on disk) so that it can be reused across multiple operations instead of being re-read from disk or recomputed every time. Both are lazy – the data is actually stored the first time an action runs on the DataFrame.

PySpark Cache Temp View

In PySpark, you can cache a temporary view by first creating the view using the createOrReplaceTempView() method, and then calling the cacheTable() method on the SparkSession’s catalog. Here is an example:

# Create a DataFrame
df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"])

# Create a temporary view
df.createOrReplaceTempView("people")

# Cache the view
spark.catalog.cacheTable("people")

This will cache the “people” view in memory, so that subsequent queries on the view will be faster.

You can also use the spark.sql() method to cache the view. Here is an example:

# Create a DataFrame
df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"])

# Create a temporary view
df.createOrReplaceTempView("people")

# Cache the view
spark.sql("CACHE TABLE people")

Please note that cacheTable() and sql("CACHE TABLE") do cache the underlying data of the view, just like df.cache(). The difference is in when the data is materialized: cacheTable(), like df.cache(), is lazy – the data is stored on the first scan of the table – while the SQL CACHE TABLE statement is eager by default (use CACHE LAZY TABLE to defer materialization).

Summary

Spark caching is a technique for keeping data in memory for faster access. Caching a DataFrame tells Spark to store its data in memory the first time an action computes it. Spark can then serve later operations from that stored copy without having to read it from disk or recompute it, which can significantly improve the performance of your Spark jobs.

It’s important to note that cached DataFrames count against executor storage memory: when Spark runs low, it evicts cached partitions in least-recently-used order (or spills them to disk, depending on the storage level). If you are running many jobs with huge DataFrames, you should still call unpersist() on DataFrames you no longer need, to free memory deterministically.
