In this post I will dive into topic: Spark DataFrame Cache And Persist StorageLevel. I will present you the main differences, use cases, advantages and disadvantages.
Table of Contents
Spark Cache and Persist are optimization strategies used to increase task performance in DataFrame and Dataset for iterative and interactive Spark applications. This article will teach you how to utilise the
persist() methods in DataFrame and DataSet, as well as the distinction between caching and persistance and how to apply these strategies to DataFrame and DataSet using Scala examples.
Although Spark delivers quicker computation than standard Map Reduce tasks, performance may suffer when dealing with massive volumes of data if the jobs are not intended to reuse recurring calculations. As a result, using optimization techniques to increase speed is critical.
persist() methods provide an optimization mechanism for storing intermediate computations of a Spark DataFrame so that they can be reused in later operations. When a dataset is persistent, each node keeps its partitioned data in memory and reuses it in subsequent operations on that dataset. Furthermore, Spark’s persistent data on nodes is fault-tolerant, which means that if any partition of a dataset is lost, it will be recomputed automatically using the original transformations that formed it.
In Apache Spark, the
StorageLevel is an enumeration of different storage options for caching DataFrames and RDDs (Resilient Distributed Datasets) in memory. The different storage levels available include:
- MEMORY_ONLY: stores the data in memory as Java objects.
- MEMORY_ONLY_SER: stores the data in memory as Java objects and also serializes the data.
- MEMORY_ONLY_DISK: stores the data in memory as Java objects and spills the data to disk if the memory is full.
- MEMORY_ONLY_SER_DISK: stores the data in memory as serialized Java objects and spills the data to disk if the memory is full.
- DISK_ONLY: stores the data to disk only.
When caching a DataFrame or RDD, you can specify the storage level to use with the
StorageLevel parameter of the Spark
cache() or Spark
persist() methods. For example, you can use
.cache(StorageLevel.MEMORY_ONLY) to cache a DataFrame in memory only.
Spark 2.4 Vs Spark 3.1
Here is a list of the storage levels available in Spark 2.4:
Advantages Of Caching DataFrame
- Improved performance: Spark may avoid the overhead of recomputing the DataFrame for each subsequent operation by caching and storing DataFrames in memory, which can dramatically increase the speed of iterative and interactive Spark applications.
- Reduced data loading time: When a DataFrame is cached and/or persisted, it remains in memory, which means that if the DataFrame is needed again, Spark may access it rapidly without having to reload the data.
- Efficient use of resources: Caching and storing DataFrames in memory enables Spark to make better use of resources by avoiding recalculating the same DataFrame many times.
- Better fault-tolerance: When a DataFrame is persisted, each node maintains its partitioned data in memory, and Spark’s persistent data on nodes is fault-tolerant, which means that if any partition of a Dataset is lost, it will be recomputed automatically using the original transformations that formed it.
- Optimization of the Spark’s DAG (Directed Acyclic Graph) execution: When Spark computes a DataFrame, it constructs a DAG of stages, and caching and persisting can assist Spark in optimising the execution of these stages.
Spark DataFrame Cache And Persist StorageLevel
To cache a DataFrame, you can use the Spark
cache() or Spark
persist() method. The PySpark
cache() method stores the data in memory with the default storage level, while the PySpark
persist() method allows you to specify the storage level. For example, you can use
StorageLevel.MEMORY_ONLY to store the data in memory, or
StorageLevel.MEMORY_AND_DISK to store the data in memory and disk.
Here are some examples of caching and persist operations on a Spark DataFrame:
- Caching a DataFrame in memory:
- Persist a DataFrame in memory and disk:
from pyspark import StorageLevel df.persist(StorageLevel.MEMORY_AND_DISK)
- Persist a DataFrame in memory and disk, with serialization:
- Unpersist a DataFrame:
Note: caching and persist are used for storing the DataFrame in memory so that it can be reused across multiple operations, instead of re-reading from disk or recomputing every time it is used.
PySpark Cache Temp View
# Create a DataFrame df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"]) # Create a temporary view df.createOrReplaceTempView("people") # Cache the view spark.catalog.cacheTable("people")
This will cache the “people” view in memory, so that subsequent queries on the view will be faster.
You can also use the
spark.sql() method to cache the view. Here is an example:
# Create a DataFrame df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"]) # Create a temporary view df.createOrReplaceTempView("people") # Cache the view spark.sql("CACHE TABLE people")
Please note that the
sql("CACHE TABLE") only cache the table metadata and not the data, so if you have a huge table with a lot of data, you should use
cache() on the DataFrame to cache the actual data.
Spark caching is a technique for storing data in memory for speedier access. Caching a DataFrame causes Spark to save the data in memory and label it as read-only. This allows Spark to retrieve the data rapidly without having to read it from disc or recompute it, which can significantly enhance the speed of your Spark task.
It’s important to note that if you cache a DataFrame, Spark will not immediately delete it from memory if it runs out of memory. To free up memory, if you are running many tasks with huge DataFrames, you may need to manually unpersist() the DataFrames you are no longer using.
Could You Please Share This Post? I appreciate It And Thank YOU! :) Have A Nice Day!
We are sorry that this post was not useful for you!
Let us improve this post!
Tell us how we can improve this post?