In this post I will dive into the topic of Spark DataFrame Cache and Persist StorageLevel. I will present the main differences, use cases, advantages and disadvantages.
Introduction
Spark Cache and Persist are optimization techniques used to improve the performance of iterative and interactive Spark applications that work with DataFrames and Datasets. This article will teach you how to use the cache() and persist() methods on a DataFrame and Dataset, the distinction between caching and persistence, and how to apply these techniques using PySpark examples.
Although Spark delivers faster computation than standard MapReduce jobs, performance may suffer when dealing with massive volumes of data if jobs are not designed to reuse repeated computations. As a result, applying optimization techniques to improve performance is critical.
Spark’s cache() and persist() methods provide an optimization mechanism for storing the intermediate computation of a Spark DataFrame so that it can be reused in later operations. When a dataset is persisted, each node keeps its partitioned data in memory (or on disk, depending on the storage level) and reuses it in subsequent operations on that dataset. Furthermore, Spark’s persisted data on nodes is fault-tolerant: if any partition of a dataset is lost, it will be recomputed automatically using the original transformations that created it.
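As a minimal PySpark sketch (the appName and the generated DataFrame are made up for illustration), caching lets later actions reuse the result of earlier transformations instead of recomputing them:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Any DataFrame that is expensive to recompute benefits in the same way
df = spark.range(0, 1000000).withColumn("squared", F.col("id") * F.col("id"))

df.cache()                                   # mark the DataFrame for caching (lazy)
df.count()                                   # the first action materializes the cache
df.filter(F.col("squared") > 100).count()    # later actions reuse the cached data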
Spark StorageLevel
In Apache Spark, StorageLevel is an enumeration of the different storage options available for caching DataFrames and RDDs (Resilient Distributed Datasets). The main storage levels include (a short snippet after this list shows how they are composed):
- MEMORY_ONLY: stores the data in memory as deserialized Java objects.
- MEMORY_ONLY_SER: stores the data in memory as serialized Java objects (one byte array per partition), which is more space-efficient but more CPU-intensive to read.
- MEMORY_AND_DISK: stores the data in memory as deserialized Java objects and spills partitions that do not fit to disk.
- MEMORY_AND_DISK_SER: stores the data in memory as serialized Java objects and spills partitions that do not fit to disk.
- DISK_ONLY: stores the data on disk only.
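If it helps to see what these levels are made of, each PySpark StorageLevel constant is a combination of flags: use disk, use memory, use off-heap, deserialized, and replication. A quick way to inspect them (the exact flag values can differ between Spark versions and between the Scala and Python APIs):
from pyspark import StorageLevel

# Each constant prints as StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
print(StorageLevel.MEMORY_ONLY)
print(StorageLevel.MEMORY_AND_DISK)
print(StorageLevel.DISK_ONLY)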
When caching a DataFrame or RDD, you can specify the storage level by passing a StorageLevel argument to the persist() method; cache() takes no arguments and always uses the default level. For example, you can use .persist(StorageLevel.MEMORY_ONLY) to keep a DataFrame in memory only.
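As a small, hedged sketch of setting an explicit level in PySpark (the DataFrame built with spark.range() is just a stand-in for your own data):
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

# persist() accepts an explicit StorageLevel; cache() takes no arguments
df.persist(StorageLevel.MEMORY_ONLY)
df.count()                 # materialize the cached data
print(df.storageLevel)     # shows the storage level in effect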
Spark 2.4 Vs Spark 3.1
Here is a list of the storage levels available in Spark 2.4 (these are the constants of the Scala/Java StorageLevel object; the PySpark StorageLevel exposes a similar but smaller set, since data is always serialized on the Python side). The snippet after these lists shows how to check what your own version exposes:
- MEMORY_ONLY (and the replicated variant MEMORY_ONLY_2)
- MEMORY_ONLY_SER (and MEMORY_ONLY_SER_2)
- MEMORY_AND_DISK (and MEMORY_AND_DISK_2)
- MEMORY_AND_DISK_SER (and MEMORY_AND_DISK_SER_2)
- DISK_ONLY (and DISK_ONLY_2)
- OFF_HEAP
Spark 3.1 offers the same levels plus:
- DISK_ONLY_3 (data stored on disk with three replicas)
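Because the available constants differ between releases, one simple way to check what your own PySpark version exposes is to inspect the StorageLevel class directly; a minimal sketch:
from pyspark import StorageLevel

# List the StorageLevel constants exposed by the running PySpark version
levels = [name for name in dir(StorageLevel) if name.isupper()]
print(levels)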
Advantages Of Caching DataFrame
There are several advantages to caching and persisting DataFrames in Apache Spark:
- Improved performance: by keeping a DataFrame in memory, Spark avoids the overhead of recomputing it for each subsequent operation, which can dramatically speed up iterative and interactive Spark applications.
- Reduced data loading time: when a DataFrame is cached and/or persisted, it stays in memory, so if the DataFrame is needed again Spark can access it quickly without reloading the data from its source.
- Efficient use of resources: caching and storing DataFrames in memory lets Spark use cluster resources more efficiently by avoiding recomputing the same DataFrame many times.
- Better fault tolerance: when a DataFrame is persisted, each node keeps its partitioned data in memory, and Spark’s persisted data is fault-tolerant: if any partition is lost, it is recomputed automatically using the original transformations that created it.
- Optimization of Spark’s DAG (Directed Acyclic Graph) execution: when Spark computes a DataFrame, it constructs a DAG of stages, and caching and persisting can help Spark optimize the execution of these stages (illustrated below).
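To see the last point concretely, here is a hedged sketch: after a DataFrame is cached and materialized, its physical plan (as printed by explain()) typically reads from the in-memory relation instead of recomputing the original scan (the exact plan text varies by Spark version):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100000).withColumn("doubled", F.col("id") * 2)
df.cache()
df.count()      # materialize the cache

# The physical plan should now refer to the cached (in-memory) data
df.explain()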
Spark DataFrame Cache And Persist StorageLevel
To cache a DataFrame, you can use the cache() or persist() method. The PySpark cache() method stores the data with the default storage level (MEMORY_AND_DISK for DataFrames), while the PySpark persist() method lets you specify the storage level explicitly. For example, you can use StorageLevel.MEMORY_ONLY to store the data in memory only, or StorageLevel.MEMORY_AND_DISK to store it in memory and spill to disk.
Here are some examples of cache and persist operations on a Spark DataFrame:
- Cache a DataFrame with the default storage level:
df.cache()
df.persist()
- Persist a DataFrame in memory and disk:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
- Persist a DataFrame in memory and disk with serialization (this level is part of the Scala/Java API; on the Python side data is always stored serialized, so the _SER variants may not be exposed in your PySpark version):
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
- Unpersist a DataFrame:
df.unpersist()
Note: cache and persist are used to keep the DataFrame in memory so that it can be reused across multiple operations, instead of being re-read from disk or recomputed every time it is used.
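A small, hedged way to check whether a DataFrame is currently marked as cached, and at which level, is to use the is_cached and storageLevel properties (the spark.range() DataFrame here is just an example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

df.persist()
df.count()                 # materialize the cache
print(df.is_cached)        # True once the DataFrame is marked as cached
print(df.storageLevel)     # the StorageLevel in effect for this DataFrame

df.unpersist()
print(df.is_cached)        # False after unpersisting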
PySpark Cache Temp View
In PySpark, you can cache a temporary view by first creating it with the createOrReplaceTempView() method and then calling the cacheTable() method on the SparkSession’s catalog. Here is an example:
# Create a DataFrame
df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"])
# Create a temporary view
df.createOrReplaceTempView("people")
# Cache the view
spark.catalog.cacheTable("people")
This will cache the “people” view in memory, so that subsequent queries on the view will be faster.
You can also use the spark.sql() method to cache the view. Here is an example:
# Create a DataFrame
df = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Mike", 35)], ["id", "name", "age"])
# Create a temporary view
df.createOrReplaceTempView("people")
# Cache the view
spark.sql("CACHE TABLE people")
Please note that spark.catalog.cacheTable() is lazy: the data is actually cached the first time the view is queried. The SQL CACHE TABLE statement, on the other hand, is eager by default and materializes the cache immediately (use CACHE LAZY TABLE for the lazy behaviour). Both approaches cache the actual data of the view, equivalent to calling cache() or persist() on the underlying DataFrame.
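As a hedged sketch of checking and clearing a cached view (reusing the "people" view from the examples above):
# Check whether the "people" view is currently cached
print(spark.catalog.isCached("people"))

# The SQL statement also has a lazy variant that defers materialization
# to the first query against the view
spark.sql("CACHE LAZY TABLE people")

# Drop the view from the cache when it is no longer needed
spark.catalog.uncacheTable("people")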
Summary
Spark caching is a technique for storing data in memory for faster access. Caching a DataFrame causes Spark to keep the data in memory as immutable, read-only blocks. This lets Spark retrieve the data quickly without having to read it from disk or recompute it, which can significantly improve the speed of your Spark jobs.
It’s important to note that cached DataFrames are not automatically unpersisted: when memory runs low, Spark evicts cached partitions in least-recently-used (LRU) order, but otherwise the data stays cached for the lifetime of the application. If you are running many jobs with huge DataFrames, you may need to call unpersist() manually on the DataFrames you are no longer using to free up memory.
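For example, a small cleanup sketch (df and spark stand for the DataFrame and session from the earlier examples; blocking=True waits until the blocks are actually removed):
# Drop a single DataFrame from the cache and wait for its blocks to be freed
df.unpersist(blocking=True)

# Or drop every cached table/DataFrame in the current session
spark.catalog.clearCache()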
Could You Please Share This Post?
I appreciate It And Thank YOU! :)
Have A Nice Day!