In this post I will show you how to check the PySpark version using the CLI and PySpark code in a Jupyter notebook. When we create an application that will run on a cluster, we first need to know which Spark version the cluster uses, so that our dependencies are compatible. Let's find the PySpark version!
Spark Application Properties
Knowing the Spark version gives you important information and can answer the following questions:
- Are the methods available in this version of the API?
- Can I use a given Spark application property or not?
Spark provides plenty of application properties. The default values are reasonable for most of the properties that control internal settings. The most common options are listed in the table below, and a short configuration sketch follows it:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.app.name | (none) | The name of your application. This will appear in the UI and in log data. | 0.9.0 |
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. | 1.3.0 |
spark.driver.maxResultSize | 1g | Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors. | 1.2.0 |
spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m, 2g). Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. | 1.1.1 |
spark.driver.memoryOverhead | driverMemory * spark.driver.memoryOverheadFactor, with minimum of 384 | Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%). This option is currently supported on YARN, Mesos and Kubernetes. Note: Non-heap memory includes off-heap memory (when spark.memory.offHeap.enabled=true) and memory used by other driver processes (e.g. the Python process that goes with a PySpark driver) and memory used by other non-driver processes running in the same container. The maximum memory size of the container running the driver is determined by the sum of spark.driver.memoryOverhead and spark.driver.memory. | 2.3.0 |
spark.driver.memoryOverheadFactor | 0.10 | Fraction of driver memory to be allocated as additional non-heap memory per driver process in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which default to 0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with “Memory Overhead Exceeded” errors. This preempts this error with a higher default. This value is ignored if spark.driver.memoryOverhead is set directly. | 3.3.0 |
spark.driver.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use on the driver. If this is used, you must also specify the spark.driver.resource.{resourceName}.discoveryScript for the driver to find the resource on startup. | 3.0.0 |
spark.driver.resource.{resourceName}.discoveryScript | None | A script for the driver to run to discover a particular resource type. This should write to STDOUT a JSON string in the format of the ResourceInformation class. This has a name and an array of addresses. For a client-submitted driver, the discovery script must assign different resource addresses to this driver compared to other drivers on the same host. | 3.0.0 |
spark.driver.resource.{resourceName}.vendor | None | Vendor of the resources to use for the driver. This option is currently only supported on Kubernetes and is actually both the vendor and domain following the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com) | 3.0.0 |
spark.resources.discoveryPlugin | org.apache.spark.resource.ResourceDiscoveryScriptPlugin | Comma-separated list of class names implementing org.apache.spark.api.resource.ResourceDiscoveryPlugin to load into the application. This is for advanced users to replace the resource discovery class with a custom implementation. Spark will try each class specified until one of them returns the resource information for that resource. It tries the discovery script last if none of the plugins return information for that resource. | 3.0.0 |
spark.executor.memory | 1g | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m , 2g ). | 0.7.0 |
spark.executor.pyspark.memory | Not set | The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python’s memory use and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests. Note: This feature is dependent on Python’s `resource` module; therefore, the behaviors and limitations are inherited. For instance, Windows does not support resource limiting and actual resource is not limited on MacOS. | 2.4.0 |
spark.executor.memoryOverhead | executorMemory * spark.executor.memoryOverheadFactor, with minimum of 384 | Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). This option is currently supported on YARN and Kubernetes. Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of the container running the executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory. | 2.3.0 |
spark.executor.memoryOverheadFactor | 0.10 | Fraction of executor memory to be allocated as additional non-heap memory per executor process. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which default to 0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with “Memory Overhead Exceeded” errors. This preempts this error with a higher default. This value is ignored if spark.executor.memoryOverhead is set directly. | 3.3.0 |
spark.executor.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use per executor process. If this is used, you must also specify the spark.executor.resource.{resourceName}.discoveryScript for the executor to find the resource on startup. | 3.0.0 |
spark.executor.resource.{resourceName}.discoveryScript | None | A script for the executor to run to discover a particular resource type. This should write to STDOUT a JSON string in the format of the ResourceInformation class. This has a name and an array of addresses. | 3.0.0 |
spark.executor.resource.{resourceName}.vendor | None | Vendor of the resources to use for the executors. This option is currently only supported on Kubernetes and is actually both the vendor and domain following the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com) | 3.0.0 |
spark.extraListeners | (none) | A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark’s listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception. | 1.3.0 |
spark.local.dir | /tmp | Directory to use for “scratch” space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. Note: This will be overridden by the SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager. | 0.5.0 |
spark.logConf | false | Logs the effective SparkConf as INFO when a SparkContext is started. | 0.9.0 |
spark.master | (none) | The cluster manager to connect to. See the list of allowed master URL’s. | 0.9.0 |
spark.submit.deployMode | (none) | The deploy mode of the Spark driver program, either “client” or “cluster”, which means to launch the driver program locally (“client”) or remotely (“cluster”) on one of the nodes inside the cluster. | 1.5.0 |
spark.log.callerContext | (none) | Application information that will be written into the Yarn RM log/HDFS audit log when running on Yarn/HDFS. Its length depends on the Hadoop configuration hadoop.caller.context.max.size. It should be concise, and typically can have up to 50 characters. | 2.2.0 |
spark.driver.supervise | false | If true, restarts the driver automatically if it fails with a non-zero exit status. Only has effect in Spark standalone mode or Mesos cluster deploy mode. | 1.3.0 |
spark.driver.log.dfsDir | (none) | Base directory in which Spark driver logs are synced, if spark.driver.log.persistToDfs.enabled is true. Within this base directory, each application logs the driver logs to an application-specific file. Users may want to set this to a unified location like an HDFS directory so driver log files can be persisted for later usage. This directory should allow any Spark user to read/write files and the Spark History Server user to delete files. Additionally, older logs from this directory are cleaned by the Spark History Server if spark.history.fs.driverlog.cleaner.enabled is true and they are older than the max age configured by setting spark.history.fs.driverlog.cleaner.maxAge. | 3.0.0 |
spark.driver.log.persistToDfs.enabled | false | If true, a Spark application running in client mode will write driver logs to a persistent storage, configured in spark.driver.log.dfsDir. If spark.driver.log.dfsDir is not configured, driver logs will not be persisted. Additionally, enable the cleaner by setting spark.history.fs.driverlog.cleaner.enabled to true in the Spark History Server. | 3.0.0 |
spark.driver.log.layout | %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n%ex | The layout for the driver logs that are synced to spark.driver.log.dfsDir. If this is not configured, it uses the layout for the first appender defined in log4j2.properties. If that is also not configured, driver logs use the default layout. | 3.0.0 |
spark.driver.log.allowErasureCoding | false | Whether to allow driver logs to use erasure coding. On HDFS, erasure coded files will not update as quickly as regular replicated files, so they may take longer to reflect changes written by the application. Note that even if this is true, Spark will still not force the file to use erasure coding, it will simply use file system defaults. | 3.0.0 |
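To make the table more concrete, here is a minimal sketch of setting a few of these properties when building a SparkSession from PySpark. The master URL, application name, and memory values are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

# A minimal sketch: set a few of the application properties listed above.
# Note: in client mode, spark.driver.memory must be passed via
# `spark-submit --driver-memory` instead of SparkConf (see the table).
spark = (
    SparkSession.builder
    .master("local[*]")                          # spark.master (illustrative)
    .appName("BigData-ETL.com")                  # spark.app.name
    .config("spark.executor.memory", "2g")       # spark.executor.memory
    .config("spark.driver.maxResultSize", "1g")  # spark.driver.maxResultSize
    .getOrCreate()
)

# Read a property back from the running context to confirm it was applied.
print(spark.sparkContext.getConf().get("spark.executor.memory"))  # 2g
```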
Spark Free Tutorials
This post is part of the Spark Free Tutorial series. Check out the rest of the Spark tutorials, which you can find in the right sidebar of this page! Stay tuned!
How To Check Spark Version Using CLI?
To check the Spark version, you can use the Command Line Interface (CLI).
```
$ spark-submit --version
$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Type --help for more information.
```
How To Check PySpark Version Using CLI?
To check the PySpark version, just run the pyspark client from the CLI with the following command:
```
$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Type --help for more information.
```
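If PySpark was installed as a pip package, you can also read the version from the package metadata without launching a Spark shell. This is just a small sketch using the Python standard library (it assumes Python 3.8+ and a pip-installed pyspark package):

```python
from importlib.metadata import version

# Reads the version of the installed pyspark package without starting a JVM.
print(version("pyspark"))  # e.g. 3.3.0
```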
Check Spark Version In Jupyter Notebook
Jupyter is an open-source software application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It is often used for data analysis, scientific computing, and machine learning.
Jupyter notebooks are interactive, meaning you can execute code and see the results directly in the document, as well as include text, images, and other media to explain and document your work. This makes Jupyter a popular choice for data scientists and researchers who want to share their work in an easy-to-understand and reproducible format.
To use Jupyter, you will need to install it on your computer. You can do this using the pip package manager by running the following command in your terminal:
```
pip install jupyter
```
Once Jupyter is installed, you can launch it by running the jupyter-notebook command in your terminal. This will open a new web browser window with the Jupyter user interface, where you can create and open notebooks.
Jupyter supports a wide range of programming languages, including Python, R, Julia, and many others. You can choose the language you want to use by selecting the appropriate kernel when creating a new notebook.
How To Check PySpark Version In Jupyter Notebook
You can check the PySpark version in a Jupyter notebook as well. Just create a new notebook and run the following snippet of code:
```python
import pyspark
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[*]") \
    .appName('BigData-ETL.com') \
    .getOrCreate()

print(f'The PySpark {spark.version} version is running...')
```
When you run the above code, you will get a response like the one in the picture below:
![How To Check Spark Version (PySpark, Jupyter Notebook)](https://bigdata-etl.com/wp-content/uploads/2022/09/image.png)
Code On Gitlab
You can find the code from this post on my Gitlab!
Summary
Now you know how to check the Spark and PySpark versions and use this information to provide the correct dependencies when you are creating applications that will run on the cluster. You should also know how to check the PySpark version in a Jupyter Notebook.
To check the version of PySpark in Jupyter, you can also use the pyspark.__version__ attribute or the spark.version property of an active SparkSession (as shown above). Both return a string containing the version of PySpark that is currently in use.
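As a quick sketch, the snippet below prints both values; it assumes the SparkSession named spark created earlier in this post is still available:

```python
import pyspark

# Version of the installed PySpark package (a plain string, e.g. "3.3.0").
print(pyspark.__version__)

# Version reported by the running SparkSession created above.
print(spark.version)
```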
PySpark Official Site
If you are more interested in PySpark, you should follow the official PySpark (Spark) website, which provides up-to-date information about Spark features.