The first step of every Spark application is to create a SparkSession object, which holds details about your application and tells Spark how to access a cluster. Once you have a SparkSession, you can obtain the SparkContext and SQLContext instances that open up Spark's functionality to you.
At its core, every Spark application consists of a driver program that runs the user's main function and a number of tasks executed in parallel on a cluster.
This post is part of the Spark Free Tutorial. You can find the rest of the Spark tutorials in the sidebar of this page. Stay tuned!
What Is Apache Spark?
Apache Spark is an open-source data processing engine that is designed to handle large-scale data processing tasks such as data streaming, machine learning, and graph processing. It is fast and flexible, capable of performing computations much faster than traditional MapReduce programs due to its in-memory processing capability.
Spark has a wide range of libraries and APIs for various use cases and supports multiple programming languages including Python, Scala, Java, and R. It is also scalable with a distributed architecture that enables it to scale out to large clusters of machines and process large datasets efficiently.
One of its main features is its ability to process data in real-time, making it well-suited for streaming applications. It also has a rich ecosystem of connectors and integrations with popular storage systems and databases such as HDFS, S3, Cassandra, and HBase, allowing it to easily access and process data from various sources.
In summary, Apache Spark is a powerful and widely-used tool for data processing and analysis, used in a variety of applications in industry and academia including data pipelines, stream processing, machine learning, and more.
What Is PySpark?
PySpark is the Python API for Spark: a wrapper that lets you use Spark from Python. It exposes most of the same libraries and APIs available in the Scala and Java APIs, so you can leverage the power of Spark from the Python ecosystem. It is widely used for data processing and analysis tasks in Python.
Overall, Spark and PySpark are closely related and are often used together in data processing and analysis tasks. Spark is a powerful data processing engine, and PySpark is the Python API for Spark that allows you to use Spark from Python.
What Is RDD?
The primary abstraction Spark offers is the resilient distributed dataset (RDD): a collection of items partitioned across the nodes of the cluster that can be processed in parallel. RDDs are created either by parallelizing an existing collection in the driver program or by referencing a file in the Hadoop file system (or any other file system supported by Hadoop).
Additionally, users can ask Spark to keep an RDD in memory so that it can be reused efficiently across multiple parallel operations. RDDs also automatically recover from node failures.
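To make this concrete, here is a minimal sketch of both ways of creating an RDD, plus persisting one in memory. The local master, the app name, and the commented-out HDFS path are all assumptions for illustration, not part of any real deployment:

```scala
import org.apache.spark.sql.SparkSession

// Local session just for this sketch
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// 1) Parallelize an existing Scala collection in the driver program
val numbers = sc.parallelize(1 to 5)

// 2) Reference a file in an external storage system (placeholder path)
// val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

// Ask Spark to keep the RDD in memory for reuse across operations
numbers.cache()
val total = numbers.sum() // Double: 15.0
```

Calling `cache()` is only a hint; Spark materializes the RDD in memory the first time an action (here `sum()`) runs over it.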
Spark DataFrame And DataSet
A DataFrame is a distributed collection of data organized into rows and columns, similar to a table in a traditional relational database. It is an immutable distributed collection of data, which means that once created, you cannot modify the data in a DataFrame. DataFrames can be created from various sources, such as structured data files, tables in Hive, or external databases.
A Dataset is a more recent addition to the Spark API, and it is a type-safe, immutable distributed collection of data. It combines the benefits of RDDs (Resilient Distributed Datasets) and DataFrames, and it provides a higher-level API for working with structured data. A Dataset can be created from a DataFrame, and it can be transformed and manipulated using functional operations.
In general, DataFrames offer a more dynamic, untyped API that is available in all supported languages, while Datasets add compile-time type safety in Scala and Java and let you work with strongly typed objects. You can choose the appropriate API depending on your requirements and the specific use case.
Both DataFrames and Datasets are widely used in Spark applications for data processing and analysis, and they are an essential part of the Spark API. They provide a convenient and efficient way to work with structured data, and they are supported in a range of programming languages, including Scala, Java, Python, and R.
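As a short sketch of the difference (the case class and sample data below are made up for illustration): a DataFrame carries rows with a schema but is untyped at compile time, while a Dataset binds each row to a Scala type:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-ds-sketch")
  .getOrCreate()
import spark.implicits._

case class User(name: String, age: Int)

// DataFrame: rows plus a schema, checked only at runtime
val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

// Dataset: the same data, typed as User at compile time
val ds = df.as[User]

// Typed, functional transformation on the Dataset
val adults = ds.filter(_.age >= 30)
```

Note that `df.as[User]` only works because the column names match the case class fields; with a plain DataFrame you would filter on a column expression instead of a typed lambda.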
Spark Driver VS Spark Executor
Each Spark application must consist of:
- One Spark Driver application
- One or more Spark Executors
The Spark Driver is like a boss: it manages the whole application. It decides which part of the job will be done on which Executor, and it collects information from the Executors about task statuses.
The driver is the main process that coordinates the execution of a Spark job. It is responsible for creating the SparkContext, setting up the cluster, and scheduling tasks for execution on the executors. The driver also receives input data, sends it to the executors, and receives the results from the executors.
The Spark Executors are the processes that actually run the tasks assigned by the driver. They are launched on the worker nodes of a Spark cluster, execute the code of each task, and return the results to the driver.
The communication must be bidirectional. In the Hadoop world, when an application is submitted to YARN for acceptance, the requested resources are allocated: the Spark Driver is set up on one of the Hadoop nodes and the executors on the other nodes (the driver can also run on the same machine as one of the executors).
SparkSession VS SparkContext
What Is SparkSession?
SparkSession is the main object in Spark and the entry point of every Spark application. It is the central component that allows you to create DataFrames, access data sources, and perform various operations on structured data.
It is a singleton object that provides a simple and convenient interface for working with the Spark runtime and building Spark applications. You can use the SparkSession to create DataFrames, connect to data sources, and manipulate structured data, as well as access the SparkContext and the SparkConf for additional information about the Spark runtime and configuration options.
What Is SparkContext?
Spark Context is a singleton object in Apache Spark that represents the connection to a Spark cluster and allows you to interact with the Spark runtime. It is the starting point for building Spark applications, and it is responsible for establishing the connection to the Spark cluster, scheduling jobs, and managing the execution of tasks on the cluster. It also handles the distribution of data and computing tasks to the executors and provides access to RDDs, which are the fundamental data structure in Spark.
What Is SQLContext?
SQLContext, like SparkContext, is reachable as a variable on the SparkSession object and is used to execute operations on DataFrames and Datasets.
The SQLContext in Apache Spark is a singleton object that enables you to work with structured data using Spark SQL. It provides a simple and convenient interface for interacting with the Spark runtime and executing SQL queries on structured data.
The SQLContext is created from a SparkContext and is responsible for creating DataFrames, accessing data sources, and executing SQL queries. It also offers a variety of functions and methods for handling structured data, including the creation of DataFrames from structured data files or external database tables, the registration of temporary tables, and the execution of SQL queries on DataFrames.
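For example, once a DataFrame is registered as a temporary view, SQL can be executed against it through the session (in modern Spark, `spark.sql` covers what SQLContext historically did; the view name and sample data here are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-sketch")
  .getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

// Register a temporary view, then query it with SQL
people.createOrReplaceTempView("people")
val over30 = spark.sql("SELECT name FROM people WHERE age > 30")
```

The temporary view lives only as long as the session, so nothing is written to any external catalog.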
Create SparkSession Object
First of all, to start working with Spark we need to create a SparkSession instance. It takes just a few lines of code. That's all! Now you can start your journey with Apache Spark!
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]") // Spark master URL: "local" to run locally with one core, "local[4]" to run locally with 4 cores, "local[*]" to use all available cores, or "spark://master:7077" to run on a Spark standalone cluster
  .appName("BigData-ETL.com")
  .getOrCreate()
Spark Session VS Spark Context
The SparkSession is the primary way to work with structured data in Apache Spark, while the SparkContext is the main entry point for accessing Spark functionality and interacting with the Spark runtime. Both the SparkSession and the SparkContext are crucial parts of the Spark API and are commonly utilized in Spark applications for data processing and analysis.
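A quick sketch of that relationship: the session is built first, and the context comes from it (the app name and master below are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("session-vs-context")
  .getOrCreate()

// The SparkContext is created for you and exposed by the session
val sc = spark.sparkContext
println(sc.appName) // session-vs-context
println(sc.master)  // local[*]
```

In older code you may still see `new SparkContext(conf)` directly; since Spark 2.x, going through `SparkSession.builder()` is the recommended route.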
Remember: in general, the driver is the central process that controls the execution of a Spark job, while the executors are the worker processes that execute the tasks. The driver and the executors communicate with each other through a network connection, and they work together to process and analyze data in a distributed manner.
That's all about how to create a SparkSession in Scala. Now you are ready to write your first Spark application. Don't waste time, let's go to the next section!
Could You Please Share This Post? I appreciate It And Thank YOU! :) Have A Nice Day!