In this post I will show you how to convert a Pandas DataFrame to a Spark DataFrame.
Introduction
Python
Python" is a high-level, general-purpose programming language that debuted in 1991. It is extensively used for a broad range of tasks like web development, scientific computing, data analysis, artificial intelligence, and others.
Python’s appeal stems in part from its simplicity and ease of usage. Python" code is simple to read and understand, making it an excellent choice for both new and experienced programmers. Python" also has a big and active community that contributes numerous libraries, modules, and frameworks to the language’s utility and versatility.
Pandas
Pandas is a powerful and adaptable open-source data manipulation and analysis package for Python. It offers data structures and tools for processing and analysing structured data, such as spreadsheets and SQL tables, in a manner similar to R’s data frames.
The Series and DataFrame" are the two major data structures in Pandas. The DataFrame" is a two-dimensional table-like object that can carry various data types, whereas the Series is a one-dimensional array-like object that can hold any data type.
Pandas has a wide range of data manipulation techniques, such as data indexing and slicing, data reshaping, data combining, and data aggregation. It also has a robust collection of functions for dealing with missing data, working with time series data, and reading from and writing to CSV, Excel, JSON, and SQL formats.
Pandas also works well with other popular Python data science and machine learning libraries such as NumPy, Scikit-learn, and Matplotlib.
Pandas DataFrame
A DataFrame" is a two-dimensional, size-mutable, heterogeneous tabular data format with named axes in pandas (rows and columns). It may be compared to a spreadsheet or SQL" table, as well as a dict-like container for Series objects.
A DataFrame" is made up of one or more Series objects, each of which represents a DataFrame column". The Series objects are aligned along the DataFrame’s index (rows) and share a common index.
A DataFrame" can be created from a variety of data sources, such as:
A DataFrame" can be manipulated using various methods such as:
- indexing, slicing and filtering to select and extract subsets of data
- adding and dropping columns
- sorting and reordering rows
- merging and joining with other DataFrames
- groupby and aggregation to compute statistics and group data
- reshaping, pivoting and melting to change the layout of the data
- handling missing data
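Here is a short sketch of a few of these operations on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['x', 'x', 'y'],
    'value': [10, 20, 30],
})

# Filtering: select rows where value exceeds 10
subset = df[df['value'] > 10]

# Adding a column
df['double'] = df['value'] * 2

# Sorting rows
df = df.sort_values('value', ascending=False)

# Grouping and aggregation
totals = df.groupby('group')['value'].sum()
print(totals)
```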
DataFrames in Pandas are a key tool for data analysis and manipulation in Python, providing a powerful and versatile way to interact with structured data.
Spark DataFrame
In Apache Spark", a DataFrame" is a distributed collection of data structured into named columns. It is conceptually similar to a table in a relational database" or a data frame in R/Python, but with distributed processing improvements and the potential to scale to massive data.
Spark DataFrames are built on the Spark" RDD (Resilient Distributed Dataset) API, which serves as an abstraction for distributed data processing. The DataFrame" API is comparable to the pandas DataFrame" API in that it provides for the quick and fast processing of huge amounts of structured data.
A Spark DataFrame" can be created from a variety of data sources, such as:
- a local or remote file (e.g. CSV, JSON, Parquet)
- a Hive table
- an RDD
- a pandas DataFrame" (using the
createDataFrame()
method)
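For example (a sketch only; the file paths are placeholders you would replace with your own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateDataFrames").getOrCreate()

# From a CSV file (placeholder path)
csv_df = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)

# From a JSON file (placeholder path)
json_df = spark.read.json("/path/to/data.json")

# From an RDD of tuples, supplying column names
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
rdd_df = spark.createDataFrame(rdd, ["id", "label"])
```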
A Spark DataFrame" can be manipulated using various methods, such as:
- selecting and filtering data
- adding and dropping columns
- aggregating and grouping data
- joining and merging with other DataFrames
- performing SQL-like operations using the DataFrame API or Spark SQL
- performing machine learning operations using the MLlib library
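As a sketch of the SQL-style workflow (the view name and data here are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameOps").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])

# DataFrame API: filter, group, and count
df.filter(df.id > 1).groupBy("label").count().show()

# Spark SQL: register a temporary view and query it with SQL
df.createOrReplaceTempView("example")
spark.sql("SELECT label, COUNT(*) AS n FROM example GROUP BY label").show()
```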
Spark DataFrames are intended for large-scale, distributed data processing, and they harness the power of Spark’s cluster computing capabilities to perform data operations significantly faster than single-machine frameworks such as pandas.
Pandas DataFrame Vs Spark DataFrame
Pandas DataFrames and Spark" DataFrames are both data structures for managing and processing tabular data, however they differ significantly:
- Scale: One of the primary distinctions is the amount of data they can process. Pandas DataFrames are designed for datasets that fit into memory on a single machine, whereas Spark DataFrames can handle considerably larger datasets by leveraging Apache Spark’s distributed computing capabilities.
- Memory usage: Because Spark DataFrames are distributed across a cluster, they consume less memory per node than Pandas DataFrames, which must be loaded entirely into memory on a single machine.
- Performance: Thanks to distributed computing, Spark DataFrames can perform operations on large-scale datasets significantly faster than Pandas DataFrames. However, because of the overhead of distributed processing, Pandas DataFrames are generally faster for small to medium-sized datasets.
- API: Pandas and Spark DataFrames offer comparable APIs, but Spark DataFrames add distributed computing controls such as repartition() and coalesce() for regulating data distribution across the cluster, and persist() for caching intermediate results (a brief sketch follows this list).
- Language support: Pandas is only available for Python, whereas Spark supports multiple languages such as Python, Java, and Scala.
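As a small sketch of those distributed-computing controls (assuming df is an existing Spark DataFrame):

```python
# Redistribute the data into 8 partitions across the cluster (full shuffle)
df = df.repartition(8)

# Reduce the number of partitions without a full shuffle
df = df.coalesce(2)

# Cache intermediate results in memory for reuse across actions
df.persist()
df.count()      # the first action materializes the cache
df.unpersist()  # release the cached data when done
```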
Convert Pandas DataFrame To Spark DataFrame
To convert a pandas DataFrame to a Spark DataFrame, you can use the createDataFrame() method of the SparkSession class in the pyspark.sql module. Here is an example of how to do this:
```python
import pandas as pd
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PandasToSparkDF").getOrCreate()

# Create a pandas DataFrame
pdf = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Convert the pandas DataFrame to a Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Print the schema of the Spark DataFrame
sdf.printSchema()
```
Once you have the Spark DataFrame, you can use the DataFrame API to execute operations on it such as filtering, aggregating, and joining.
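For instance, continuing the example above:

```python
from pyspark.sql import functions as F

# Filter rows where x is greater than 1
sdf.filter(sdf.x > 1).show()

# Aggregate: sum of y grouped by x
sdf.groupBy("x").agg(F.sum("y").alias("y_sum")).show()
```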
Convert Pandas DataFrame To Spark DataFrame With Schema
To convert a pandas DataFrame to a Spark DataFrame with a specific schema, you can use the createDataFrame() method and pass in the pandas DataFrame as well as the desired schema. Here is an example of how to do this:
```python
import pandas as pd
from pyspark.sql import SparkSession
# Note: the schema classes live in pyspark.sql.types, not pyspark.sql
from pyspark.sql.types import StructType, StructField, IntegerType

# Create a SparkSession
spark = SparkSession.builder.appName("PandasToSparkDFWithSchema").getOrCreate()

# Create a pandas DataFrame
pdf = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Define the schema for the Spark DataFrame
schema = StructType([
    StructField("x", IntegerType()),
    StructField("y", IntegerType())
])

# Convert the pandas DataFrame to a Spark DataFrame with the specified schema
sdf = spark.createDataFrame(pdf, schema)
```
To create the schema, use the pyspark.sql.types module’s StructType and StructField classes, where StructType represents the schema as a whole and StructField represents a single column in the schema. Column data types are represented by classes such as IntegerType, StringType, DoubleType, and so on; DecimalType can be used for decimal numbers, TimestampType for timestamp values, and so on.
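For instance, a schema mixing several of these types might look like this (the column names are made up for illustration):

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType,
    StringType, DoubleType, TimestampType,
)

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType()),
    StructField("price", DoubleType()),
    StructField("created_at", TimestampType()),
])
```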
Once you have the Spark DataFrame with the given schema, you can use the DataFrame API methods to execute operations on it such as filtering, aggregating, and joining.
Convert Pandas DataFrame To Spark DataFrame Via JSON
Using spark.read.json() with a pandas DataFrame converted to a JSON string:
```python
import pandas as pd
from pyspark.sql import SparkSession

# Create a pandas DataFrame
pdf = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Convert the pandas DataFrame to a JSON Lines string (one JSON object per line)
pdf_json = pdf.to_json(orient='records', lines=True)

# Create a SparkSession
spark = SparkSession.builder.appName("PandasToSparkDF").getOrCreate()

# spark.read.json() treats a plain string as a file path, so wrap the
# JSON lines in an RDD of strings before reading them as a DataFrame
sdf = spark.read.json(spark.sparkContext.parallelize(pdf_json.splitlines()))

# Print the schema of the Spark DataFrame
sdf.printSchema()
```
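Note that this JSON round trip serializes the whole DataFrame through strings, so it is only practical for small data. For larger DataFrames, createDataFrame() is the better path; on Spark 3.x you can also speed up the pandas-to-Spark conversion by enabling Apache Arrow:

```python
# Enable Arrow-based columnar transfer for pandas <-> Spark conversions (Spark 3.x)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
sdf = spark.createDataFrame(pdf)
```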
Summary
In this post I showed you how to convert a pandas DataFrame to a Spark DataFrame. We discussed several approaches to this conversion, including the pyspark.sql module’s createDataFrame() method and spark.read.json() with a pandas DataFrame converted to a JSON string.
We also discussed how Pandas DataFrames and Spark DataFrames differ in terms of scale, memory usage, performance, API, and language support. Finally, we covered how to convert a pandas DataFrame into a Spark DataFrame with a specific schema.
Could You Please Share This Post?
I appreciate It And Thank YOU! :)
Have A Nice Day!