In this post I will present the topic: PySpark / Spark GroupBy with DataFrame or Dataset!
Introduction
SQL GROUP BY
GROUP BY is a clause in SQL that is used to group rows in a result set based on one or more columns. It is typically used in conjunction with an aggregate function, such as SUM, COUNT, AVG, MIN, or MAX, to compute a value for each group of rows. Some common use cases of GROUP BY are:
- Data analysis and reporting: You can use GROUP BY to group data by one or more columns and then compute various statistics, such as the total sales, average price, or the number of customers for each group. This is useful for creating reports and dashboards that summarize data by different dimensions, such as product, category, region, or time.
- Finding patterns and trends: You can use GROUP BY to identify patterns and trends in your data by grouping it by different columns and computing statistics for each group. For example, you can group sales data by month, product, and region to find the best-selling products, the most profitable regions, or the trends in sales over time.
- Data cleaning and validation: You can use GROUP BY to check the integrity and consistency of your data by looking for outliers, errors, or missing values. For example, you can group data by a primary key column and check that there are no duplicate rows, or you can group data by a date column and check that there are no missing dates.
- Data transformation: You can use GROUP BY to reshape and pivot your data by creating new columns and calculating values based on existing columns. For example, you can use GROUP BY to calculate the running total, moving average, or cumulative distribution of a column, or you can use window functions to create new columns based on existing ones.
- Data sampling: You can use GROUP BY to subset your data by a specific column, giving you a random sample of your data to work with. This can be useful when working with large datasets and testing new models or methods on smaller chunks of data.
These are just a few examples of how GROUP BY can be used in SQL, but the possibilities are endless and depend on the specific requirements of your project and the nature of your data. With the power of GROUP BY and the other clauses of SQL, you can perform a wide range of data analysis, manipulation, and reporting tasks.
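To make this concrete, here is a minimal sketch of a GROUP BY query run through Spark SQL. The sales table, its columns, and its values are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL GroupBy Example").getOrCreate()

# A small, hypothetical sales table just for illustration
sales = spark.createDataFrame(
    [("Laptop", "North", 1200.0), ("Laptop", "South", 900.0),
     ("Phone", "North", 600.0), ("Phone", "North", 650.0)],
    ["product", "region", "amount"],
)
sales.createOrReplaceTempView("sales")

# GROUP BY with aggregate functions, executed as a SQL query
spark.sql("""
    SELECT product, COUNT(*) AS num_sales, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product
""").show()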
GroupBy In Spark / PySpark
In Apache Spark, the groupBy() method is used to group the rows of a DataFrame or a Dataset based on the values in one or more columns. When you group a DataFrame or Dataset, the result is a new DataFrame or Dataset that contains one row for each unique value in the grouped column(s), and the values in the other columns are aggregated using one or more aggregation functions.
The groupBy() method is called on a DataFrame or a Dataset, and it’s passed one or more column names that you want to group the rows by. After the DataFrame or Dataset is grouped, you can call various aggregation functions such as count(), sum(), avg(), min(), max(), etc. on the grouped DataFrame or Dataset to compute statistics for each group.
For example, you can group a DataFrame of sales data by the “product” column to find the total sales for each product, or you can group a DataFrame of employee data by the “department” column to find the average salary for each department.
Grouping is a powerful way to perform complex data manipulation and analysis operations in Spark, and it’s a core feature of the Spark DataFrame and Dataset API. It’s also the first step in building more complex data processing pipelines, like joining, filtering, and windowing.
Spark GroupBy With DataFrame Or Dataset
Here’s an example of how to use the groupBy() method on a DataFrame in Spark to group the rows based on the values in one or more columns:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("GroupBy Example").getOrCreate()

# Create a DataFrame
data = [("Tom", 1, "tiger"), ("Jerry", 2, "elephant"), ("John", 3, "tiger"),
        ("Paul", 4, "elephant"), ("George", 5, "shark")]
columns = ["name", "age", "animal"]
df = spark.createDataFrame(data, columns)

# Group the DataFrame by the "animal" column
grouped_df = df.groupBy("animal").count()

# Show the results
grouped_df.show()
This will output the following DataFrame:
+---------+-----+
|   animal|count|
+---------+-----+
|    tiger|    2|
|    shark|    1|
| elephant|    2|
+---------+-----+
The groupBy() method is called on the DataFrame, and it’s passed the column that we want to group the rows by. In this case, it’s the “animal” column. After the DataFrame is grouped, we can call the count() method on the grouped DataFrame to count the number of rows in each group. The result is a new DataFrame with two columns: “animal” and “count”. The “animal” column contains the unique values from the original “animal” column, and the “count” column contains the number of rows in each group.
Spark GroupBy Multiple Columns
Spark GroupBy Count
You can also group by multiple columns: just pass as many column names as you want to the Spark groupBy function, and it will group by all the provided columns. And you can use any aggregation function such as sum, avg, min, max, etc. in place of count.
grouped_df = df.groupBy("animal","age").count() grouped_df.show()
This will give a DataFrame with three columns: animal, age, and count, where count contains the number of rows for each (animal, age) combination.
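With the sample data defined earlier, every (animal, age) pair happens to be unique, so each group has a count of 1 (the row order of show() may vary):

+---------+---+-----+
|   animal|age|count|
+---------+---+-----+
|    tiger|  1|    1|
| elephant|  2|    1|
|    tiger|  3|    1|
| elephant|  4|    1|
|    shark|  5|    1|
+---------+---+-----+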
Spark GroupBy Sum
In Apache Spark, you can use the sum() function to compute the sum of all the values in a column of a grouped DataFrame or Dataset. The sum() function can be used in conjunction with the groupBy() method to group the rows of a DataFrame or Dataset by one or more columns, and then compute the sum of the values in a specific column for each group.
Here’s an example of how to use the groupBy() and sum() methods to group a DataFrame by the “animal” column and compute the sum of the values in the “age” column for each group:
# Import the sum aggregate function
# (this shadows Python's built-in sum, which would not work here)
from pyspark.sql.functions import sum

# Group the DataFrame by the "animal" column
# and compute the sum of the "age" column
grouped_df = df.groupBy("animal").agg(sum("age").alias("total_age"))

# Show the results
grouped_df.show()
This will output the following DataFrame:
+---------+---------+
|   animal|total_age|
+---------+---------+
| elephant|        6|
|    shark|        5|
|    tiger|        4|
+---------+---------+
The Spark groupBy() method is called on the DataFrame, and it’s passed the “animal” column that we want to group the rows by. After the DataFrame is grouped, we call the agg() function with sum("age") and give the resulting column the alias total_age. The sum() function is called on the “age” column, which computes the sum of all the ages for each group. The result is a new DataFrame with two columns: “animal” and “total_age”. The “animal” column contains the unique values from the original “animal” column, and the “total_age” column contains the sum of ages for each group.
You can use any other aggregation function of your choice in place of sum, like avg(), min(), max(), count(), etc. You can also pass multiple columns to the groupBy function and use multiple aggregation functions in the agg function to perform more complex analysis on the data.
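For instance, here is a minimal sketch (reusing the df from above) that groups by “animal” and computes several aggregates in a single agg() call; the column aliases are chosen just for illustration:

from pyspark.sql.functions import avg, count, max

# Group by "animal" and compute several aggregates at once
stats_df = df.groupBy("animal").agg(
    count("name").alias("num_animals"),
    avg("age").alias("avg_age"),
    max("age").alias("max_age"),
)
stats_df.show()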
Spark GroupBy COLLECT_LIST
In Apache Spark, the collect_list() function is used to aggregate the values in a column of a grouped DataFrame or Dataset into a list. The collect_list() function can be used in conjunction with the groupBy() method to group the rows of a DataFrame or Dataset by one or more columns, and then collect the values in a specific column into a list for each group.
Here’s an example of how to use the groupBy() and collect_list() methods to group a DataFrame by the “animal” column and collect the values in the “name” column into a list for each group:
# Import the collect_list aggregate function
from pyspark.sql.functions import collect_list

# Group the DataFrame by the "animal" column
# and collect the values in the "name" column into a list
grouped_df = df.groupBy("animal").agg(collect_list("name").alias("name_list"))

# Show the results
grouped_df.show()
This will output the following DataFrame:
+---------+-------------+
|   animal|    name_list|
+---------+-------------+
|    tiger|  [Tom, John]|
|    shark|     [George]|
| elephant|[Jerry, Paul]|
+---------+-------------+
The Spark groupBy() method is called on the DataFrame, and it’s passed the “animal” column that we want to group the rows by. After the DataFrame is grouped, we call the agg() function with collect_list("name") and give the resulting column the alias name_list. The collect_list() function is called on the “name” column, which collects all the names for each group and returns them as a list. The result is a new DataFrame with two columns: “animal” and “name_list”. The “animal” column contains the unique values from the original “animal” column, and the “name_list” column contains the list of names for each group.
You could use collect_set similarly if you want a set of distinct names instead of a list.
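For example, here is a minimal sketch swapping in collect_set(); with this sample data the result looks the same as collect_list(), because there are no duplicate names within any group:

from pyspark.sql.functions import collect_set

# collect_set() returns each distinct name at most once per group
grouped_df = df.groupBy("animal").agg(collect_set("name").alias("name_set"))
grouped_df.show()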
Summary
In this post we discussed the Spark groupBy() method in Apache Spark, which is used to group the rows of a DataFrame or a Dataset based on the values in one or more columns. We walked through examples of using groupBy() with various aggregation functions like count(), sum(), and collect_list(), and saw how it can be used to perform complex data manipulation and analysis operations in Spark. We also talked about some common use cases of GROUP BY in SQL, like data analysis and reporting, finding patterns and trends, data cleaning and validation, data transformation, and data sampling.
Could You Please Share This Post?
I appreciate It And Thank YOU! :)
Have A Nice Day!