PySpark / Spark GroupBy With DataFrame Or DataSet


In this post I will cover the topic of grouping data in PySpark / Spark with DataFrames and Datasets.

Introduction

SQL GroupBY

GROUP BY is a clause in SQL that is used to group rows in a result set based on one or more columns. It is typically used in conjunction with an aggregate function, such as SUM, COUNT, AVG, MIN, or MAX, to compute a value for each group of rows. Some common use cases of GROUP BY are:

  1. Data analysis and reporting: You can use GROUP BY to group data by one or more columns and then compute various statistics, such as the total sales, average price, or the number of customers for each group. This is useful for creating reports and dashboards that summarize data by different dimensions, such as product, category, region, or time.
  2. Finding patterns and trends: You can use GROUP BY to identify patterns and trends in your data by grouping it by different columns and computing statistics for each group. For example, you can group sales data by month, product, and region to find the best-selling products, the most profitable regions, or the trends in sales over time.
  3. Data cleaning and validation: You can use GROUP BY to check the integrity and consistency of your data by looking for outliers, errors, or missing values. For example, you can group data by a primary key column and check that there are no duplicate rows, or you can group data by a date column and check that there are no missing dates.
  4. Data transformation: You can use GROUP BY to reshape and pivot your data, by creating new columns and calculating values based on existing columns. For example, you can use GROUP BY to calculate the running total, moving average, or cumulative distribution of a column, or you could use window functions to create new columns based on existing ones.
  5. Data sampling: You can use GROUP BY to subset your data by a specific column, which can be useful when working with large datasets and testing new models or methods on smaller chunks of data.

These are just a few examples of how GROUP BY can be used in SQL, but the possibilities are endless and depend on the specific requirements of your project and the nature of your data. With the power of GROUP BY and the other clauses of SQL, you can perform a wide range of data analysis, manipulation and reporting tasks.
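As a minimal illustration of the SQL side, here is a GROUP BY with an aggregate run against an in-memory SQLite database (the table and column names are made up for the example, not taken from any real schema):

```python
import sqlite3

# In-memory database with a toy sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("apples", "north", 10.0),
     ("apples", "south", 5.0),
     ("pears", "north", 7.5)],
)

# GROUP BY with an aggregate function: total amount per product.
rows = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('apples', 15.0), ('pears', 7.5)]
```

The same pattern (group, then aggregate) carries over directly to Spark's DataFrame API, described next.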

GroupBy In Spark / PySpark

In Apache Spark, the groupBy() method is used to group the rows of a DataFrame or a Dataset based on the values in one or more columns. When you group a DataFrame or DataSet, the result is a new DataFrame or DataSet that contains one row for each unique value in the grouped column(s), and the values in the other columns are aggregated using one or more aggregation functions.

The groupBy() method is called on a DataFrame or a Dataset, and it’s passed one or more column names that you want to group the rows by. After the DataFrame or DataSet is grouped, you can call various aggregation functions such as count(), sum(), avg(), min(), max(), etc. on the grouped DataFrame or DataSet to compute statistics for each group.

For example, you can group a DataFrame of sales data by the “product” column to find the total sales for each product, or you can group a DataFrame of employee data by the “department” column to find the average salary for each department.

Grouping is a powerful way to perform complex data manipulation and analysis operations in Spark, and it’s a core feature of the Spark DataFrame and DataSet API. It’s also the first step in building more complex data processing pipelines, like joining, filtering, and windowing.

Spark GroupBy With DataFrame Or DataSet

Here’s an example of how to use the groupBy() method on a DataFrame in Spark to group the rows based on the values in one or more columns:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("GroupBy Example").getOrCreate()

# Create a DataFrame
data = [("Tom", 1, "tiger"),
        ("Jerry", 2, "elephant"),
        ("John", 3, "tiger"),
        ("Paul", 4, "elephant"),
        ("George", 5, "shark")]
columns = ["name", "age", "animal"]
df = spark.createDataFrame(data, columns)

# Group the DataFrame by the "animal" column
grouped_df = df.groupBy("animal").count()

# Show the results
grouped_df.show()

This will output the following DataFrame:

+--------+-----+
|  animal|count|
+--------+-----+
|   tiger|    2|
|   shark|    1|
|elephant|    2|
+--------+-----+

The groupBy() method is called on the DataFrame, and it’s passed the column that we want to group the rows by. In this case, it’s the “animal” column. After the DataFrame is grouped, we can call the count() method on the grouped DataFrame to count the number of rows in each group. The result is a new DataFrame with two columns: “animal” and “count”. The “animal” column contains the unique values from the original “animal” column, and the “count” column contains the number of rows in each group.

Spark GroupBy Multiple Columns

Spark GroupBy Count

You can also group by multiple columns: just pass as many column names as you want to groupBy(), and it will group by all of them. You can likewise use any aggregation function, such as sum(), avg(), min(), or max(), in place of count().

grouped_df = df.groupBy("animal","age").count()
grouped_df.show()

This produces a DataFrame with three columns: animal, age, and count, where count contains the number of rows for each (animal, age) combination.

Spark GroupBy Sum

In Apache Spark, you can use the sum() function to compute the sum of all the values in a column of a grouped DataFrame or DataSet. The sum() function can be used in conjunction with the groupBy() method to group the rows of a DataFrame or DataSet by one or more columns, and then compute the sum of the values in a specific column for each group.

Here’s an example of how to use the groupBy() and sum() methods to group a DataFrame by the “animal” column and compute the sum of the values in the “age” column for each group:

# Import the aggregate function
# (note: this shadows Python's built-in sum)
from pyspark.sql.functions import sum

# Group the DataFrame by the "animal" column
# and compute the sum of the "age" column
grouped_df = df.groupBy("animal").agg(sum("age").alias("total_age"))

# Show the results
grouped_df.show()

This will output the following DataFrame:

+--------+---------+
|  animal|total_age|
+--------+---------+
|elephant|        6|
|   shark|        5|
|   tiger|        4|
+--------+---------+

The Spark groupBy() method is called on the DataFrame and passed the “animal” column that we want to group the rows by. After the DataFrame is grouped, we call the agg() function with sum("age") and alias the resulting column as total_age.

The sum() function is called on the “age” column, which computes the sum of all the ages for each group. The result is a new DataFrame with two columns: “animal” and “total_age”. The “animal” column contains the unique values from the original “animal” column, and the “total_age” column contains the sum of ages for each group.

You can use any other aggregation function of your choice in place of sum(), like avg(), min(), max(), count(), etc. You can also pass multiple columns to groupBy() and multiple aggregation functions to agg() to perform more complex analysis on the data.

Spark GroupBy COLLECT_LIST

In Apache Spark, the collect_list() function is used to aggregate the values in a column of a grouped DataFrame or DataSet into a list. The collect_list() function can be used in conjunction with the groupBy() method to group the rows of a DataFrame or DataSet by one or more columns, and then collect the values in a specific column into a list for each group.

Here’s an example of how to use the groupBy() and collect_list() methods to group a DataFrame by the “animal” column and collect the values in the “name” column into a list for each group:

# Import the aggregate function
from pyspark.sql.functions import collect_list

# Group the DataFrame by the "animal" column
# and collect the values in the "name" column into a list
grouped_df = df.groupBy("animal").agg(collect_list("name").alias("name_list"))

# Show the results
grouped_df.show()

This will output the following DataFrame:

+--------+-------------+
|  animal|    name_list|
+--------+-------------+
|   tiger|  [Tom, John]|
|   shark|     [George]|
|elephant|[Jerry, Paul]|
+--------+-------------+

The Spark groupBy() method is called on the DataFrame and passed the “animal” column that we want to group the rows by. After the DataFrame is grouped, we call the agg() function with collect_list("name") and alias the resulting column as name_list.

The collect_list() function is called on the “name” column, which collects all the names for each group and returns a list. The result is a new DataFrame with two columns: “animal” and “name_list”. The “animal” column contains the unique values from the original “animal” column, and the “name_list” column contains the list of names for each group.

You can use collect_set() in the same way if you want a set of unique names instead of a list: duplicates within each group are removed.

Summary

In this post we discussed the usage of Spark groupBy() method in Apache Spark, which is used to group the rows of a DataFrame or a Dataset based on the values in one or more columns. We discussed examples of using groupBy() with various aggregation functions like Spark count(), sum(), Spark collect_list() and also how it can be used to do complex data manipulation and analysis operations in Spark. We also talked about some common use cases of group by in SQL like data analysis and reporting, finding patterns and trends, data cleaning and validation, data transformation and data sampling.
