PySpark / Spark Distinct On Multiple Columns – Deep Dive Into Distinct() In 5 Min!

Spark Distinct On Multiple Columns

In this post I will show you how to execute Spark Distinct on Multiple Columns.

Introduction

Apache Spark

Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It was developed at UC Berkeley's AMPLab in the early 2010s and later donated to the Apache Software Foundation. Spark is designed to be fast, flexible, and easy to use, and it can be applied to a wide range of data processing tasks, including data ingestion, data cleansing, data transformation, machine learning, and data visualization.

Spark integrates closely with the Hadoop Distributed File System (HDFS) and is designed to be compatible with other big data tools in the Hadoop ecosystem. However, Spark does not require Hadoop: it can also work with other storage systems, such as Amazon S3, Apache Cassandra, and Apache Kafka.

Spark has several core components: Spark Core, Spark SQL, Spark Streaming (and its successor, Structured Streaming), MLlib for machine learning, and GraphX for graph processing.

Spark can be programmed using a variety of languages, including Python, Java, Scala, and R, and it supports a wide range of data sources such as JSON, Parquet, Avro, and JDBC.

Spark’s ability to handle iterative algorithms over large datasets and its in-memory data storage makes it ideal for machine learning and other data-intensive tasks that require a lot of compute power. It is widely used across industry and academia, and has a large and active community that continues to develop new tools and libraries for Spark.

What Does Distinct Mean?

The term “distinct” is used to refer to unique or distinct values in a set of data. In the context of databases and data processing, distinct is often used to filter out duplicate rows or values in a table, query, or dataset.

For example, if you have a table with a column called "name" that contains many duplicated names, you can use the "distinct" keyword to show only the unique names in that column. In SQL this is typically written as SELECT DISTINCT name FROM table, which returns every unique value present in the name column of that table.

In Apache Spark SQL, the distinct method is used to remove duplicate rows from a DataFrame or dataset. It returns a new DataFrame or dataset that contains only the unique rows from the original DataFrame or dataset. Another method, dropDuplicates(), can also be used in Spark which accepts one or more column names as arguments. If you pass one or more column names, the method removes duplicate rows based on the values in those columns. If you don’t pass any column names, the method removes duplicate rows based on all columns in the DataFrame.

Spark Distinct On Multiple Columns / PySpark Distinct

Spark Distinct()

In Apache Spark, you can use the distinct method to remove duplicate rows from a DataFrame or dataset. The method returns a new DataFrame or dataset that contains only the unique rows from the original DataFrame or dataset.

Here’s an example of how you can use the distinct method to remove duplicate rows from a DataFrame:

from pyspark.sql import SparkSession

# create a spark session
spark = SparkSession.builder.appName("distinct_example").getOrCreate()

# create a dataframe
data = [("Tom", 1), ("John", 2), ("Jerry", 3), ("John", 2), ("Jerry", 3), ("Jerry", 4)]
df = spark.createDataFrame(data, ["name", "age"])

# remove duplicate rows
distinct_df = df.distinct()

# show the distinct rows
distinct_df.show()

Spark dropDuplicates()

You can also use the dropDuplicates method to remove duplicate rows, like so:

# remove duplicate rows
distinct_df = df.dropDuplicates()

This method accepts one or more column names as arguments. If you pass one or more column names, the method removes duplicate rows based on the values in those columns. If you don’t pass any column names, the method removes duplicate rows based on all columns in the DataFrame.

It’s also possible to remove the duplicates based on specific columns, like so:

# remove duplicates based on the name column
distinct_df = df.dropDuplicates(['name'])

Note that only one row is kept per distinct name; which age value survives is arbitrary, since Spark keeps whichever row it encounters first for each name.

Spark Select Distinct

You can also use the select function to pick the specific columns you want before calling the distinct() function.

# remove duplicates based on the name column
distinct_df = df.select('name').distinct()

This will return only the distinct name column.

Spark Distinct Count

In Apache Spark, you can use the distinct() method to remove duplicates from a DataFrame or RDD (Resilient Distributed Dataset) and then use the count() method to get the number of remaining unique elements.

Here’s an example of how to use these methods to get the distinct count of elements in a DataFrame:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DistinctCountExample").getOrCreate()

# Create a DataFrame
data = [("Tom", 1), ("John", 2), ("Jerry", 3), ("Tom", 4), ("John", 5)]
df = spark.createDataFrame(data, ["name", "age"])

# Get the distinct count of elements in the "name" column
distinct_count = df.select("name").distinct().count()

print("Distinct count of names:", distinct_count)

This will output:

Distinct count of names: 3

You can do the same thing with an RDD; the difference is that distinct() applies to whole RDD elements rather than to a column:

# Create an RDD of names and get the distinct count of its elements
rdd = spark.sparkContext.parallelize(["Tom", "John", "Jerry", "Tom", "John"])
distinct_count = rdd.distinct().count()

Keep in mind that distinct() can be computationally expensive, as it requires shuffling data across the cluster to compare rows that live in different partitions.

Spark dropDuplicates() vs Distinct()

In Apache Spark, both the dropDuplicates() method and the distinct() method can be used to remove duplicates from a DataFrame or RDD (Resilient Distributed Dataset). However, they have some key differences:

  1. dropDuplicates(): This method removes duplicates based on the entire row by default. It also accepts one or more column names, in which case only those columns are considered when deciding whether two rows are duplicates, while all columns are kept in the result:
df = df.dropDuplicates(['name', 'age'])
  2. distinct(): This method removes duplicates based on all columns of the DataFrame it is called on. To deduplicate on a subset of columns, select them first:
df = df.select("name").distinct()

In terms of performance, the two are equivalent when deduplicating whole rows: distinct() is simply dropDuplicates() with no arguments, and both require a shuffle to compare rows across partitions. The practical advantage of dropDuplicates() is that it can deduplicate on a subset of columns while keeping every column in the result.

Spark DropDuplicates Performance

dropDuplicates is the better fit when you want to remove duplicate rows with respect to one or more columns while keeping all columns in the output, while distinct is useful when you want the unique rows over all columns (or over the columns you have selected).

Spark Distinct Vs GroupBy

In Apache Spark, both the distinct() method and the groupBy() method can be used to group data in a DataFrame or RDD (Resilient Distributed Dataset), but they have different purposes and produce different results.

  1. distinct(): This method removes duplicates from a DataFrame or RDD based on the selected columns (or all columns) and returns a new DataFrame or RDD containing only the unique rows. It is useful when you want to find the unique elements in a column or set of columns.
df = df.select("name").distinct()
  2. groupBy(): This method groups the data in a DataFrame based on the specified column or set of columns and returns a GroupedData object. It is useful when you want to group the data and perform aggregate operations (such as sum, count, etc.) on the groups.
grouped_df = df.groupBy("name").sum()

In summary, distinct() removes duplicate rows and returns the unique ones, while groupBy() groups the data by specific column(s) so that aggregate operations can be run on each group.

Also keep in mind that groupBy() is more flexible and powerful, because it allows you to perform any aggregate operation on each group, while distinct() only removes duplicates and returns unique elements.

Summary

The distinct method in Apache Spark is used to remove duplicate rows from a DataFrame or dataset. The method returns a new DataFrame or dataset that contains only the unique rows from the original DataFrame or dataset.

You can also use the dropDuplicates method to remove duplicate rows. This method accepts one or more column names as arguments. If you pass one or more column names, the method removes duplicate rows based on the values in those columns. If you don’t pass any column names, the method removes duplicate rows based on all columns in the DataFrame.

You can also chain the select function to select the specific columns you want in the distinct rows before calling the distinct() function.

Could You Please Share This Post? 
I appreciate It And Thank YOU! :)
Have A Nice Day!
