In this post, I will show you how to use Spark distinct on multiple columns.
Introduction
Apache Spark
Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It was developed at UC Berkeley’s AMPLab in the early 2010s and later donated to the Apache Software Foundation. Spark is designed to be fast, flexible, and easy to use, and it can be applied to a wide range of data processing tasks, including data ingestion, data cleansing, data transformation, machine learning, and data visualization.
Spark is built on top of the Hadoop Distributed File System (HDFS) and is designed to be compatible with other big data tools in the Hadoop ecosystem. However, Spark can also work with other storage systems, such as Amazon S3, Apache Cassandra, and Apache Kafka.
Spark has several core components:
- Spark Core: This is the foundation of Spark and provides the basic functionality for the other modules. It includes the Resilient Distributed Dataset (RDD) API, a fault-tolerant collection of elements that can be processed in parallel.
- Spark SQL: This module provides a programming interface for working with structured and semi-structured data. It lets you work with data in the form of DataFrames and Datasets, which are similar to tables in a relational database.
- Spark Streaming: This module allows you to process real-time streaming data, such as data from social media feeds, IoT devices, and logs.
- Spark MLlib: This module provides a library of machine learning algorithms that can be used to train models on big data.
- Spark GraphX: This module provides a library for graph processing and allows you to run complex graph algorithms on big data.
Spark can be programmed in a variety of languages, including Python, Java, Scala, and R, and it supports various data sources such as JSON, Parquet, Avro, and JDBC.
Spark’s ability to handle iterative algorithms over large datasets and its in-memory data storage make it ideal for machine learning and other data-intensive tasks that require a lot of compute power. It is widely used across industry and academia, and it has a large and active community that continues to develop new tools and libraries for Spark.
What Does Distinct Mean?
The term “distinct” refers to unique values in a set of data. In the context of databases and data processing, distinct is often used to filter out duplicate rows or values in a table, query, or dataset.
For example, if you have a table with a column called “name” that contains many duplicated names, you can use the DISTINCT keyword to show only the unique names in that column. In SQL this is typically written as SELECT DISTINCT name FROM table, which returns all the unique names present in the name column of that table.
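As a quick illustration, here is SELECT DISTINCT in action using Python’s built-in sqlite3 module and a made-up "people" table (the table and its contents are just for demonstration):

```python
import sqlite3

# in-memory database with a hypothetical "people" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)",
                 [("Tom",), ("John",), ("Jerry",), ("John",), ("Jerry",)])

# DISTINCT filters out the duplicate names
rows = conn.execute("SELECT DISTINCT name FROM people").fetchall()
print(sorted(r[0] for r in rows))  # → ['Jerry', 'John', 'Tom']
```

The five inserted rows contain only three unique names, so DISTINCT returns three rows.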
In Apache Spark SQL, the distinct() method is used to remove duplicate rows from a DataFrame or Dataset. It returns a new DataFrame or Dataset that contains only the unique rows from the original. Another method, dropDuplicates(), accepts one or more column names as arguments: if you pass column names, it removes duplicate rows based on the values in those columns; if you pass none, it removes duplicate rows based on all columns in the DataFrame.
Spark Distinct On Multiple Columns / PySpark Distinct
Spark Distinct()
In Apache Spark, you can use the distinct() method to remove duplicate rows from a DataFrame or Dataset. The method returns a new DataFrame or Dataset that contains only the unique rows from the original. Here’s an example of how you can use the distinct() method to remove duplicate rows from a DataFrame:
from pyspark.sql import SparkSession

# create a Spark session
spark = SparkSession.builder.appName("distinct_example").getOrCreate()

# create a DataFrame
data = [("Tom", 1), ("John", 2), ("Jerry", 3), ("John", 2), ("Jerry", 3), ("Jerry", 4)]
df = spark.createDataFrame(data, ["name", "age"])

# remove duplicate rows
distinct_df = df.distinct()

# show the distinct rows
distinct_df.show()
Spark dropDuplicates()
You can also use the dropDuplicates() method to remove duplicate rows, like so:
# remove duplicate rows
distinct_df = df.dropDuplicates()
This method accepts one or more column names as arguments. If you pass one or more column names, the method removes duplicate rows based on the values in those columns. If you don’t pass any column names, the method removes duplicate rows based on all columns in the DataFrame.
It’s also possible to remove the duplicates based on specific columns, like so:
# remove duplicates based on the name column
distinct_df = df.dropDuplicates(['name'])
This keeps one row per distinct name; duplicate names that differ only in age are collapsed into a single row, and Spark makes no guarantee about which of those rows is kept.
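The per-key deduplication semantics can be sketched in plain Python, without Spark (unlike this sketch, Spark does not guarantee that the first occurrence survives):

```python
# plain-Python sketch of dropDuplicates(['name']): keep one row per name
rows = [("Tom", 1), ("John", 2), ("Jerry", 3), ("John", 2), ("Jerry", 3), ("Jerry", 4)]

seen = set()
deduped = []
for name, age in rows:
    if name not in seen:      # first occurrence of this name wins here;
        seen.add(name)        # Spark makes no such ordering guarantee
        deduped.append((name, age))

print(deduped)  # → [('Tom', 1), ('John', 2), ('Jerry', 3)]
```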
Spark Select Distinct
You can also use the select() function to pick the specific columns you want before calling the distinct() function:
# remove duplicates based on the name column
distinct_df = df.select('name').distinct()
This returns a DataFrame containing only the distinct values of the name column.
Spark Distinct Count
In Apache Spark, you can use the distinct() method to remove duplicates from a DataFrame or RDD (Resilient Distributed Dataset) and then use the count() method to get the number of remaining unique elements.
Here’s an example of how to use these methods to get the distinct count of elements in a DataFrame:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DistinctCountExample").getOrCreate()

# Create a DataFrame
data = [("Tom", 1), ("John", 2), ("Jerry", 3), ("Tom", 4), ("John", 5)]
df = spark.createDataFrame(data, ["name", "age"])

# Get the distinct count of elements in the "name" column
distinct_count = df.select("name").distinct().count()
print("Distinct count of names:", distinct_count)
This will output:
Distinct count of names: 3
You can also use the distinct() method on a specific column of a DataFrame; this way it counts the distinct values of that column:
distinct_count = df.select("name").distinct().count()
You can do the same thing with an RDD. The difference is that you call the distinct() method on the whole RDD rather than on a specific column, and then call count():
# Get the distinct count of elements in the RDD
# (assuming `rdd` is an existing RDD, e.g. spark.sparkContext.parallelize(data))
distinct_count = rdd.distinct().count()
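Conceptually, a distinct count is just the size of the set of values; the same idea in plain Python looks like this:

```python
# plain-Python analogue of rdd.distinct().count()
values = ["Tom", "John", "Jerry", "Tom", "John"]

# building a set discards duplicates; its length is the distinct count
distinct_count = len(set(values))
print(distinct_count)  # → 3
```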
Keep in mind that using the distinct() method can be computationally expensive, as it requires shuffling the data across the cluster.
Spark dropDuplicates() vs Distinct()
In Apache Spark, both the dropDuplicates() method and the distinct() method can be used to remove duplicates from a DataFrame or RDD (Resilient Distributed Dataset). However, they have some key differences:
dropDuplicates(): This method removes duplicates based on the entire row of data. If multiple rows have the same values in all columns, all but one of those rows are removed. It also accepts an optional list of columns to consider when checking for duplicates:
df = df.dropDuplicates(['name','age'])
distinct(): This method removes duplicates based on all columns (or on whichever columns you have selected first). If multiple rows have the same values in those columns, all but one of those rows are removed:
df = df.select("name").distinct()
In terms of performance, the two are closely related: df.distinct() and df.dropDuplicates() with no arguments are equivalent, and both require shuffling the data. However, dropDuplicates() with a subset of columns can be cheaper than deduplicating on every column, since rows only need to be compared on the listed columns, which can matter on a large dataset.
Spark DropDuplicates Performance
dropDuplicates() is the better choice when you want to remove duplicate rows with respect to one or more specific columns, while distinct() is useful when you want the unique rows based on all columns (or based on a prior selection of columns).
Spark Distinct Vs GroupBy
In Apache Spark, both the distinct() method and the groupBy() method operate on groups of identical values in a DataFrame or RDD (Resilient Distributed Dataset), but they have different purposes and produce different results.
distinct(): This method removes duplicates from a DataFrame or RDD based on the selected columns (or all columns). It returns a new DataFrame or RDD containing only the unique rows. It is useful when you want to find the unique elements in a column or set of columns:
df = df.select("name").distinct()
groupBy(): This method groups the data in a DataFrame based on the specified column or set of columns and returns a GroupedData object. It is useful when you want to group the data and perform aggregate operations (such as sum, count, etc.) on the groups:
grouped_df = df.groupBy("name").sum()
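The grouped aggregation above can be sketched in plain Python with a dictionary of running sums (an illustration only; Spark performs this in parallel across partitions). The rows reuse the data from the distinct-count example earlier:

```python
# plain-Python sketch of df.groupBy("name").sum() over (name, age) rows
rows = [("Tom", 1), ("John", 2), ("Jerry", 3), ("Tom", 4), ("John", 5)]

sums = {}
for name, age in rows:
    # accumulate the age values for each name
    sums[name] = sums.get(name, 0) + age

print(sums)  # → {'Tom': 5, 'John': 7, 'Jerry': 3}
```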
In summary, the distinct() method is used to remove duplicate rows and get the unique rows, while the groupBy() method is used to group the data based on a specific column or set of columns and perform aggregate operations on the groups.
Also keep in mind that the groupBy() method is more flexible and powerful because it allows you to perform any aggregate operation on each group, while the distinct() method is only useful for removing duplicates and returning unique elements.
Summary
The distinct() method in Apache Spark is used to remove duplicate rows from a DataFrame or Dataset. The method returns a new DataFrame or Dataset that contains only the unique rows from the original.
You can also use the dropDuplicates() method to remove duplicate rows. This method accepts one or more column names as arguments: if you pass column names, it removes duplicate rows based on the values in those columns; if you pass none, it removes duplicate rows based on all columns in the DataFrame.
You can also chain the select() function to pick the specific columns you want before calling the distinct() function.
Could You Please Share This Post?
I appreciate It And Thank YOU! :)
Have A Nice Day!