Delete Duplicates in a PySpark DataFrame

#Scenario

There are two ways to handle row duplication in PySpark DataFrames.

The distinct() function in PySpark removes duplicate rows (considering all columns) from a DataFrame, while dropDuplicates() can also drop rows based on a chosen subset of columns.


Here’s an example showing how to use the distinct() and dropDuplicates() methods.


First, we need to create a sample DataFrame.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output: the printed schema followed by the full 10-row table.

The table above contains duplicates: the record for the employee James appears twice with identical values in all columns, and four rows share duplicate values in the department and salary columns.


Below is the entire code for removing duplicate rows:


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

# distinct(): drop rows duplicated across all columns
distinctDF = df.distinct()
print("Distinct count: " + str(distinctDF.count()))
distinctDF.show(truncate=False)

# dropDuplicates() with no arguments behaves like distinct()
df2 = df.dropDuplicates()
print("Distinct count: " + str(df2.count()))
df2.show(truncate=False)

# dropDuplicates() on selected columns
dropDisDF = df.dropDuplicates(["department", "salary"])
print("Distinct count of department & salary: " + str(dropDisDF.count()))
dropDisDF.show(truncate=False)

#sql #pyspark #analytics #window #apachespark #interviewpreparation #tips #bigdata #datascience #businessanalyst #dataanalytics #bigdataengineer #databricks #spark #deltalivetables #deltalake #dataengineer #ETL #transformation #analysis #extraction #load
