# Scenario
There are two ways to handle row duplication in PySpark DataFrames.
The distinct() function drops duplicate rows (considering all columns) from a DataFrame, while dropDuplicates() can also drop rows based on one or more selected columns.
Here is an example showing how to use the distinct() and dropDuplicates() methods.
First, we create a sample DataFrame.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProjectPro").getOrCreate()
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output: image one (printed schema and the 10-row table)
The record with the employee name James is duplicated in the table above. As we can see, there are two rows with duplicate values in all columns, and four rows with duplicate values in the department and salary columns.
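As a quick sanity check of those counts, here is a plain-Python sketch (no Spark needed) over the same sample records, counting rows duplicated across all columns versus rows that merely share a (department, salary) pair:

```python
from collections import Counter

# The same sample records used in the Spark example.
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]

# Rows duplicated in every column (what distinct() collapses).
full_counts = Counter(rows)
full_dupes = [r for r in rows if full_counts[r] > 1]

# Rows sharing a (department, salary) pair with another row
# (what dropDuplicates(["department", "salary"]) collapses).
pair_counts = Counter((dept, sal) for _, dept, sal in rows)
pair_dupes = [r for r in rows if pair_counts[(r[1], r[2])] > 1]

print(len(full_dupes))  # 2 rows duplicated in all columns
print(len(pair_dupes))  # 4 rows duplicated in department and salary
```

The two fully identical rows are the James entries; the four department/salary duplicates are the two (Sales, 3000) rows plus the two (Sales, 4100) rows.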
Below is the entire code for removing duplicate rows:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProjectPro").getOrCreate()
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

# Distinct: drops rows duplicated across all columns
distinctDF = df.distinct()
print("Distinct count: " + str(distinctDF.count()))
distinctDF.show(truncate=False)

# dropDuplicates with no arguments behaves like distinct()
df2 = df.dropDuplicates()
print("Distinct count: " + str(df2.count()))
df2.show(truncate=False)

# dropDuplicates on selected columns keeps one row per (department, salary) pair
dropDisDF = df.dropDuplicates(["department", "salary"])
print("Distinct count of department and salary: " + str(dropDisDF.count()))
dropDisDF.show(truncate=False)
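To see why the three counts come out the way they do, here is a plain-Python sketch of the same deduplication logic (illustrative only, not the Spark implementation), using the same sample data:

```python
# The same sample records used in the Spark example.
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]

# distinct() / dropDuplicates(): keep one copy of each fully identical row.
distinct_rows = list(dict.fromkeys(rows))

# dropDuplicates(["department", "salary"]): keep one row per pair.
seen = {}
for name, dept, sal in rows:
    seen.setdefault((dept, sal), (name, dept, sal))
by_pair = list(seen.values())

print(len(distinct_rows))  # 9  (one James duplicate removed)
print(len(by_pair))        # 8  (one Sales/3000 and one Sales/4100 row removed)
```

Note that which row survives a department/salary collision is not guaranteed in Spark, since dropDuplicates keeps an arbitrary row from each group on a distributed dataset; the sketch above simply keeps the first.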