🚀 Delving into PySpark: Cleansing Data with Regex Magic! ✨
🔍 Example: Transforming Names with Special Characters 🎯
Picture yourself in the realm of data, where you’ve stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.
📌 Step 1️⃣: The Challenge
Imagine a dataset of Indian names, each potentially marred by special characters like !, @, or #. Your mission? To cleanse these names, removing the unwanted characters while preserving the beauty of Indian nomenclature.
🛠️ Step 2️⃣: PySpark's Code
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create a Spark session
spark = SparkSession.builder.appName("RegexNameCleansing").getOrCreate()

# Sample data with names containing special characters
data = [
    {"name": "R@hul K#mar"},
    {"name": "S@m!rtha P@tel"},
    {"name": "M!dhavi S#ngh"}
]

# Create DataFrame
df = spark.createDataFrame(data)

# Remove special characters from names using regex:
# the pattern [^a-zA-Z ] matches any character that is NOT a letter or a space
df_cleaned = df.withColumn("cleaned_name", F.regexp_replace(F.col("name"), "[^a-zA-Z ]", ""))
df_cleaned.show(truncate=False)
✨ Step 3️⃣: The Transformation
In the code above, PySpark's regexp_replace works its magic, delicately stripping away the unwanted special characters. 🧹
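If you want to sanity-check the pattern without spinning up Spark, here is a quick sketch using Python's standard re module, which applies the same character class `[^a-zA-Z ]` that regexp_replace uses above. One hedge worth noting: this substitution deletes the special character outright, so "R@hul" becomes "Rhul", not "Rahul"; restoring the intended letter would require a different mapping.

```python
import re

# Same character class as in the PySpark snippet:
# keep only ASCII letters and spaces, drop everything else.
PATTERN = r"[^a-zA-Z ]"

def clean_name(name: str) -> str:
    """Mirror F.regexp_replace(col, "[^a-zA-Z ]", "") in plain Python."""
    return re.sub(PATTERN, "", name)

for raw in ["R@hul K#mar", "S@m!rtha P@tel", "M!dhavi S#ngh"]:
    print(raw, "->", clean_name(raw))
# R@hul K#mar -> Rhul Kmar
# S@m!rtha P@tel -> Smrtha Ptel
# M!dhavi S#ngh -> Mdhavi Sngh
```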
#onestepanalytics #PySpark #DataCleansing #RegexMagic #DataTransformation #DataQuality #dataengineering