PySpark: Cleansing Data with Regex

๐Ÿ” Delving into PySpark: Cleansing Data with Regex Magic!โš™๏ธ
๐ŸŒŸ Example: Transforming Names with Special Characters ๐Ÿš€

Picture yourself in the realm of data, where you’ve stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.

Step 1: The Challenge
Imagine a dataset of Indian names, each potentially marred by special characters like !, @, or #. Your mission? To cleanse these names, removing the unwanted characters while preserving the beauty of Indian nomenclature.
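Before reaching for Spark, the character class used later, `[^a-zA-Z ]` ("anything that is not a letter or a space"), can be prototyped with Python's built-in `re` module. A minimal sketch (the helper name `clean_name` is illustrative, not part of any library):

```python
import re

# Keep only ASCII letters and spaces; everything else is stripped.
PATTERN = re.compile(r"[^a-zA-Z ]")

def clean_name(raw: str) -> str:
    """Remove special characters, then collapse any leftover runs of spaces."""
    cleaned = PATTERN.sub("", raw)
    return re.sub(r" +", " ", cleaned).strip()

print(clean_name("R@hul K#mar"))  # -> Rhul Kmar
```

Note that stripping is not the same as restoring: `R@hul` becomes `Rhul`, because the regex deletes the offending character rather than guessing what letter it replaced.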

๐Ÿ› ๏ธ Step 2๏ธโƒฃ: PySpark’s Code
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create a Spark session
spark = SparkSession.builder.appName("RegexNameCleansing").getOrCreate()

# Sample data with names containing special characters
data = [
    {"name": "R@hul K#mar"},
    {"name": "S@m!rtha P@tel"},
    {"name": "M!dhavi S#ngh"}
]

# Create DataFrame
df = spark.createDataFrame(data)

# Remove special characters from names using regex
df_cleaned = df.withColumn("cleaned_name", F.regexp_replace(F.col("name"), "[^a-zA-Z ]", ""))
df_cleaned.show(truncate=False)

Step 3: The Transformation
In the code above, PySpark’s regexp_replace function works its magic, delicately stripping away the unwanted special characters.

#onestepanalytics #PySpark #DataCleansing #RegexMagic #DataTransformation #DataQuality #dataengineering
