PySpark: Cleansing Data with Regex

๐Ÿ” Delving into PySpark: Cleansing Data with Regex Magic!โš™๏ธ
๐ŸŒŸ Example: Transforming Names with Special Characters ๐Ÿš€

Picture yourself in the realm of data, where you’ve stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.

Step 1: The Challenge
Imagine a dataset of Indian names, each potentially marred by special characters like !, @, or #. Your mission? To cleanse these names, removing the unwanted characters while preserving the beauty of Indian nomenclature.
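Before reaching for Spark, the character class used later, `[^a-zA-Z ]` ("anything that is not a letter or a space"), can be prototyped with Python's built-in `re` module. A minimal sketch (the helper name `clean_name` is illustrative, not part of any library):

```python
import re

# Keep only ASCII letters and spaces; everything else is stripped.
PATTERN = re.compile(r"[^a-zA-Z ]")

def clean_name(raw: str) -> str:
    """Remove special characters, then collapse any leftover runs of spaces."""
    cleaned = PATTERN.sub("", raw)
    return re.sub(r" +", " ", cleaned).strip()

print(clean_name("R@hul K#mar"))  # -> Rhul Kmar
```

Note that stripping is not the same as restoring: `R@hul` becomes `Rhul`, because the regex deletes the offending character rather than guessing what letter it replaced.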

๐Ÿ› ๏ธ Step 2๏ธโƒฃ: PySpark’s Code
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create a Spark session
spark = SparkSession.builder.appName("RegexNameCleansing").getOrCreate()

# Sample data with names containing special characters
data = [
    {"name": "R@hul K#mar"},
    {"name": "S@m!rtha P@tel"},
    {"name": "M!dhavi S#ngh"}
]

# Create DataFrame
df = spark.createDataFrame(data)

# Remove special characters from names using regex
df_cleaned = df.withColumn("cleaned_name", F.regexp_replace(F.col("name"), "[^a-zA-Z ]", ""))
df_cleaned.show(truncate=False)

Step 3: The Transformation
In the code above, PySpark’s regexp_replace function works its magic, delicately stripping away the unwanted special characters.

#onestepanalytics #PySpark #DataCleansing #RegexMagic #DataTransformation #DataQuality #dataengineering
