📕 How to create partitions based on year and month?
Data partitioning is critical to data processing performance, especially when processing large volumes of data in Spark.
Most traditional databases use DD-MM-YYYY as their default date format.
But cloud storage (Spark Delta Lake / Databricks tables) uses the YYYY-MM-DD format.
So here we will see how to convert DD-MM-YYYY to YYYY-MM-DD using the to_date() function in Spark.
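Outside Spark, the same format change can be sketched in plain Python with the standard datetime module (a minimal illustration of the DD-MM-YYYY to YYYY-MM-DD conversion, not Spark code; the helper name is hypothetical):

```python
from datetime import datetime

def ddmmyyyy_to_iso(value: str) -> str:
    # Parse the traditional DD-MM-YYYY string,
    # then re-emit it in the YYYY-MM-DD format Spark expects.
    return datetime.strptime(value, "%d-%m-%Y").strftime("%Y-%m-%d")

print(ddmmyyyy_to_iso("17-12-1980"))  # -> 1980-12-17
```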
to_date():
===========
This function, part of Spark SQL, converts a string (StringType) column to a date (DateType) column using the given format pattern.
code in scala:
=================
%scala
import org.apache.spark.sql.functions.{col, to_date}
import spark.sqlContext.implicits._

Seq(("06-03-2009"), ("07-24-2009")).toDF("Date")
  .select(col("Date"), to_date(col("Date"), "MM-dd-yyyy").as("to_date"))
  .show()
output:
==========
+----------+----------+
|      Date|   to_date|
+----------+----------+
|06-03-2009|2009-06-03|
|07-24-2009|2009-07-24|
+----------+----------+
This is one of the most common issues when migrating data from on-premises traditional databases.
Traditional Data format : DD-MM-YYYY
Spark Date format : YYYY-MM-DD
If the source column is a string type it will be converted; any value that does not match the given pattern is returned as null.
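Spark's to_date() returns null rather than raising an error on a mismatch. A hypothetical plain-Python helper that mimics that null-on-mismatch behavior:

```python
from datetime import date, datetime
from typing import Optional

def safe_to_date(value: str, pattern: str = "%d-%m-%Y") -> Optional[date]:
    # Mimic Spark's to_date(): return None (Spark's null)
    # when the value cannot be parsed with the given pattern.
    try:
        return datetime.strptime(value, pattern).date()
    except ValueError:
        return None

print(safe_to_date("17-12-1980"))  # parses fine
print(safe_to_date("1980-12-17"))  # pattern mismatch -> None
```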
date_format() in spark:
===========================
date_format() formats a Date column into a String in the given format.
Code in PySpark (the input file is provided in the comment section):
df1 = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/emp-3.csv")
df1.show()
df1.printSchema()

from pyspark.sql.functions import to_date, coalesce, lit

# Note: MM means month in Spark date patterns; lowercase mm means minutes.
# fillna() does not fill DateType columns, so use coalesce() for the default date.
df2 = df1.withColumn("HIREDATE", coalesce(to_date("HIREDATE", "dd-MM-yyyy"), to_date(lit("9999-12-31"))))

from pyspark.sql.functions import date_format
df3 = df2.withColumn("YEAR", date_format("HIREDATE", "yyyy")).withColumn("MONTH", date_format("HIREDATE", "MM"))
df3.show()

df3.write.option("header", True) \
    .partitionBy("YEAR", "MONTH") \
    .mode("overwrite") \
    .csv("/tmp/testdata")
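partitionBy("YEAR", "MONTH") writes one directory per key combination in Hive-style column=value form, e.g. /tmp/testdata/YEAR=1980/MONTH=12/. A plain-Python sketch of how those partition paths are derived from a HIREDATE value (an illustration of the layout only, not Spark internals; the helper name is hypothetical):

```python
from datetime import date

def partition_path(base: str, hiredate: date) -> str:
    # Mirror Spark's Hive-style partition layout: one column=value
    # directory per partition column, in partitionBy order.
    return f"{base}/YEAR={hiredate.strftime('%Y')}/MONTH={hiredate.strftime('%m')}"

print(partition_path("/tmp/testdata", date(1980, 12, 17)))
# -> /tmp/testdata/YEAR=1980/MONTH=12
```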