📕 How to create partitions based on year and month?
Data partitioning is critical to data processing performance, especially when processing large volumes of data in Spark.
Most traditional databases use DD-MM-YYYY as their default date format.
But cloud storage (Spark Delta Lake / Databricks tables) uses the YYYY-MM-DD format.
So here we will see how to convert DD-MM-YYYY to YYYY-MM-DD using the to_date() function in Spark.
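Outside Spark, the same format change can be sketched in plain Python with the standard datetime module (a minimal illustration of the DD-MM-YYYY to YYYY-MM-DD conversion, not Spark code; the helper name is hypothetical):

```python
from datetime import datetime

def ddmmyyyy_to_iso(value: str) -> str:
    # Parse the traditional DD-MM-YYYY string,
    # then re-emit it in the YYYY-MM-DD format Spark expects.
    return datetime.strptime(value, "%d-%m-%Y").strftime("%Y-%m-%d")

print(ddmmyyyy_to_iso("17-12-1980"))  # -> 1980-12-17
```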
to_date():
===========
This function, part of Spark SQL, converts a string (StringType) column to a date (DateType) column using the given format pattern.
code in scala:
=================
%scala
import org.apache.spark.sql.functions.{col, to_date}
import spark.sqlContext.implicits._

Seq(("06-03-2009"), ("07-24-2009")).toDF("Date")
  .select(col("Date"), to_date(col("Date"), "MM-dd-yyyy").as("to_date"))
  .show()
output:
==========
+----------+----------+
|      Date|   to_date|
+----------+----------+
|06-03-2009|2009-06-03|
|07-24-2009|2009-07-24|
+----------+----------+
This is one of the most common issues when migrating data from on-premises traditional databases.
Traditional Data format : DD-MM-YYYY
Spark Date format : YYYY-MM-DD
If the source column is a string type it will be converted; any value that does not match the given pattern is returned as null.
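Spark's to_date() returns null rather than raising an error on a mismatch. A hypothetical plain-Python helper that mimics that null-on-mismatch behavior:

```python
from datetime import date, datetime
from typing import Optional

def safe_to_date(value: str, pattern: str = "%d-%m-%Y") -> Optional[date]:
    # Mimic Spark's to_date(): return None (Spark's null)
    # when the value cannot be parsed with the given pattern.
    try:
        return datetime.strptime(value, pattern).date()
    except ValueError:
        return None

print(safe_to_date("17-12-1980"))  # parses fine
print(safe_to_date("1980-12-17"))  # pattern mismatch -> None
```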
date_format() in spark:
===========================
date_format() formats a Date column into a String in the given format.
Code in PySpark (the input file is provided in the comment section):
df1 = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/emp-3.csv")
df1.show()
df1.printSchema()

from pyspark.sql.functions import to_date, coalesce, lit

# Note: MM means month in Spark date patterns; lowercase mm means minutes.
# fillna() does not fill DateType columns, so use coalesce() for the default date.
df2 = df1.withColumn("HIREDATE", coalesce(to_date("HIREDATE", "dd-MM-yyyy"), to_date(lit("9999-12-31"))))

from pyspark.sql.functions import date_format
df3 = df2.withColumn("YEAR", date_format("HIREDATE", "yyyy")).withColumn("MONTH", date_format("HIREDATE", "MM"))
df3.show()

df3.write.option("header", True) \
    .partitionBy("YEAR", "MONTH") \
    .mode("overwrite") \
    .csv("/tmp/testdata")
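partitionBy("YEAR", "MONTH") writes one directory per key combination in Hive-style column=value form, e.g. /tmp/testdata/YEAR=1980/MONTH=12/. A plain-Python sketch of how those partition paths are derived from a HIREDATE value (an illustration of the layout only, not Spark internals; the helper name is hypothetical):

```python
from datetime import date

def partition_path(base: str, hiredate: date) -> str:
    # Mirror Spark's Hive-style partition layout: one column=value
    # directory per partition column, in partitionBy order.
    return f"{base}/YEAR={hiredate.strftime('%Y')}/MONTH={hiredate.strftime('%m')}"

print(partition_path("/tmp/testdata", date(1980, 12, 17)))
# -> /tmp/testdata/YEAR=1980/MONTH=12
```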