#PySpark_UDF_with_the_help_of_an_example
One of the most important extensibility features of Spark SQL and the DataFrame API is the PySpark UDF (User Defined Function), which is used to extend PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in conventional databases.
In PySpark, we write a Python function and either wrap it with PySpark SQL's udf() for use on DataFrames, or register it as a UDF for use in SQL queries.
Here is an example of how we can create a UDF.
First, we need to create a sample DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProjectPro").getOrCreate()

columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
Output:

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+
The next step is creating a Python function. The code below defines the convertCase() function, which accepts a string parameter and turns every word's initial letter into a capital letter.
def convertCase(s):
    resStr = ""
    arr = s.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
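It is worth noting that convertCase() appends a space after every word, so its output carries a trailing space. The plain-Python sketch below demonstrates this quirk and a join-based variant that avoids it (convertCase2 is an illustrative name of my own, not from the original post):

```python
# Quick check of convertCase() in plain Python (no Spark needed).
def convertCase(s):
    resStr = ""
    for x in s.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

print(repr(convertCase("john jones")))   # 'John Jones ' (note trailing space)

# Illustrative alternative: " ".join(...) avoids the trailing space.
def convertCase2(s):
    return " ".join(w[:1].upper() + w[1:] for w in s.split(" "))

print(repr(convertCase2("john jones")))  # 'John Jones'
```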
The final step is converting the Python function into a PySpark UDF.
By passing the function to PySpark SQL's udf(), we can convert the convertCase() function into a UDF. The udf() function lives in the pyspark.sql.functions module, so we must import it before use.
On the Python side, udf() returns a UserDefinedFunction object (the PySpark counterpart of Spark's org.apache.spark.sql.expressions.UserDefinedFunction), which can be applied to DataFrame columns like any built-in function.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Converting function to UDF
convertUDF = udf(lambda z: convertCase(z), StringType())
#sql #pyspark #apachespark #spark #analytics #window #interviewpreparation #tips #bigdata #datascience #businessanalyst #dataanalytics #bigdataengineer #databricks #deltalivetables #deltalake #dataengineer #ETL #transformation #analysis #extraction #load