Dynamic Column Handling in a File

----- Spark Interview Questions -----
📍Important Note: This scenario is a bit complex; I would suggest going through it multiple times. (code implementation is in #databricks)
📕How do we read a file with a variable/dynamic number of columns per row?
id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080

In this scenario we do not get complete columnar information; the number of fields varies from row to row.

PySpark code:
===============
dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv", """
id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
""")

Now let's create a DataFrame with a header:

df = spark.read.csv("/dbfs/tmp/dynamic_columns.csv", header=True)
display(df)

Spark will fill the missing trailing fields with null wherever no data is available.
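For intuition, plain Python's csv module behaves similarly when a header is known: csv.DictReader pads missing trailing fields with its restval (None by default), much like Spark's CSV reader fills them with null when header=True. This is only an analogy sketch, not Spark code:

```python
import csv
import io

# Sample data with a header but a varying number of fields per row
raw = """id,name,location,emaild,phone
1,aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
"""

# DictReader fills missing trailing fields with restval (None by default)
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    print(row)
```

Row 1 ends up with None for location, emaild, and phone, mirroring the nulls Spark produces.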

But what if the file has no header row at all?

dbutils.fs.put("/dbfs/tmp/dynamic_withoutcols.csv", """
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
""")

df1 = spark.read.csv("/dbfs/tmp/dynamic_withoutcols.csv")
display(df1)

This time Spark infers only two columns, because the first row contains just two fields. The resulting DataFrame therefore has two columns, and the extra fields in later rows are dropped:

+---+-----+
|_c0|  _c1|
+---+-----+
|  1| aman|
|  2| abhi|
|  3| john|
+---+-----+

So how do we handle such scenarios?

Don't use the CSV reader here; read the data as plain text instead. (Note: the method is spark.read.text; there is no spark.read.txt.)

df1 = spark.read.text("/dbfs/tmp/dynamic_withoutcols.csv")
df1.show(20, 60)

Now use the split() function to create a new column:

from pyspark.sql.functions import split

df1 = df1.withColumn("splittable_col", split("value", ","))
df1.show(truncate=30)

split() always returns an array column (a list of field values per row).
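As a quick mental model in plain Python (Spark's split() does the analogous thing to each row's string):

```python
# Plain-Python analogue of what split(",") produces for one row
line = "3,john,chennai,sample123@gmail.com,688080"
parts = line.split(",")
print(parts)       # a list of field values
print(len(parts))  # 5 -- what Spark's size() would report for this row
```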

Now we can drop the value column, since we no longer need it:

from pyspark.sql.functions import split

df1 = df1.withColumn("splittable_col", split("value", ",")).drop("value")
df1.show(truncate=30)

Now, to create the new columns dynamically, we first need to know the maximum length of the array.

Let's first find the length. The longest row is:

[3, john, chennai, sample123@gmail.com, 688080] --> maximum length is 5

To calculate the length dynamically, use the size() function:

from pyspark.sql.functions import size

df1.select("splittable_col", size("splittable_col").alias("total_length")).show(truncate=False)


Now we take the max value of that length and loop over the array indices:

from pyspark.sql.functions import size, max

max_len = df1.select(max(size("splittable_col"))).collect()[0][0]
for i in range(max_len):
    df1 = df1.withColumn("col" + str(i), df1["splittable_col"][i])
df1.show()

final_df = df1.drop("splittable_col")
final_df.show()

📕Summary of steps:
===============
1. Read the data as a text file.
2. Use the split() function to create an array column.
3. Use the size() function to measure the length of each array.
4. Run a for loop over the maximum length and create one new column per index.
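The whole pipeline can be checked in plain Python, as a toy sketch of the same split-measure-pad logic (this is not Spark code; the PySpark version above is what runs on Databricks):

```python
# Toy sketch of steps 1-4: read lines, split, find the max length, pad into columns
lines = [
    "1,aman",
    "2,abhi,Delhi",
    "3,john,chennai,sample123@gmail.com,688080",
]

# Step 2: split each line into a list of fields
split_rows = [line.split(",") for line in lines]

# Step 3: the maximum number of fields across all rows
max_len = max(len(row) for row in split_rows)  # 5 here

# Step 4: pad shorter rows with None so every row has max_len columns,
# mirroring the null-filled columns the PySpark loop produces
padded = [row + [None] * (max_len - len(row)) for row in split_rows]
for row in padded:
    print(row)
```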
