-----Spark Interview Questions-----
📍Important Note: This scenario is a bit complex; I would suggest going through it multiple times. (The code implementation is in #databricks.)
📕How do we read data when the number of columns varies from row to row?
id,name,location,email,phone
1,aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
In this scenario we do not get complete columnar information; the number of fields varies from row to row.
PySpark code:
===============
dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv", """
id,name,location,email,phone
1,aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
""")
Now let's create a DataFrame with a header.
df = spark.read.csv("/dbfs/tmp/dynamic_columns.csv", header=True)
display(df)
Spark fills the missing trailing fields with null.
But what happens if the file has no header row and we read it without header=True?
dbutils.fs.put("/dbfs/tmp/dynamic_withoutcols.csv", """
1,aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
""")
df1 = spark.read.csv("/dbfs/tmp/dynamic_withoutcols.csv")
display(df1)
Spark infers the column count from the first row, which contains only two fields.
So the resulting DataFrame contains just 2 columns, and the extra fields in the later rows are dropped.
+---+----+
|_c0| _c1|
+---+----+
|  1|aman|
|  2|abhi|
|  3|john|
+---+----+
So how do we handle such scenarios?
Don't read the data as CSV; read it as plain text instead (there is no spark.read.txt, the method is spark.read.text):
df1 = spark.read.text("/dbfs/tmp/dynamic_withoutcols.csv")
df1.show(20, 60)
Now use the split() function to create a new column:
from pyspark.sql.functions import split
df1 = df1.withColumn("splittable_col", split("value", ","))
df1.show(truncate=30)
The split() function always returns an array column.
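The effect of split(",") on these ragged rows can be illustrated in plain Python (a stand-in sketch, no Spark required): each row becomes a list, and the lists have different lengths.

```python
# Plain-Python sketch of what split(",") produces per ragged row.
# (Illustrative only -- Spark's split() does the same per-row work in parallel.)
rows = [
    "1,aman",
    "2,abhi,Delhi",
    "3,john,chennai,sample123@gmail.com,688080",
]

split_rows = [r.split(",") for r in rows]
for parts in split_rows:
    print(len(parts), parts)  # lengths vary: 2, 3, 5
```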
Now we can drop the value column, as we no longer need it.
from pyspark.sql.functions import split
df1 = df1.withColumn("splittable_col", split("value", ",")).drop("value")
df1.show(truncate=30)
Now, to create new columns dynamically, I must know the maximum length of the array.
So let's first find the length:
[3, john, chennai, sample123@gmail.com, 688080] --> maximum length is 5
To calculate the length dynamically, use the size() function:
from pyspark.sql.functions import size
df1.select("splittable_col", size("splittable_col").alias("total_length")).show(truncate=False)
Now we take the max value of that length and loop over the indexes:
from pyspark.sql.functions import size, max  # note: this max shadows Python's built-in max
max_len = df1.select(max(size("splittable_col"))).collect()[0][0]
for i in range(max_len):
    df1 = df1.withColumn("col" + str(i), df1["splittable_col"][i])
df1.show()
final_df = df1.drop("splittable_col")
final_df.show()
📕Summary of steps:
===============
1. Read the data as a text file.
2. Use the split() function to create an array column.
3. Use the size() function to measure the length of each array.
4. Run a for loop that iterates up to the maximum length and creates a new column per index.
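The four steps above can be verified end-to-end in plain Python (an illustrative stand-in, not the Spark API): split each line, find the maximum field count, then pad shorter rows with None, which mirrors the nulls Spark produces for the missing columns.

```python
# Plain-Python equivalent of the four steps (sketch, not Spark code).
lines = [
    "1,aman",
    "2,abhi,Delhi",
    "3,john,chennai,sample123@gmail.com,688080",
]

# Step 2: split each line into a list of fields.
split_rows = [line.split(",") for line in lines]

# Step 3: find the maximum number of fields across all rows.
max_len = max(len(parts) for parts in split_rows)

# Step 4: pad shorter rows with None (mirroring Spark's nulls)
# and name the columns col0, col1, ... just like the loop does.
columns = ["col" + str(i) for i in range(max_len)]
table = [parts + [None] * (max_len - len(parts)) for parts in split_rows]

print(columns)
for row in table:
    print(row)
```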