--------------- Spark Interview Questions ---------------

How to read a CSV file in Spark?

Method 1 (syntax: spark.read.csv("path")):
---------------
df = spark.read.csv("dbfs:/FileStore/small_zipcode.csv")
df.show()

+---+-------+--------+-------------------+-----+----------+
|_c0|    _c1|     _c2|                _c3|  _c4|       _c5|
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+

Method 2:
---------------
df = spark.read.format("csv").option("inferSchema", True).option("header", True).option("sep", ",").load("dbfs:/FileStore/small_zipcode.csv")
df.show()

+---+-------+--------+-------------------+-----+----------+
|... Continue Reading →
Free Spark Course
Don't pay for an Apache Spark course just because Spark is in demand. You can learn for free here:

1. Install Spark from here:
https://lnkd.in/gx_Dc8ph
https://lnkd.in/gg6-8xDz

2. Learn Spark basics from here:
https://lnkd.in/g-gCpUyi
https://lnkd.in/gkNhMnTZ
https://lnkd.in/gkbVB6YX

2.1 Learn Spark with Scala from here:
https://lnkd.in/gtrZAmn4

2.2 Learn Spark with Python from here:
https://lnkd.in/gQaeSjbH

3. Learn PySpark from here:
https://lnkd.in/g6kyihyW

4. Work on Spark projects from here:
https://lnkd.in/gE8hsyZx
https://lnkd.in/gwWytS-Q
https://lnkd.in/gR7DR6_5
https://lnkd.in/gzngHhrC
https://lnkd.in/gACn6bK8

5. Finally, list your projects here:
https://github.com/

I highly recommend... Continue Reading →
Git Guide
A Simplified Guide to Git with FREE Learning Resources

Git, the powerhouse of version control, transforms how developers manage code history and teamwork. Here's a quick breakdown:

• Distributed Brilliance: Git repositories are self-contained, offering flexibility and stability. Every developer holds a complete project history, fostering autonomy.

• Snapshot Mastery: Git captures "snapshots" of files, enabling easy... Continue Reading →
SCD 2 with Pyspark
Implementing a slowly changing dimension (SCD type 2) in PySpark. Earlier we saw the SQL version: https://lnkd.in/dH6j3MWE

# Define the schema for the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("active", BooleanType(), True),
    StructField("start", StringType(), True),
    StructField("end", StringType(), True)
])

Employee_data = [
    (1, "John", 100, "HR", True, '2023-10-20', None),
    (2, "Alice", 200, "Finance", True, '2023-10-20', None),
    (3, "Bob", 300, "Engineering", True, '2023-10-20', None),
    (4, "Jane",... Continue Reading →
Data Engineering Questions – 1
If your #dataengineering experience grows beyond 5 years, expect these questions in your interviews:

1. Explain the architecture of Spark.
2. How does internal job execution happen?
3. What happens when you fire a Spark job?
4. How did you tune your jobs?
5. Explain the optimizations you used in your project.
6. How did you connect... Continue Reading →
Data Masking in Pyspark
Hide a credit card number: accept a 16-digit credit card number from the user and display only the last 4 characters.

Input:  1234567891234567
Output: ************4567

We can use PySpark or Python.

Code in PySpark:
---------------------
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

# Create a SparkSession
spark = SparkSession.builder.appName("HideCreditCard").getOrCreate()

# Sample input credit card number
input_cc_number = "1234567891234567"

# Hide all characters except the last four... Continue Reading →
PySpark: Cleansing Data with Regex
Delving into PySpark: Cleansing Data with Regex Magic!

Example: Transforming Names with Special Characters

Picture yourself in the realm of data, where you've stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.

Step 1: The Challenge
Imagine a dataset of Indian... Continue Reading →
Cloud Data Engineering Road Map
Cloud Data Engineering Road Map

✔ Basic version control tool
https://lnkd.in/gEqyhzZR
https://lnkd.in/g_t2xKnG
https://lnkd.in/gZT7QNjS

✔ Data warehousing concepts
https://lnkd.in/gq99PDcp

✔ Core Python
https://lnkd.in/gQpmSnM

✔ Spark SQL
https://lnkd.in/gDcR5bwM

✔ Databricks
https://lnkd.in/gSpKBWbJ
https://lnkd.in/gpbMg9nU

✔ Spark
https://lnkd.in/gtqRtTPv
https://lnkd.in/gs2gkqRq

✔ PySpark
https://lnkd.in/gmkPpmAX
https://lnkd.in/gh-_KzjE

✔ Delta Lake
https://lnkd.in/gt6ggER6

✔ Cloud ETL tool + storage
https://lnkd.in/gTs8y4Ai

✔ Cloud MPP warehouse
https://lnkd.in/gMTHCrNZ

✔ Databricks Unity Catalog
https://lnkd.in/gH6Q2a5K

Learn, Lead and Make Leaders. Happy Learning!

Follow... Continue Reading →
PySpark: Sales Data Analysis
Exploring PySpark: Advanced Data Analysis

Scenario: Analyzing Multi-Dimensional Sales Data

Imagine being tasked with analyzing sales data that spans multiple dimensions, including time, regions, and product categories. To unlock insights from this complex dataset, PySpark's powerful capabilities come into play.

Step 1: Defining the Challenge
Your goal is to gain a comprehensive understanding of sales performance by... Continue Reading →
Learn Apache Spark Step by Step
Learn Apache Spark Step by Step (Follow the Sequence)

1. Getting started with Apache Spark
https://lnkd.in/gFRpe3-D

2. A quick introduction to the Spark API
https://lnkd.in/g8Y3tdhX

3. Overview of Spark: RDDs, accumulators, broadcast variables
https://lnkd.in/g7fepuFF

4. Spark SQL, Datasets, and DataFrames
https://lnkd.in/g3iZp7zk

5. PySpark: processing data with Spark in Python
https://lnkd.in/gBnh6PAi

6. Processing data with SQL on the command line
https://lnkd.in/ggnxDaUu

7. Cluster overview
https://lnkd.in/guCQnJnv

8. Packaging and deploying... Continue Reading →