INTERVIEW QUESTIONS ON APACHE SPARK, PYSPARK FOR DATA ENGINEERS

SET OF 82 QUESTIONS

1. How is Apache Spark different from MapReduce?

Apache Spark vs. MapReduce:
Spark processes data in batches as well as in near real time; MapReduce processes data in batches only.
Spark can run up to 100 times faster than Hadoop MapReduce for in-memory workloads; Hadoop MapReduce is slower when it comes to large-scale processing.
Spark stores data in RAM, i.e. in memory. So,... Continue Reading →

December 4, 2023 0

Google Cloud GCloud Commands Cheat Sheet

Google Cloud Config

PURPOSE                                    COMMAND
Show active configuration and project      gcloud config list, gcloud config list project
Show project info                          gcloud compute project-info describe
Switch project                             gcloud config set project <project-id>
Set the active account                     gcloud config set account <ACCOUNT>
Set default region                         gcloud config set compute/region us-west1
Set default zone                           gcloud config set compute/zone us-west1-b
List configurations                        gcloud config configurations list
Activate configuration                     gcloud config configurations activate

Google Cloud... Continue Reading →

December 4, 2023 0

Read CSV File by Spark

---------------Spark Interview Questions------------
📕 How to read a CSV file in Spark?

Method 1:
---------------
spark.read.csv("path")

df=spark.read.csv("dbfs:/FileStore/small_zipcode.csv")
df.show()
+---+-------+--------+-------------------+-----+----------+
|_c0|    _c1|     _c2|                _c3|  _c4|       _c5|
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+

Method 2:
--------------
df=spark.read.format("csv").option("inferSchema",True).option("header",True).option("sep",",").load("dbfs:/FileStore/small_zipcode.csv")
df.show()
+---+-------+--------+-------------------+-----+----------+
|... Continue Reading →

November 29, 2023 0

AWS Certification

FREE AWS Certificate by Amazon that you can't miss in 2023

1. Getting Started with Data Analytics on AWS
🔗 https://lnkd.in/dwRhRAzM
2. Practical Data Science on the AWS Cloud Specialization
🔗 https://lnkd.in/d3-3GZbG
3. Getting Started with AWS Machine Learning
🔗 https://lnkd.in/dhAp-Vjh
4. Introduction to Machine Learning on AWS
🔗 https://lnkd.in/detfDCWA
5. Hands-on Machine Learning with AWS and NVIDIA
🔗 https://lnkd.in/dgGvATq2
6. AWS Fundamentals Specialization
🔗 https://lnkd.in/dSV9jhRz
7. Building Modern Python Applications on AWS
🔗 https://lnkd.in/dQAinFGy
8. AWS... Continue Reading →

November 29, 2023 0

Free Spark Course

Don't pay for an Apache Spark course just because it is in demand. You can learn for free here:

1. Install Spark from here:
https://lnkd.in/gx_Dc8ph
https://lnkd.in/gg6-8xDz
2. Learn Spark basics from here:
https://lnkd.in/g-gCpUyi
https://lnkd.in/gkNhMnTZ
https://lnkd.in/gkbVB6YX
2.1 Learn Spark with Scala from here:
https://lnkd.in/gtrZAmn4
2.2 Learn Spark with Python from here:
https://lnkd.in/gQaeSjbH
3. Learn PySpark from here:
https://lnkd.in/g6kyihyW
4. Work on Spark projects from here:
https://lnkd.in/gE8hsyZx
https://lnkd.in/gwWytS-Q
https://lnkd.in/gR7DR6_5
https://lnkd.in/gzngHhrC
https://lnkd.in/gACn6bK8
5. Finally, list your projects here:
https://github.com/

I highly recommend... Continue Reading →

November 28, 2023 0

System Design Blogs

30 Blogs to learn 30 System Design Concepts:

1) Content Delivery Network (CDN): https://lnkd.in/gjJrEJeH
2) Caching: https://lnkd.in/gC9piQbJ
3) Distributed Caching: https://lnkd.in/g7WKydNg
4) Latency vs Throughput: https://lnkd.in/g_amhAtN
5) CAP Theorem: https://lnkd.in/g3hmVamx
6) Load Balancing: https://lnkd.in/gQaa8sXK
7) ACID Transactions: https://lnkd.in/gMe2JqaF
8) SQL vs NoSQL: https://lnkd.in/g3WC_yxn
9) Consistent Hashing: https://lnkd.in/gd3eAQKA
10) Database Index: https://lnkd.in/gCeshYVt
11) Rate Limiting: https://lnkd.in/gWsTDR3m
12) Microservices Architecture: https://lnkd.in/gFXUrz_T
13) Strong vs Eventual Consistency: https://lnkd.in/gJ-uXQXZ
14) REST vs RPC:... Continue Reading →

November 24, 2023 0

PySpark: Cleansing Data with Regex

🔍 Delving into PySpark: Cleansing Data with Regex Magic! ⚙️

🌟 Example: Transforming Names with Special Characters 🚀

Picture yourself in the realm of data, where you've stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.

🔑 Step 1️⃣: The Challenge
Imagine a dataset of Indian... Continue Reading →
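
The teaser above stops before showing any code, so here is a minimal sketch of the general idea in plain Python rather than the post's own solution. The character class, the helper name clean_name, and the sample names are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern (an assumption, not from the post):
# keep ASCII letters and spaces, drop every other character.
SPECIAL_CHARS = r"[^A-Za-z ]"


def clean_name(raw: str) -> str:
    """Remove special characters, then collapse repeated whitespace."""
    no_specials = re.sub(SPECIAL_CHARS, "", raw)
    return re.sub(r"\s+", " ", no_specials).strip()


# Made-up sample values, not taken from the original dataset.
samples = ["Ra@hul# Sharma", "Pri$ya  Nair!", "  Amit_ Kumar "]
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

In PySpark, a character class like this one could likely be passed to pyspark.sql.functions.regexp_replace(column, pattern, "") to clean a whole column, with trim applied afterwards. Note that a letters-only pattern also strips accented and non-Latin characters, so it would need adjusting for names written outside plain ASCII.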

November 15, 2023 0

Spark – BTS

Internal working of Apache Spark (don't forget to save it)

Apache Spark works on the principle of in-memory computation, which can make it up to 100x faster than disk-based Hadoop MapReduce and a highly performant distributed framework.

Here is a detailed explanation of what happens internally when a Spark job is executed using the spark-submit command:

📋 Step 1: Client application initiates the execution... Continue Reading →

October 27, 2023 1