Spark Interview Questions

Important note: This scenario is a bit complex; I would suggest going through it multiple times. (The code implementation is in #databricks.)

How do you handle, or read, a variable/dynamic number of columns?

id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080

In this scenario, we do not get complete column information; the number of columns varies from row to row.

PySpark code:

dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv","""id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080""")

Now let's... Continue Reading →
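The idea behind the scenario above (rows that carry fewer fields than the header) can be illustrated outside Spark with a minimal plain-Python sketch. This is not the post's Databricks implementation, which is truncated here; the function name `read_ragged_csv` and the choice of `None` for missing values are assumptions made for illustration only.

```python
import csv
import io

# Sample data copied from the post: rows carry fewer fields than the header.
RAW = """id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080"""


def read_ragged_csv(text):
    """Parse CSV text and pad short rows with None up to the header width.

    Hypothetical helper for illustration; not part of any library.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        # Strip stray whitespace such as the space before "aman".
        fields = [f.strip() for f in fields]
        # Pad the missing trailing columns (assumption: None marks "absent").
        padded = fields + [None] * (len(header) - len(fields))
        rows.append(dict(zip(header, padded)))
    return header, rows


header, rows = read_ragged_csv(RAW)
for row in rows:
    print(row)
```

Every row ends up with the same five keys, so the result can be loaded into a fixed-schema table; the missing trailing values simply become `None`.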
Azure Data Engineering by Deepak Goyal
List of all Azure / data / DevOps / ML interview Q&A. Save & Share.

1. Azure Data Factory Interview Q&A: https://lnkd.in/dVzCmzcZ
2. Azure Databricks Scenario based Interview Q&A: https://lnkd.in/dUCf8qf8
3. Realtime Azure Data Factory Interview Q&A: https://lnkd.in/ex_Vixh
4. Latest Azure DevOps Interview Q&A: https://lnkd.in/g7PdATm
5. Azure Active Directory Interview Q&A: https://lnkd.in/dtWYXTKN
6. Azure Data Lake Interview Q&A: https://lnkd.in/dgr-uGQB
7. Azure App Service Interview Q&A: https://lnkd.in/dP4Afqkb
8. Azure Data Engineer Interview Q&A: https://lnkd.in/dj_m2yeQ
9.... Continue Reading →
AWS Solution Architect in 2 months – Road Map
An 8-week journey toward becoming an AWS Solutions Architect Associate. Here's the breakdown:

Week 1: AWS Fundamentals
- Introduction to AWS: Discover the basics and the core services that form the backbone of AWS.
- AWS Free Tier Account: Learn how to set up an account to leverage AWS's free offerings.
- AWS Management Console: Navigate the user interface to... Continue Reading →
Spotify Cloud Project
Spotify Stream Analytics

Built a synthetic data pipeline for real-time music insights, dashboards, and actionable decisions.

Project Overview: Addresses limited access to Spotify stream data with a synthetic pipeline. Realistic events stream to Kafka, are processed by Spark, and are stored in Delta Lake. Airflow orchestrates the pipeline, and dbt transforms the data that feeds the dashboards.

Key Features:
Streamlined Infrastructure: Scripts... Continue Reading →
Caching in PySpark
Internals of Caching in PySpark

Caching DataFrames in PySpark is a powerful technique to improve query performance. However, there's a subtle difference in how you can cache DataFrames in PySpark. cached_df = orders_df.cache() and orders_df.cache() are two common approaches, and they serve different purposes. The choice between these two depends on your specific use case and whether you... Continue Reading →
Google Cloud Associate Cloud Engineer (ACE) Resources
I receive 10+ DMs daily asking how to start a journey in Google Cloud, so I have curated a complete list of resources for the Google Cloud Associate Cloud Engineer (ACE).

1. Basics of Linux commands - https://lnkd.in/dN5BPhTq
2. File system - https://lnkd.in/dkEAA_qU
3. Linux Files Hierarchy Structure - https://lnkd.in/d8hQR5m4
4. Linux Directory Hierarchy Structure - https://lnkd.in/dWMNd6J9
5. Associate Cloud Engineer... Continue Reading →
Big Data Learning Resources
Complete plan to learn Big Data step by step (all free resources included), by Sumit Sir.

1. Learn SQL Basics - https://lnkd.in/g9NEJMVE
SQL will be used in a lot of places: Hive, Spark SQL, and RDBMS queries. Joins and windowing functions are very important.

2. Learn Programming/Python for Data Engineering - https://lnkd.in/gr6fFPdU
Learn Python to the extent required for Data Engineers.

3. Learn... Continue Reading →
Cloud Services in one line
If you are an aspiring Data Engineer, then you must know these cloud services on AWS, Azure, or GCP. Save this post for future reference.

1. Amazon Web Services (AWS)
- AWS Data Pipeline: For creating complex data processing workloads.
- AWS Glue: Our favourite fully managed ETL service.
- Amazon S3: An object storage service... Continue Reading →
Google Cloud Developer's Cheat Sheet
All Products

Compute
- Cloud Run: Serverless for containerized applications
- Cloud Functions: Event-driven serverless functions
- Compute Engine: VMs, GPUs, TPUs, Disks
- Kubernetes Engine (GKE): Managed Kubernetes/containers
- App Engine: Managed app platform
- Bare Metal Solution: Hardware for specialized workloads
- Preemptible VMs: Short-lived compute instances
- Shielded VMs: Hardened VMs
- Sole-tenant nodes: Dedicated physical servers

Storage
- Cloud Filestore: Managed... Continue Reading →
INTERVIEW QUESTIONS ON APACHE SPARK, PYSPARK FOR DATA ENGINEERS
SET OF 82 QUESTIONS

1. How is Apache Spark different from MapReduce?

Apache Spark | MapReduce
Spark processes data in batches as well as in real time | MapReduce processes data in batches only
Spark can run up to 100 times faster than Hadoop MapReduce for in-memory workloads | Hadoop MapReduce is slower when it comes to large-scale processing
Spark stores data in the RAM, i.e. in-memory. So,... Continue Reading →