Dynamic Column handling in file

----------- Spark Interview Questions -----------

📍 Important note: This scenario is a bit complex, so I suggest going through it multiple times. (The code implementation is in #databricks.)

📕 How do you handle or read a variable/dynamic number of columns?

id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080

In this scenario we do not get complete columnar information; the number of columns varies from row to row.

PySpark code:
===============
dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv","""id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080""")

Now let's... Continue Reading →
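The excerpt stops before the author's Databricks code, so that implementation is not reproduced here. As a minimal sketch of the underlying idea in plain Python (not PySpark, and not the author's method), short rows can be padded with None up to the header width. The names `SAMPLE` and `read_ragged_csv` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Same sample data as in the post: rows carry fewer fields than the header.
SAMPLE = """id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080"""


def read_ragged_csv(text):
    """Parse CSV text whose rows may be shorter than the header.

    Missing trailing fields are filled with None. Rows longer than the
    header are truncated to the header width (an assumption of this sketch).
    """
    # skipinitialspace drops the blank after the comma in "1, aman".
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    rows = []
    for fields in reader:
        padded = fields + [None] * (len(header) - len(fields))
        rows.append(dict(zip(header, padded)))
    return header, rows


header, rows = read_ragged_csv(SAMPLE)
for row in rows:
    print(row)
```

Padding to the header width keeps every record the same shape, which is what a fixed-schema table needs before the data can be loaded into a DataFrame.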

December 25, 2023 0

Spotify Cloud Project

Spotify Stream Analytics 🎥

Built a synthetic data pipeline for real-time music insights, stunning dashboards, and actionable decisions.

🌟 Project Overview: Addresses limited access to Spotify stream data with a synthetic pipeline. Realistic events stream to Kafka, are processed by Spark, and are stored in Delta Lake. Airflow ensures a seamless pipeline, and dbt transforms the data into captivating dashboards.

📌 Key Features:
Streamlined Infrastructure: Scripts... Continue Reading →
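To make the "synthetic events" idea concrete, here is a small sketch of an event generator in plain Python. It is not the project's code: the field names (`event_id`, `user_id`, `track_id`, `listened_ms`, `ts`), the track labels, and the functions `generate_events` and `serialize` are all hypothetical. In a real pipeline the serialized bytes would be handed to a Kafka producer; here they are only printed.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def generate_events(n, seed=0):
    """Yield n synthetic listening events (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    start = datetime(2023, 12, 17, tzinfo=timezone.utc)
    for i in range(n):
        yield {
            "event_id": i,
            "user_id": rng.randint(1, 5),
            "track_id": rng.choice(["track_a", "track_b", "track_c"]),
            "listened_ms": rng.randint(5_000, 240_000),
            "ts": (start + timedelta(seconds=i)).isoformat(),
        }


def serialize(event):
    """Encode one event as UTF-8 JSON bytes, a common message payload format."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


events = list(generate_events(3))
for event in events:
    print(serialize(event))
```

Seeding the generator makes the stream repeatable, which helps when checking downstream Spark or dbt logic against known input.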

December 17, 2023 0

Caching in PySpark

Internals of Caching in PySpark

Caching DataFrames in PySpark is a powerful technique to improve query performance. However, there's a subtle difference in how you can cache DataFrames in PySpark. cached_df = orders_df.cache() and orders_df.cache() are two common approaches, and they serve different purposes. The choice between these two depends on your specific use case and whether you... Continue Reading →

December 13, 2023 0

Google Cloud Associate Cloud Engineer (ACE) Resources

I receive 10+ DMs daily asking how to start a journey in Google Cloud, so I have curated a complete list of resources for the Google Cloud Associate Cloud Engineer (ACE).

1. Basics of Linux commands - https://lnkd.in/dN5BPhTq
2. File system - https://lnkd.in/dkEAA_qU
3. Linux Files Hierarchy Structure - https://lnkd.in/d8hQR5m4
4. Linux Directory Hierarchy Structure - https://lnkd.in/dWMNd6J9
5. Associate Cloud Engineer... Continue Reading →

December 12, 2023 0

Big Data Learning Resources

Complete plan to learn Big Data step by step (all free resources included) by Sumit Sir.

1. Learn SQL Basics - https://lnkd.in/g9NEJMVE
SQL is used in a lot of places: Hive, Spark SQL, and RDBMS queries. Joins and windowing functions are very important.
2. Learn Programming/Python for Data Engineering - https://lnkd.in/gr6fFPdU
Learn Python to the extent required for Data Engineers.
3. Learn... Continue Reading →

December 6, 2023 0

Cloud Services in one line

If you are an aspiring Data Engineer, then you must know these cloud services for AWS, Azure, or GCP 👇 Save this post for future reference.

1️⃣ Amazon Web Services (AWS)
🛠 AWS Data Pipeline: For creating complex data processing workloads.
📊 AWS Glue: Our favourite fully managed ETL service.
💾 Amazon S3: An object storage service... Continue Reading →

December 6, 2023 0

Google Cloud Developer’s Cheat Sheet

All Products

Compute
Cloud Run: Serverless for containerized applications
Cloud Functions: Event-driven serverless functions
Compute Engine: VMs, GPUs, TPUs, Disks
Kubernetes Engine (GKE): Managed Kubernetes/containers
App Engine: Managed app platform
Bare Metal Solution: Hardware for specialized workloads
Preemptible VMs: Short-lived compute instances
Shielded VMs: Hardened VMs
Sole-tenant nodes: Dedicated physical servers

Storage
Cloud Filestore: Managed... Continue Reading →

December 4, 2023 0

INTERVIEW QUESTIONS ON APACHE SPARK, PYSPARK FOR DATA ENGINEERS

SET OF 82 QUESTIONS

1. How is Apache Spark different from MapReduce?

Apache Spark | MapReduce
Spark processes data in batches as well as in real time | MapReduce processes data in batches only
Spark runs almost 100 times faster than Hadoop MapReduce | Hadoop MapReduce is slower when it comes to large-scale processing
Spark stores data in the RAM, i.e. in-memory. So,... Continue Reading →

December 4, 2023 0