Below is a detailed response to your questions about Unity Catalog in Databricks, organized by the sections you provided. Each answer includes explanations, examples, and practical insights where applicable, aiming to provide a comprehensive understanding suitable for both foundational and advanced scenarios. --- ### Basic Understanding #### 1. What is Unity Catalog in Databricks? Unity Catalog is a unified... Continue Reading →
An Azure pipeline usually runs for 2 hours but is currently running for 10 hours. Find the bottleneck in the pipeline.
To identify the bottleneck in an Azure Pipeline that's running for 10 hours instead of the usual 2 hours, you need to systematically analyze the pipeline's execution. Here's a step-by-step approach to pinpoint the issue: ### 1. **Check Pipeline Logs and Execution Details** - **Action**: Navigate to the Azure DevOps portal, open the pipeline run, and... Continue Reading →
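One way to automate that first step is to pull the run's timeline from the Azure DevOps REST API (`GET .../_apis/build/builds/{buildId}/timeline`) and rank tasks by elapsed time. Below is a minimal sketch of only the ranking logic; the sample payload shape follows the Timeline response's `records` array, but the field values here are hypothetical, and fetching the JSON (with `requests` and a PAT) is left out:

```python
from datetime import datetime

def slowest_steps(timeline, top_n=3):
    """Return the top_n 'Task' records from a build timeline,
    sorted by elapsed time, longest first."""
    def duration(rec):
        # Real responses use ISO-8601 timestamps (often with a trailing 'Z'
        # that may need stripping before fromisoformat on older Pythons).
        start = datetime.fromisoformat(rec["startTime"])
        finish = datetime.fromisoformat(rec["finishTime"])
        return (finish - start).total_seconds()

    tasks = [r for r in timeline["records"]
             if r.get("type") == "Task" and r.get("startTime") and r.get("finishTime")]
    return sorted(tasks, key=duration, reverse=True)[:top_n]

# Hypothetical timeline fragment shaped like the Build Timeline response:
sample = {"records": [
    {"name": "Checkout", "type": "Task",
     "startTime": "2024-01-01T00:00:00", "finishTime": "2024-01-01T00:02:00"},
    {"name": "Run tests", "type": "Task",
     "startTime": "2024-01-01T00:02:00", "finishTime": "2024-01-01T08:02:00"},
    {"name": "Publish", "type": "Task",
     "startTime": "2024-01-01T08:02:00", "finishTime": "2024-01-01T08:05:00"},
]}
print([r["name"] for r in slowest_steps(sample, 2)])  # → ['Run tests', 'Publish']
```

A task that balloons from minutes to hours (here, "Run tests") stands out immediately, which is usually faster than reading raw logs top to bottom.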
Processing 10 TB of Data in Databricks!!
Interviewer: Let's assume you're processing 10 TB of data in Databricks. How would you configure the cluster to optimize performance? Candidate: To process 10 TB of data efficiently, I would recommend a cluster configuration with a large number of nodes and sufficient memory. First, I would estimate the number of partitions required to process the data in... Continue Reading →
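The partition estimate in the answer above can be sketched as back-of-the-envelope arithmetic. The 128 MB target partition size and the tasks-per-core ratio below are common rules of thumb, not fixed rules, so treat the numbers as a starting point for tuning:

```python
def estimate_partitions(data_size_tb, target_partition_mb=128):
    """Estimate input/shuffle partition count from total data size,
    assuming ~128 MB per partition (a common Spark guideline)."""
    total_mb = data_size_tb * 1024 * 1024
    return total_mb // target_partition_mb

def estimate_cores(partitions, tasks_per_core=3):
    """Rough core count so each core runs a few task 'waves'
    rather than one giant wave (2-3 tasks per core is typical)."""
    return partitions // tasks_per_core

parts = estimate_partitions(10)   # 10 TB at ~128 MB per partition
print(parts)                      # → 81920
print(estimate_cores(parts))      # → 27306
```

From there you would pick a node type and count whose total cores approach that figure within budget, and set `spark.sql.shuffle.partitions` in the same ballpark.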
Data migration from DB2 to Azure Data Lake Storage
Below is an example PySpark script to load data from a DB2 table into an Azure Data Lake table. The script is optimized for handling high-volume data efficiently by leveraging Spark's distributed computing capabilities. Prerequisites: Spark Configuration: ensure Spark is configured with the necessary dependencies: spark-sql-connector for Azure Data Lake Gen2; db2jcc driver for connecting to DB2. Azure Authentication:... Continue Reading →
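The core of such a script is the JDBC read configuration. The helper below is a sketch that only builds the options dict for Spark's JDBC source (`partitionColumn`/`numPartitions`/`lowerBound`/`upperBound` enable parallel reads); the host, database, table, and bound values are hypothetical placeholders:

```python
def db2_jdbc_options(host, port, database, table, user, password,
                     partition_column=None, num_partitions=8,
                     lower_bound=None, upper_bound=None):
    """Build options for spark.read.format('jdbc') against a DB2 source.
    When partition_column is set, Spark issues parallel range queries."""
    opts = {
        "url": f"jdbc:db2://{host}:{port}/{database}",
        "driver": "com.ibm.db2.jcc.DB2Driver",   # from the db2jcc driver jar
        "dbtable": table,
        "user": user,
        "password": password,
        "fetchsize": "10000",  # larger fetch size cuts JDBC round trips
    }
    if partition_column:
        opts.update({
            "partitionColumn": partition_column,
            "numPartitions": str(num_partitions),
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
        })
    return opts

# Hypothetical connection details for illustration:
opts = db2_jdbc_options("db2host", 50000, "SALESDB", "SCHEMA.ORDERS",
                        "etl_user", "secret", partition_column="ORDER_ID",
                        num_partitions=16, lower_bound=1, upper_bound=1_000_000)
# In a Spark session you would then run, roughly:
#   df = spark.read.format("jdbc").options(**opts).load()
#   df.write.format("delta").mode("overwrite") \
#     .save("abfss://container@account.dfs.core.windows.net/orders")
```

Choosing a roughly uniformly distributed numeric key as `partitionColumn` matters most; a skewed key leaves one executor doing the bulk of the read.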
Python Programming Interview Questions for Entry-Level Data Analysts 🐍
Are you ready to take your Python skills to the next level? Delve into these essential interview questions designed specifically for entry-level data analysts. Here are detailed answers, with examples: 1. What is Python, and why is it popular in data analysis? Python... Continue Reading →
Low Level System design articles
These articles will save you 50+ hours of hopping between resources. 1) Scalability: https://lnkd.in/gq4hW9qx 2) Horizontal vs Vertical Scaling: https://lnkd.in/g8qcwRCy 3) Latency vs Throughput: https://lnkd.in/gDAx6QQd 4) Load Balancing: https://lnkd.in/gefSiXEJ 5) Caching: https://lnkd.in/gAp-9udf 6) ACID Transactions: https://lnkd.in/g-sjsMwX 7) SQL vs NoSQL: https://lnkd.in/gwCe58TU 8) Database Indexes: https://lnkd.in/gE_q5m_g 9) Database Sharding: https://lnkd.in/gFdNxDrU 10) Content Delivery... Continue Reading →
Databricks Learning Path
If you know how to work with Databricks, it helps a lot in your data engineering job… You can learn Databricks here… 1. Learn Databricks basics: https://lnkd.in/gQNKd8HE https://lnkd.in/gf_-6EEg 2. PySpark with Databricks: https://lnkd.in/g2iTevyJ 2.1 Azure Databricks with Python: https://lnkd.in/gyeNtq8n 2.2 Databricks with Scala: https://lnkd.in/gzMAcm3s 2.3 Databricks with SQL: https://lnkd.in/gdby9_bj 3. Databricks with Spark: https://lnkd.in/g-YT-qiF 4. Databricks on AWS: https://lnkd.in/gYcxe8Tn 5. Official guide to learn Databricks: https://lnkd.in/gt8sQeeH 6. Databricks projects: https://lnkd.in/gtpa7jhR https://lnkd.in/gdWUBUN9 follow this... Continue Reading →
Partition Scenario with Pyspark
📕 How do you create partitions based on year and month? Data partitioning is critical to data processing performance, especially for large volumes of data in Spark. Most traditional databases default to the DD-MM-YYYY date format, but cloud storage (Spark Delta Lake/Databricks tables) uses YYYY-MM-DD. So here we will see how to... Continue Reading →
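The conversion the post describes (a DD-MM-YYYY source date turned into an ISO date plus year/month partition keys) can be sketched in plain Python; the column name and sample date below are illustrative:

```python
from datetime import datetime

def to_partition_keys(dd_mm_yyyy):
    """Parse a DD-MM-YYYY source date and return (iso_date, year, month),
    the values you would derive before a partitionBy('year', 'month') write."""
    d = datetime.strptime(dd_mm_yyyy, "%d-%m-%Y").date()
    return d.isoformat(), d.year, d.month

print(to_partition_keys("25-03-2024"))  # → ('2024-03-25', 2024, 3)
```

In PySpark the equivalent is `to_date(col("dt"), "dd-MM-yyyy")` followed by the `year()`/`month()` functions, then `df.write.partitionBy("year", "month")` (here `dt` is an assumed column name).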
Incremental Loading with CDC using Pyspark
⏫ Incremental Loading technique with Change Data Capture (CDC): ➡️ Incremental Load with Change Data Capture (CDC) is a strategy in data warehousing and ETL (Extract, Transform, Load) processes where only the changed or newly added data is loaded from source systems to the target system. CDC is particularly useful in scenarios where processing the... Continue Reading →
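The upsert at the heart of CDC-based incremental loading can be sketched with plain Python rows; in production this is what a Delta Lake `MERGE INTO` (matched → update, not matched → insert) performs at scale. The key and timestamp column names below are assumptions:

```python
def incremental_merge(target, changes, key="id", ts="updated_at"):
    """Apply CDC rows to the target: insert new keys, and update existing
    keys only when the incoming row is newer (a simplified MERGE)."""
    merged = {row[key]: row for row in target}
    for row in changes:
        existing = merged.get(row[key])
        if existing is None or row[ts] > existing[ts]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "name": "a", "updated_at": "2024-01-01"},
          {"id": 2, "name": "b", "updated_at": "2024-01-01"}]
changes = [{"id": 2, "name": "b2", "updated_at": "2024-02-01"},  # update
           {"id": 3, "name": "c",  "updated_at": "2024-02-01"}]  # insert
print(incremental_merge(target, changes))
```

The timestamp comparison is the watermark idea: only rows changed since the last load are shipped, and stale duplicates arriving out of order are ignored.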
Cloud Services in one line
If you are an aspiring Data Engineer, then you must know these cloud services across AWS, Azure, and GCP 👇 Save this post for future reference... 1️⃣ Amazon Web Services (AWS) 🛠 AWS Data Pipeline: for creating complex data processing workloads. 📊 AWS Glue: our favourite fully managed ETL service. 💾 Amazon S3: an object storage service... Continue Reading →