Data Engineer Interview – Thinking with Numbers 🧮

Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate: I don’t guess. I calculate.

🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required: 1,024... Continue Reading →
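A minimal sketch of the candidate's arithmetic, assuming the 128 MB partition target from the excerpt; the 4-cores-per-executor figure is an assumed rule of thumb, not from the post:

# Back-of-envelope Spark cluster sizing (sketch, not a definitive formula).
TOTAL_DATA_GB = 1024        # 1 TB ≈ 1,024 GB
PARTITION_MB = 128          # target partition size from the excerpt
CORES_PER_EXECUTOR = 4      # assumption: one partition per core at a time

total_mb = TOTAL_DATA_GB * 1024
partitions = total_mb // PARTITION_MB                      # 8,192 partitions
executors_one_wave = -(-partitions // CORES_PER_EXECUTOR)  # ceiling division

# In practice you provision far fewer executors and process in waves,
# e.g. 64 executors * 4 cores = 256 tasks per wave -> 32 waves.
print(partitions, executors_one_wave)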
25 blogs, 25 data engineering concepts
👇 25 blogs to guide you through every important concept 👇
1. Data Lake vs Data Warehouse → https://lnkd.in/gEpmTyMS
2. Delta Lake Architecture → https://lnkd.in/gk5x5uqR
3. Medallion Architecture → https://lnkd.in/gmyMpVpT
4. ETL vs ELT → https://lnkd.in/gvg3hgqe
5. Apache Airflow Basics → https://lnkd.in/gGwkvCXd
6. DAG Design Patterns → https://lnkd.in/gHTKQWyR
7. dbt Core Explained → https://lnkd.in/g5mQi8-y
8. Incremental Models in dbt → https://lnkd.in/gS25HCez
9. Spark Transformations vs Actions → https://lnkd.in/g2RRCGMW
10. Partitioning in Spark → https://lnkd.in/g5fXjSJD
11. Window Functions... Continue Reading →
PySpark SQL Cheatsheet
Here's a PySpark SQL cheatsheet, covering common operations and concepts. This is designed to be a quick reference for those working with PySpark DataFrames and SQL-like operations.

PySpark SQL Cheatsheet

1. Initialization & Data Loading

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("PySparkSQLCheatsheet") \
    .getOrCreate()

# Load Data (e.g., CSV, Parquet)
df_csv... Continue Reading →
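The excerpt cuts off at the CSV load; here is a plausible continuation of that section under the same naming (`df_csv` comes from the excerpt, while the file paths, view name, and columns are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkSQLCheatsheet").getOrCreate()

# Load data (paths are illustrative)
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("data/people.parquet")

# Register a temp view so SQL can be used directly
df_csv.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()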
How to reflect data in a Trino catalog table using a Parquet file generated from Databricks
To reflect data in a Trino catalog table using a Parquet file stored in an **Azure Blob Storage** container (generated from Databricks), follow these steps:

1. **Generate Parquet File in Databricks**: In Databricks, write your data to a Parquet file stored in an Azure Blob Storage container. Use the `abfss` protocol for Azure Data Lake... Continue Reading →
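A sketch of both halves of this flow. The storage account, container, schema, and column names are assumptions, and the DDL assumes Trino's Hive connector (where `external_location` and `format` are standard table properties):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["event_id", "payload"])

# Databricks side: write Parquet to ADLS Gen2 (names are illustrative).
path = "abfss://mycontainer@myaccount.dfs.core.windows.net/tables/events"
df.write.mode("overwrite").parquet(path)

# Trino side (Hive connector): an external table over that location.
# Run this in Trino, not Spark; shown here as a string for reference.
trino_ddl = """
CREATE TABLE hive.analytics.events (
    event_id BIGINT,
    payload  VARCHAR
)
WITH (
    external_location = 'abfss://mycontainer@myaccount.dfs.core.windows.net/tables/events',
    format = 'PARQUET'
)
"""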
Walmart Interview
Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience:

---

### **Round 1: Technical Interview 1**

1. **Question**: Can you describe your role and responsibilities in your recent project?
   **Answer**: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →
Cloud Operations Architecture Interview Questions
Provide detailed answers with scenarios for the questions below (a sketch of the Cost Explorer scenario follows this excerpt).

Cloud Operations Architecture Interview Questions:

1. How would you implement Infrastructure as Code (IaC) in a cloud environment?
   Scenario: Using Terraform to manage AWS resources, enabling version control and reusable configurations.

2. Describe your approach to cost optimization in cloud solutions.
   Scenario: Using AWS Cost Explorer to identify underutilized resources and implement... Continue Reading →
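For the cost-optimization scenario in question 2, a hedged sketch using boto3's `get_cost_and_usage` call; the date range and grouping are illustrative, and configured AWS credentials with Cost Explorer access are assumed:

import boto3

ce = boto3.client("ce")  # Cost Explorer client; assumes credentials are configured

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Rank services by spend to surface right-sizing candidates.
groups = response["ResultsByTime"][0]["Groups"]
for g in sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True):
    print(g["Keys"][0], g["Metrics"]["UnblendedCost"]["Amount"])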
Big Data Engineering Interview Series – 2
**Big Data Interview Questions - Detailed Answers**

Below are detailed answers to the questions from the interview discussion, focusing on Cloud Data Engineering, Azure, Spark, SQL, and Python. Each answer is comprehensive, addressing the concepts, their applications, and practical considerations, without timestamps.

---

1. **Project Discussion**
   In a Cloud Data Engineering interview, the project discussion requires explaining... Continue Reading →
Big Data Engineering Interview Series – 1
**Top Big Data Interview Questions (2024) - Detailed Answers**

1. **What is Hadoop and how does it work?**
   Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It consists of two main components: Hadoop Distributed File System (HDFS) for fault-tolerant storage, which splits data into blocks... Continue Reading →
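To make the block-splitting point concrete, a small sketch of HDFS storage math; the 128 MB block size and replication factor 3 are common defaults but are assumptions here, since the actual values come from cluster configuration:

import math

BLOCK_SIZE_MB = 128   # assumed dfs.blocksize default
REPLICATION = 3       # assumed dfs.replication default

file_size_mb = 1000   # a ~1 GB file, illustrative

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # 8 blocks
raw_storage_mb = file_size_mb * REPLICATION        # ~3,000 MB on disk

print(f"{blocks} blocks, ~{raw_storage_mb} MB raw storage across the cluster")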
An Azure pipeline usually runs for 2 hours, but it is currently running for 10 hours. Find the bottleneck in the pipeline.
To identify the bottleneck in an Azure Pipeline that’s running for 10 hours instead of the usual 2 hours, you need to systematically analyze the pipeline’s execution. Here’s a step-by-step approach to pinpoint the issue (a small duration-comparison sketch follows this excerpt):

### 1. **Check Pipeline Logs and Execution Details**
- **Action**: Navigate to the Azure DevOps portal, open the pipeline run, and... Continue Reading →
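A minimal sketch of that first step's analysis: compare per-step durations between a healthy run and the slow run. The timing records here are hypothetical stand-ins for whatever you export from the run logs:

from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def minutes(start, end):
    # Duration of one step in minutes from ISO-style timestamps.
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# Hypothetical exported timings: step -> (start, end)
baseline = {"copy_raw": ("2024-05-01T00:00:00", "2024-05-01T00:40:00"),
            "transform": ("2024-05-01T00:40:00", "2024-05-01T02:00:00")}
slow_run = {"copy_raw": ("2024-05-02T00:00:00", "2024-05-02T00:45:00"),
            "transform": ("2024-05-02T00:45:00", "2024-05-02T10:00:00")}

# The step whose duration grew the most is the likely bottleneck.
for step in slow_run:
    base = minutes(*baseline[step])
    now = minutes(*slow_run[step])
    print(f"{step}: {now:.0f} min (baseline {base:.0f}, +{now - base:.0f})")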
How do you find the bottleneck in an Azure Data Factory pipeline that also includes Databricks notebooks and has multiple types of sources? What are the steps to follow?
To identify bottlenecks in an Azure Data Factory (ADF) pipeline that includes Databricks notebooks and multiple types of sources, you need to systematically monitor, analyze, and optimize the pipeline's components. Bottlenecks can arise from data ingestion, transformation logic, Databricks cluster performance, or pipeline orchestration (a partition-skew check for the Databricks stage is sketched after this excerpt). Below are the steps to diagnose and address bottlenecks, tailored to... Continue Reading →
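One concrete check for the Databricks-notebook stage: data skew across partitions is a frequent cause of a suddenly slow transformation, and counting rows per partition flags it quickly. A hedged sketch (`spark_partition_id` is part of `pyspark.sql.functions`; the DataFrame here is an illustrative stand-in for the slow stage's output):

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame the slow notebook stage produces.
df = spark.range(0, 1_000_000).repartition(8)

# Count rows per partition; a handful of huge partitions means skew.
sizes = df.withColumn("pid", spark_partition_id()).groupBy("pid").count()
sizes.orderBy("count", ascending=False).show()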