Cluster configuration

Data Engineer Interview – Thinking with Numbers 🧮

Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate: I don’t guess. I calculate.

🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required: 1,024... Continue Reading →
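The partition arithmetic the excerpt sets up can be checked in a few lines. Only the numbers stated above (1 TB, 128 MB target) are from the post; the executor sizing that follows in the full article is truncated here, so this sketch stops at the partition count:

```python
# Partition math from the stated inputs: 1 TB of data, 128 MB target partitions.
total_data_gb = 1024          # 1 TB ≈ 1,024 GB
partition_size_mb = 128       # target partition size

total_data_mb = total_data_gb * 1024        # 1,048,576 MB
num_partitions = total_data_mb // partition_size_mb
print(num_partitions)  # 8192 partitions, i.e. 8,192 Spark tasks
```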

Python use cases

You don't need to learn Python more than this for a Data Engineering role

➊ List Comprehensions and Dict Comprehensions
↳ Optimize iteration with one-liners
↳ Fast filtering and transformations
↳ O(n) time complexity

➋ Lambda Functions
↳ Anonymous functions for concise operations
↳ Used in map(), filter(), and sort()
↳ Key for functional programming

➌ Functional Programming (map, filter, reduce)
↳ Apply transformations efficiently
↳ ... Continue Reading →
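The first two items can be illustrated with a short, self-contained sketch; the sample records are invented for illustration, not from the post:

```python
# Sample data (hypothetical) to exercise comprehensions and lambdas.
events = [{"user": "a", "bytes": 120},
          {"user": "b", "bytes": 0},
          {"user": "c", "bytes": 64}]

# ➊ List/dict comprehensions: O(n) filtering and transformation in one line.
nonzero = [e for e in events if e["bytes"] > 0]          # fast filtering
bytes_by_user = {e["user"]: e["bytes"] for e in events}  # dict comprehension

# ➋ Lambda functions used with map(), filter(), and sorted().
doubled = list(map(lambda e: e["bytes"] * 2, events))
active = list(filter(lambda e: e["bytes"] > 0, events))
by_size = sorted(events, key=lambda e: e["bytes"])

print(len(nonzero), doubled, by_size[0]["user"])  # 2 [240, 0, 128] b
```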

How do you handle a 50 GB dataset in Spark?

What are the total numbers of cores and partitions? The total number of executors? The total memory required?

Let's walk through how to estimate the resources needed when processing a 50 GB dataset in Apache Spark, using the default partition size of 128 MB.

Convert Data to MB
Since Spark works with partition sizes in MB by default: 50 GB × 1,024 = 51,200 MB. Spark creates one task... Continue Reading →
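The conversion and partition count the excerpt sets up can be verified directly; the executor and memory sizing is truncated above, so only the stated numbers are computed here:

```python
# 50 GB dataset, default 128 MB partitions (spark.sql.files.maxPartitionBytes).
data_gb = 50
partition_size_mb = 128

data_mb = data_gb * 1024                      # 51,200 MB
num_partitions = data_mb // partition_size_mb
print(data_mb, num_partitions)  # 51200 400 -> 400 partitions, one task each
```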

25 blogs, 25 data engineering concepts

👇 25 blogs to guide you through every important concept 👇
1. Data Lake vs Data Warehouse → https://lnkd.in/gEpmTyMS
2. Delta Lake Architecture → https://lnkd.in/gk5x5uqR
3. Medallion Architecture → https://lnkd.in/gmyMpVpT
4. ETL vs ELT → https://lnkd.in/gvg3hgqe
5. Apache Airflow Basics → https://lnkd.in/gGwkvCXd
6. DAG Design Patterns → https://lnkd.in/gHTKQWyR
7. dbt Core Explained → https://lnkd.in/g5mQi8-y
8. Incremental Models in dbt → https://lnkd.in/gS25HCez
9. Spark Transformations vs Actions → https://lnkd.in/g2RRCGMW
10. Partitioning in Spark → https://lnkd.in/g5fXjSJD
11. Window Functions... Continue Reading →

Load data from CSV file into Trino Table

To create a table in Trino and load data from a CSV file stored in Azure Data Lake Storage (ADLS), you’ll use Trino’s Hive connector to register the CSV file as a table. The Hive connector, backed by a Hive metastore, allows Trino to query files in ADLS. Below is a step-by-step guide to achieve... Continue Reading →
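The step-by-step guide is truncated above, but the core of it is a single DDL statement. Below is a minimal sketch of that statement, with the catalog assumed to be named `hive` and the schema, table, column names, and ADLS path all placeholders invented for illustration (note that the Hive connector's CSV format reads every column as varchar):

```python
# Hypothetical Trino DDL registering a CSV file in ADLS as an external table
# via the Hive connector. Names and the external_location are placeholders.
create_table_sql = """
CREATE TABLE hive.staging.sales_csv (
    order_id varchar,
    amount   varchar
)
WITH (
    format = 'CSV',
    external_location = 'abfs://container@account.dfs.core.windows.net/sales/'
)
""".strip()

# A client such as the `trino` Python package would run this with
# cursor.execute(create_table_sql); printed here as the statement itself.
print(create_table_sql)
```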

Can we connect cloud-hosted Airflow to on-prem Informatica?

Yes, it is possible to connect a cloud-hosted Apache Airflow instance to an on-premises Informatica environment, but it requires careful configuration to bridge the cloud and on-premises environments. Below, I outline the key considerations and steps based on available information and general data integration practices.

Key Considerations
1. Network Connectivity: A secure network connection between... Continue Reading →

PySpark SQL Cheatsheet

Here's a PySpark SQL cheatsheet covering common operations and concepts. This is designed to be a quick reference for those working with PySpark DataFrames and SQL-like operations.

PySpark SQL Cheatsheet

1. Initialization & Data Loading

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("PySparkSQLCheatsheet") \
    .getOrCreate()

# Load Data (e.g., CSV, Parquet)
df_csv... Continue Reading →

Databricks Interview Series

Below is a detailed response to your questions about Unity Catalog in Databricks, organized by the sections you provided. Each answer includes explanations, examples, and practical insights where applicable, aiming to provide a comprehensive understanding suitable for both foundational and advanced scenarios.

Basic Understanding
1. What is Unity Catalog in Databricks?
Unity Catalog is a unified... Continue Reading →

Walmart Interview

Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience:

Round 1: Technical Interview 1
1. Question: Can you describe your role and responsibilities in your recent project?
Answer: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →
