Cluster configuration

Data Engineer Interview – Thinking with Numbers 🧮

Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate: I don't guess. I calculate.

🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required: 1,024... Continue Reading →
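The excerpt's back-of-the-envelope arithmetic can be sketched in a few lines. The data volume and 128 MB target partition size are the post's stated figures; the cores-per-executor and tasks-per-core values below are illustrative rules of thumb, not numbers from the post:

```python
# Back-of-the-envelope Spark cluster sizing for 1 TB of input data.
TOTAL_DATA_MB = 1 * 1024 * 1024   # 1 TB ≈ 1,048,576 MB
PARTITION_SIZE_MB = 128           # target partition size

# Spark schedules one task per partition.
partitions = TOTAL_DATA_MB // PARTITION_SIZE_MB   # 8,192 partitions/tasks

# Illustrative executor layout (assumed, not from the post):
CORES_PER_EXECUTOR = 5            # common rule of thumb for HDFS/ADLS throughput
TASKS_PER_CORE = 4                # let each core work through ~4 task "waves"

cores_needed = partitions // TASKS_PER_CORE        # 2,048 cores
executors = cores_needed // CORES_PER_EXECUTOR     # ~409 executors
print(partitions, cores_needed, executors)
```

Real sizing would also account for transformation complexity, shuffle volume, and memory overhead, but this is the shape of the calculation the post walks through.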

How do you handle a 50 GB dataset in Spark?

What are the total numbers of cores and partitions? The total number of executors? The total memory required? Let's walk through how to estimate the resources needed when processing a 50 GB dataset in Apache Spark with the default partition size of 128 MB.

Convert Data to MB

Since Spark works with partition sizes in MB by default:

50 GB × 1,024 = 51,200 MB

Spark creates one task... Continue Reading →
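The estimate the excerpt begins can be carried through numerically. Only the 50 GB volume and 128 MB partition size come from the post; the cores-per-executor, waves, and memory multiplier are hedged rule-of-thumb assumptions for illustration:

```python
# Resource estimate for a 50 GB dataset with the default 128 MB partition size.
data_mb = 50 * 1024                    # 51,200 MB
partition_mb = 128
partitions = data_mb // partition_mb   # one task per partition -> 400 tasks

# Illustrative sizing assumptions (not from the post):
cores_per_executor = 4
tasks_per_core = 2                     # process the 400 tasks in ~2 waves
total_cores = partitions // tasks_per_core          # 200 cores
executors = total_cores // cores_per_executor       # 50 executors

# Rough memory rule of thumb: ~4x the partition size per concurrent task.
mem_per_core_mb = partition_mb * 4
executor_mem_mb = cores_per_executor * mem_per_core_mb   # 2,048 MB per executor
print(partitions, total_cores, executors, executor_mem_mb)
```

Changing the waves or cores-per-executor assumptions shifts the executor count proportionally; the partition count (400) is the only figure fixed by the data volume itself.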

Can we connect cloud-hosted Airflow to on-prem Informatica?

Yes, it is possible to connect a cloud-hosted Apache Airflow instance to an on-premises Informatica environment, but it requires careful configuration to bridge the cloud and on-premises environments. Below, I outline the key considerations and steps based on available information and general data integration practices.

### Key Considerations

1. **Network Connectivity**:
   - A secure network connection between... Continue Reading →

Databricks Interview Series

Below is a detailed response to your questions about Unity Catalog in Databricks, organized by the sections you provided. Each answer includes explanations, examples, and practical insights where applicable, aiming to provide a comprehensive understanding suitable for both foundational and advanced scenarios.

---

### Basic Understanding

#### 1. What is Unity Catalog in Databricks?

Unity Catalog is a unified... Continue Reading →

Walmart Interview

Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience:

---

### **Round 1: Technical Interview 1**

1. **Question**: Can you describe your role and responsibilities in your recent project?
   **Answer**: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →

Perfect ETL Pipeline on Azure Cloud

ETL Pipeline Implementation on Azure

This document outlines the creation of an end-to-end ETL pipeline on Microsoft Azure, utilizing Azure Data Factory for orchestration, Azure Databricks for transformation, Azure Data Lake Storage Gen2 for storage, Azure Synapse Analytics for data warehousing, and Power BI for visualization. The pipeline is designed to be scalable, secure, and efficient,... Continue Reading →

An Azure pipeline usually runs for 2 hours, but it is currently running for 10 hours. Find the bottleneck in the pipeline.

To identify the bottleneck in an Azure pipeline that's running for 10 hours instead of the usual 2 hours, you need to systematically analyze the pipeline's execution. Here's a step-by-step approach to pinpoint the issue:

### 1. **Check Pipeline Logs and Execution Details**
   - **Action**: Navigate to the Azure DevOps portal, open the pipeline run, and... Continue Reading →

How do you find the bottleneck in an Azure Data Factory pipeline that also includes Databricks notebooks and has multiple types of sources? What are the steps to follow?

To identify bottlenecks in an Azure Data Factory (ADF) pipeline that includes Databricks notebooks and multiple types of sources, you need to systematically monitor, analyze, and optimize the pipeline's components. Bottlenecks can arise from data ingestion, transformation logic, Databricks cluster performance, or pipeline orchestration. Below are the steps to diagnose and address bottlenecks, tailored to... Continue Reading →

Data migration from DB2 to Azure Data Lake Storage

Below is an example PySpark script to load data from a DB2 table into an Azure Data Lake table. The script is optimized for handling high-volume data efficiently by leveraging Spark's distributed computing capabilities.

Prerequisites:
- Spark Configuration: Ensure Spark is configured with the necessary dependencies:
  - spark-sql-connector for Azure Data Lake Gen2
  - db2jcc driver for connecting to DB2
- Azure Authentication:... Continue Reading →
