Cluster configuration

Data Engineer Interview – Thinking with Numbers 🧮

Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate: I don't guess. I calculate.

🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required: 1,024... Continue Reading →
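The excerpt's back-of-the-envelope arithmetic can be sketched in a few lines. The data volume and 128 MB target partition size are the post's stated figures; the cores-per-executor and tasks-per-core values below are illustrative rules of thumb, not numbers from the post:

```python
# Back-of-the-envelope Spark cluster sizing for 1 TB of input data.
TOTAL_DATA_MB = 1 * 1024 * 1024   # 1 TB ≈ 1,048,576 MB
PARTITION_SIZE_MB = 128           # target partition size

# Spark schedules one task per partition.
partitions = TOTAL_DATA_MB // PARTITION_SIZE_MB   # 8,192 partitions/tasks

# Illustrative executor layout (assumed, not from the post):
CORES_PER_EXECUTOR = 5            # common rule of thumb for HDFS/ADLS throughput
TASKS_PER_CORE = 4                # let each core work through ~4 task "waves"

cores_needed = partitions // TASKS_PER_CORE        # 2,048 cores
executors = cores_needed // CORES_PER_EXECUTOR     # ~409 executors
print(partitions, cores_needed, executors)
```

Real sizing would also account for transformation complexity, shuffle volume, and memory overhead, but this is the shape of the calculation the post walks through.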

How do you handle a 50 GB dataset in Spark?

What are the total numbers of cores and partitions? The total number of executors? The total memory required? Let's walk through how to estimate the resources needed when processing a 50 GB dataset in Apache Spark with the default partition size of 128 MB.

Convert Data to MB

Since Spark works with partition sizes in MB by default:

50 GB × 1,024 = 51,200 MB

Spark creates one task... Continue Reading →
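The estimate the excerpt begins can be carried through numerically. Only the 50 GB volume and 128 MB partition size come from the post; the cores-per-executor, waves, and memory multiplier are hedged rule-of-thumb assumptions for illustration:

```python
# Resource estimate for a 50 GB dataset with the default 128 MB partition size.
data_mb = 50 * 1024                    # 51,200 MB
partition_mb = 128
partitions = data_mb // partition_mb   # one task per partition -> 400 tasks

# Illustrative sizing assumptions (not from the post):
cores_per_executor = 4
tasks_per_core = 2                     # process the 400 tasks in ~2 waves
total_cores = partitions // tasks_per_core          # 200 cores
executors = total_cores // cores_per_executor       # 50 executors

# Rough memory rule of thumb: ~4x the partition size per concurrent task.
mem_per_core_mb = partition_mb * 4
executor_mem_mb = cores_per_executor * mem_per_core_mb   # 2,048 MB per executor
print(partitions, total_cores, executors, executor_mem_mb)
```

Changing the waves or cores-per-executor assumptions shifts the executor count proportionally; the partition count (400) is the only figure fixed by the data volume itself.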

Can we connect cloud-hosted Airflow to on-prem Informatica?

Yes, it is possible to connect a cloud-hosted Apache Airflow instance to an on-premises Informatica environment, but it requires careful configuration to bridge the cloud and on-premises environments. Below, I outline the key considerations and steps based on available information and general data integration practices.

### Key Considerations

1. **Network Connectivity**:
   - A secure network connection between... Continue Reading →

Databricks Interview Series

Below is a detailed response to your questions about Unity Catalog in Databricks, organized by the sections you provided. Each answer includes explanations, examples, and practical insights where applicable, aiming to provide a comprehensive understanding suitable for both foundational and advanced scenarios.

---

### Basic Understanding

#### 1. What is Unity Catalog in Databricks?

Unity Catalog is a unified... Continue Reading →

Walmart Interview

Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience:

---

### **Round 1: Technical Interview 1**

1. **Question**: Can you describe your role and responsibilities in your recent project?
   **Answer**: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →

Perfect ETL Pipeline on Azure Cloud

ETL Pipeline Implementation on Azure

This document outlines the creation of an end-to-end ETL pipeline on Microsoft Azure, utilizing Azure Data Factory for orchestration, Azure Databricks for transformation, Azure Data Lake Storage Gen2 for storage, Azure Synapse Analytics for data warehousing, and Power BI for visualization. The pipeline is designed to be scalable, secure, and efficient,... Continue Reading →

An Azure pipeline usually runs for 2 hours, but it is currently running for 10 hours. Find the bottleneck in the pipeline.

To identify the bottleneck in an Azure pipeline that's running for 10 hours instead of the usual 2 hours, you need to systematically analyze the pipeline's execution. Here's a step-by-step approach to pinpoint the issue:

### 1. **Check Pipeline Logs and Execution Details**
   - **Action**: Navigate to the Azure DevOps portal, open the pipeline run, and... Continue Reading →

How do you find the bottleneck in an Azure Data Factory pipeline that also includes Databricks notebooks and has multiple types of sources? What are the steps to follow?

To identify bottlenecks in an Azure Data Factory (ADF) pipeline that includes Databricks notebooks and multiple types of sources, you need to systematically monitor, analyze, and optimize the pipeline's components. Bottlenecks can arise from data ingestion, transformation logic, Databricks cluster performance, or pipeline orchestration. Below are the steps to diagnose and address bottlenecks, tailored to... Continue Reading →

Data migration from DB2 to Azure Data Lake Storage

Below is an example PySpark script to load data from a DB2 table into an Azure Data Lake table. The script is optimized for handling high-volume data efficiently by leveraging Spark's distributed computing capabilities.

Prerequisites:
- Spark Configuration: Ensure Spark is configured with the necessary dependencies:
  - spark-sql-connector for Azure Data Lake Gen2
  - db2jcc driver for connecting to DB2
- Azure Authentication:... Continue Reading →
