Data Engineer Interview – Thinking with Numbers 🧮

Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate: I don’t guess. I calculate.

🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required: 1,024... Continue Reading →
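A minimal sketch of the candidate's arithmetic, assuming the 128 MB partition target from the excerpt; the 4-cores-per-executor figure is an assumed rule of thumb, not from the post:

# Back-of-envelope Spark cluster sizing (sketch, not a definitive formula).
TOTAL_DATA_GB = 1024        # 1 TB ≈ 1,024 GB
PARTITION_MB = 128          # target partition size from the excerpt
CORES_PER_EXECUTOR = 4      # assumption: one partition per core at a time

total_mb = TOTAL_DATA_GB * 1024
partitions = total_mb // PARTITION_MB                      # 8,192 partitions
executors_one_wave = -(-partitions // CORES_PER_EXECUTOR)  # ceiling division

# In practice you provision far fewer executors and process in waves,
# e.g. 64 executors * 4 cores = 256 tasks per wave -> 32 waves.
print(partitions, executors_one_wave)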
25 blogs, 25 data engineering concepts
👇 25 blogs to guide you through every important concept 👇
1. Data Lake vs Data Warehouse → https://lnkd.in/gEpmTyMS
2. Delta Lake Architecture → https://lnkd.in/gk5x5uqR
3. Medallion Architecture → https://lnkd.in/gmyMpVpT
4. ETL vs ELT → https://lnkd.in/gvg3hgqe
5. Apache Airflow Basics → https://lnkd.in/gGwkvCXd
6. DAG Design Patterns → https://lnkd.in/gHTKQWyR
7. dbt Core Explained → https://lnkd.in/g5mQi8-y
8. Incremental Models in dbt → https://lnkd.in/gS25HCez
9. Spark Transformations vs Actions → https://lnkd.in/g2RRCGMW
10. Partitioning in Spark → https://lnkd.in/g5fXjSJD
11. Window Functions... Continue Reading →
PySpark SQL Cheatsheet
Here's a PySpark SQL cheatsheet, covering common operations and concepts. This is designed to be a quick reference for those working with PySpark DataFrames and SQL-like operations.

PySpark SQL Cheatsheet

1. Initialization & Data Loading

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("PySparkSQLCheatsheet") \
    .getOrCreate()

# Load Data (e.g., CSV, Parquet)
df_csv... Continue Reading →
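The excerpt cuts off at the CSV load; here is a plausible continuation of that section under the same naming (`df_csv` comes from the excerpt, while the file paths, view name, and columns are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkSQLCheatsheet").getOrCreate()

# Load data (paths are illustrative)
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("data/people.parquet")

# Register a temp view so SQL can be used directly
df_csv.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()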
How to reflect data in a Trino catalog table using a Parquet file generated from Databricks
To reflect data in a Trino catalog table using a Parquet file stored in an **Azure Blob Storage** container (generated from Databricks), follow these steps:

1. **Generate Parquet File in Databricks**: In Databricks, write your data to a Parquet file stored in an Azure Blob Storage container. Use the `abfss` protocol for Azure Data Lake... Continue Reading →
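A sketch of both halves of this flow. The storage account, container, schema, and column names are assumptions, and the DDL assumes Trino's Hive connector (where `external_location` and `format` are standard table properties):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["event_id", "payload"])

# Databricks side: write Parquet to ADLS Gen2 (names are illustrative).
path = "abfss://mycontainer@myaccount.dfs.core.windows.net/tables/events"
df.write.mode("overwrite").parquet(path)

# Trino side (Hive connector): an external table over that location.
# Run this in Trino, not Spark; shown here as a string for reference.
trino_ddl = """
CREATE TABLE hive.analytics.events (
    event_id BIGINT,
    payload  VARCHAR
)
WITH (
    external_location = 'abfss://mycontainer@myaccount.dfs.core.windows.net/tables/events',
    format = 'PARQUET'
)
"""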
Walmart Interview
Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience:

---

### **Round 1: Technical Interview 1**

1. **Question**: Can you describe your role and responsibilities in your recent project?
   **Answer**: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →
Cloud Operations Architecture Interview Questions
Provide detailed answers with scenarios for the questions below (a sketch of the Cost Explorer scenario follows this excerpt).

Cloud Operations Architecture Interview Questions:

1. How would you implement Infrastructure as Code (IaC) in a cloud environment?
   Scenario: Using Terraform to manage AWS resources, enabling version control and reusable configurations.

2. Describe your approach to cost optimization in cloud solutions.
   Scenario: Using AWS Cost Explorer to identify underutilized resources and implement... Continue Reading →
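For the cost-optimization scenario in question 2, a hedged sketch using boto3's `get_cost_and_usage` call; the date range and grouping are illustrative, and configured AWS credentials with Cost Explorer access are assumed:

import boto3

ce = boto3.client("ce")  # Cost Explorer client; assumes credentials are configured

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Rank services by spend to surface right-sizing candidates.
groups = response["ResultsByTime"][0]["Groups"]
for g in sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True):
    print(g["Keys"][0], g["Metrics"]["UnblendedCost"]["Amount"])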
Big Data Engineering Interview Series – 2
**Big Data Interview Questions - Detailed Answers**

Below are detailed answers to the questions from the interview discussion, focusing on Cloud Data Engineering, Azure, Spark, SQL, and Python. Each answer is comprehensive, addressing the concepts, their applications, and practical considerations, without timestamps.

---

1. **Project Discussion**
   In a Cloud Data Engineering interview, the project discussion requires explaining... Continue Reading →
Big Data Engineering Interview Series – 1
**Top Big Data Interview Questions (2024) - Detailed Answers**

1. **What is Hadoop and how does it work?**
   Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It consists of two main components: Hadoop Distributed File System (HDFS) for fault-tolerant storage, which splits data into blocks... Continue Reading →
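To make the block-splitting point concrete, a small sketch of HDFS storage math; the 128 MB block size and replication factor 3 are common defaults but are assumptions here, since the actual values come from cluster configuration:

import math

BLOCK_SIZE_MB = 128   # assumed dfs.blocksize default
REPLICATION = 3       # assumed dfs.replication default

file_size_mb = 1000   # a ~1 GB file, illustrative

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # 8 blocks
raw_storage_mb = file_size_mb * REPLICATION        # ~3,000 MB on disk

print(f"{blocks} blocks, ~{raw_storage_mb} MB raw storage across the cluster")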
An Azure pipeline usually runs for 2 hours, but it is currently running for 10 hours. Find the bottleneck in the pipeline.
To identify the bottleneck in an Azure Pipeline that’s running for 10 hours instead of the usual 2 hours, you need to systematically analyze the pipeline’s execution. Here’s a step-by-step approach to pinpoint the issue (a small duration-comparison sketch follows this excerpt):

### 1. **Check Pipeline Logs and Execution Details**
- **Action**: Navigate to the Azure DevOps portal, open the pipeline run, and... Continue Reading →
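A minimal sketch of that first step's analysis: compare per-step durations between a healthy run and the slow run. The timing records here are hypothetical stand-ins for whatever you export from the run logs:

from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def minutes(start, end):
    # Duration of one step in minutes from ISO-style timestamps.
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# Hypothetical exported timings: step -> (start, end)
baseline = {"copy_raw": ("2024-05-01T00:00:00", "2024-05-01T00:40:00"),
            "transform": ("2024-05-01T00:40:00", "2024-05-01T02:00:00")}
slow_run = {"copy_raw": ("2024-05-02T00:00:00", "2024-05-02T00:45:00"),
            "transform": ("2024-05-02T00:45:00", "2024-05-02T10:00:00")}

# The step whose duration grew the most is the likely bottleneck.
for step in slow_run:
    base = minutes(*baseline[step])
    now = minutes(*slow_run[step])
    print(f"{step}: {now:.0f} min (baseline {base:.0f}, +{now - base:.0f})")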
How do you find the bottleneck in an Azure Data Factory pipeline that also includes Databricks notebooks and has multiple types of sources? What are the steps to follow?
To identify bottlenecks in an Azure Data Factory (ADF) pipeline that includes Databricks notebooks and multiple types of sources, you need to systematically monitor, analyze, and optimize the pipeline's components. Bottlenecks can arise from data ingestion, transformation logic, Databricks cluster performance, or pipeline orchestration (a partition-skew check for the Databricks stage is sketched after this excerpt). Below are the steps to diagnose and address bottlenecks, tailored to... Continue Reading →
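One concrete check for the Databricks-notebook stage: data skew across partitions is a frequent cause of a suddenly slow transformation, and counting rows per partition flags it quickly. A hedged sketch (`spark_partition_id` is part of `pyspark.sql.functions`; the DataFrame here is an illustrative stand-in for the slow stage's output):

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame the slow notebook stage produces.
df = spark.range(0, 1_000_000).repartition(8)

# Count rows per partition; a handful of huge partitions means skew.
sizes = df.withColumn("pid", spark_partition_id()).groupBy("pid").count()
sizes.orderBy("count", ascending=False).show()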