Data Engineer Interview – Thinking with Numbers 🧮 Interviewer: You need to process 1 TB of data in Spark. How do you decide the cluster size? Candidate: I don't guess. I calculate. 🟢 Step 1 | Understand the Data Volume • Total data = 1 TB → 1,024 GB • Target partition size = 128 MB • Total partitions required: 1,024... Continue Reading →
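The arithmetic in that excerpt can be sketched in a few lines. The 128 MB target partition size comes from the post; the cores-per-executor and wave counts below are illustrative assumptions, not a prescription.

```python
# Back-of-the-envelope Spark cluster sizing.
# The 128 MB partition target matches the post; cores_per_executor and
# waves are illustrative assumptions for the sketch.

def spark_sizing(total_gb, partition_mb=128, cores_per_executor=4, waves=2):
    total_mb = total_gb * 1024
    partitions = total_mb // partition_mb       # Spark runs one task per partition
    cores = -(-partitions // waves)             # finish all tasks in `waves` passes
    executors = -(-cores // cores_per_executor) # ceiling division
    return partitions, cores, executors

# 1 TB = 1,024 GB = 1,048,576 MB -> 8,192 partitions of 128 MB
partitions, cores, executors = spark_sizing(1024)
print(partitions, cores, executors)  # 8192 4096 1024
```

With different assumptions (more waves, larger executors) the core and executor counts shrink proportionally; only the partition count is fixed by the data volume.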
Python use cases
You don't need to learn Python more than this for a Data Engineering role ✅ List Comprehensions and Dict Comprehensions ↳ Optimize iteration with one-liners ↳ Fast filtering and transformations ↳ O(n) time complexity ✅ Lambda Functions ↳ Anonymous functions for concise operations ↳ Used in map(), filter(), and sort() ↳ Key for functional programming ✅ Functional Programming (map, filter, reduce) ↳ Apply transformations efficiently ↳... Continue Reading →
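A minimal sketch of the three bullet groups above (comprehensions, lambdas, and map/filter/reduce), using throwaway sample data:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# List comprehension: filter and transform in a single O(n) pass
even_squares = [n * n for n in nums if n % 2 == 0]   # [4, 16]

# Dict comprehension: build a lookup table in one line
square_map = {n: n * n for n in nums}                # {1: 1, 2: 4, ...}

# Lambda as a sort key: anonymous function passed to sorted()
words = sorted(["airflow", "dbt", "spark"], key=lambda w: len(w))

# map / filter / reduce for functional-style transformations
doubled = list(map(lambda n: n * 2, nums))           # [2, 4, 6, 8, 10]
odds = list(filter(lambda n: n % 2 == 1, nums))      # [1, 3, 5]
total = reduce(lambda acc, n: acc + n, nums)         # 15
```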
How do you handle a 50GB dataset in Spark
What is the total number of cores and partitions? The total number of executors? The total memory required? Let's walk through how to estimate the resources needed when processing a 50GB dataset in Apache Spark, assuming the default partition size of 128MB. Convert Data to MB: since Spark works with partition sizes in MB by default: 50 GB * 1024 = 51,200 MB. Spark creates one task... Continue Reading →
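A sketch of that estimate. The 50 GB size and 128 MB default come from the post; the 5-cores-per-executor figure and the 4-wave schedule are common rules of thumb used here as assumptions:

```python
def estimate_resources(total_gb=50, partition_mb=128,
                       cores_per_executor=5, waves=4):
    # cores_per_executor and waves are illustrative assumptions
    total_mb = total_gb * 1024                   # 50 GB * 1024 = 51,200 MB
    partitions = total_mb // partition_mb        # 51,200 / 128 = 400 partitions
    tasks = partitions                           # Spark: one task per partition
    cores = -(-tasks // waves)                   # 100 cores if tasks run in 4 waves
    executors = -(-cores // cores_per_executor)  # 20 executors at 5 cores each
    return partitions, cores, executors

print(estimate_resources())  # (400, 100, 20)
```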
25 blogs, 25 data engineering concepts
📌 25 blogs to guide you through every important concept 👇 1. Data Lake vs Data Warehouse → https://lnkd.in/gEpmTyMS 2. Delta Lake Architecture → https://lnkd.in/gk5x5uqR 3. Medallion Architecture → https://lnkd.in/gmyMpVpT 4. ETL vs ELT → https://lnkd.in/gvg3hgqe 5. Apache Airflow Basics → https://lnkd.in/gGwkvCXd 6. DAG Design Patterns → https://lnkd.in/gHTKQWyR 7. dbt Core Explained → https://lnkd.in/g5mQi8-y 8. Incremental Models in dbt → https://lnkd.in/gS25HCez 9. Spark Transformations vs Actions → https://lnkd.in/g2RRCGMW 10. Partitioning in Spark → https://lnkd.in/g5fXjSJD 11. Window Functions... Continue Reading →
Load data from a CSV file into a Trino table
To create a table in Trino and load data from a CSV file stored in Azure Data Lake Storage (ADLS), you'll use Trino's Hive connector to register the CSV file as a table. The Hive connector, backed by a Hive metastore, allows Trino to query files in ADLS. Below is a step-by-step guide to achieve... Continue Reading →
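The registration step looks roughly like the DDL built below. The catalog, schema, column names, and ADLS path are hypothetical placeholders; the `format = 'CSV'` and `external_location` table properties are the Hive connector's standard way to point a table at existing files, and the CSV format reads every column as varchar, so casting happens at query time.

```python
# Hypothetical names throughout: catalog "hive", schema "demo", the columns,
# and the ADLS path. Only the WITH-clause property names follow Trino's
# Hive connector conventions.

def csv_table_ddl(catalog="hive", schema="demo", table="sales_csv",
                  location="abfs://raw@myaccount.dfs.core.windows.net/sales/"):
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n"
        f"  order_id varchar,\n"
        f"  amount   varchar\n"   # CSV columns must be varchar; cast in queries
        f") WITH (\n"
        f"  format = 'CSV',\n"
        f"  external_location = '{location}'\n"
        f")"
    )

print(csv_table_ddl())
```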
Can we connect cloud Airflow to on-prem Informatica
Yes, it is possible to connect a cloud-hosted Apache Airflow instance to an on-premises Informatica environment, but it requires careful configuration to bridge the cloud and on-premises environments. Below, I outline the key considerations and steps based on available information and general data integration practices. ### Key Considerations 1. **Network Connectivity**: A secure network connection between... Continue Reading →
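Once network connectivity (VPN or private link) is in place, one common pattern is an Airflow task that triggers the on-prem Informatica job over HTTP. The sketch below builds such a request; the endpoint path and payload shape are assumptions for illustration, not the real Informatica API, and the host name is a placeholder.

```python
import json

# Sketch of a callable that a cloud Airflow PythonOperator could run to
# trigger an on-prem Informatica workflow over a VPN / private link.
# The endpoint path ("/api/v1/workflows/start"), payload shape, and host
# are hypothetical, not the real Informatica REST API.

def build_trigger_request(base_url, workflow_name):
    return {
        "url": f"{base_url}/api/v1/workflows/start",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"workflow": workflow_name}),
    }

req = build_trigger_request("https://informatica.internal.example:7443",
                            "wf_daily_load")
print(req["url"])
```

In a real DAG this callable would sit behind a PythonOperator (or the request would be issued by an HTTP/SSH operator), with credentials pulled from an Airflow connection rather than hard-coded.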
Pyspark SQL Cheatsheet
Here's a PySpark SQL cheatsheet covering common operations and concepts, designed as a quick reference for those working with PySpark DataFrames and SQL-like operations. PySpark SQL Cheatsheet 1. Initialization & Data Loading: `from pyspark.sql import SparkSession`, `from pyspark.sql.functions import *`, `from pyspark.sql.types import *`; initialize a SparkSession with `spark = SparkSession.builder.appName("PySparkSQLCheatsheet").getOrCreate()`; load data (e.g., CSV, Parquet) with df_csv... Continue Reading →
How to reflect data in trino catalog table using parquet file generated from databricks
To reflect data in a Trino catalog table using a Parquet file stored in an **Azure Blob Storage** container (generated from Databricks), follow these steps: 1. **Generate Parquet File in Databricks**: In Databricks, write your data to a Parquet file stored in an Azure Blob Storage container. Use the `abfss` protocol for Azure Data Lake... Continue Reading →
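The two halves of that flow can be sketched as follows: Databricks writes the Parquet directory, and Trino's Hive connector registers the same path via `external_location`. The container, account, schema, and column names below are hypothetical placeholders.

```python
# Hypothetical container/account/table/column names throughout.
# On the Databricks side, the write would look like:
#   df.write.mode("overwrite").parquet(path)
# Trino then registers that same directory with Hive-connector DDL:

def parquet_ddl(path, catalog="hive", schema="analytics", table="events"):
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n"
        f"  event_id bigint,\n"
        f"  event_ts timestamp\n"
        f") WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{path}'\n"
        f")"
    )

path = "abfss://data@myaccount.dfs.core.windows.net/exports/events/"
print(parquet_ddl(path))
```

Unlike the CSV case, Parquet files carry their own schema and types, so the column types in the DDL should match what Databricks wrote.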
Databricks Interview Series
Below is a detailed response to your questions about Unity Catalog in Databricks, organized by the sections you provided. Each answer includes explanations, examples, and practical insights where applicable, aiming to provide a comprehensive understanding suitable for both foundational and advanced scenarios. ### Basic Understanding #### 1. What is Unity Catalog in Databricks? Unity Catalog is a unified... Continue Reading →
Walmart Interview
Below is a comprehensive list of all questions and their corresponding answers from the Walmart interview experience: ### **Round 1: Technical Interview 1** 1. **Question**: Can you describe your role and responsibilities in your recent project? **Answer**: In my recent project, I was responsible for designing and implementing data pipelines using PySpark to process large datasets.... Continue Reading →