What are the total number of cores and partitions? The total number of executors? The total memory required? Let's walk through how to estimate the resources needed when processing a 50 GB dataset in Apache Spark, given the default partition size of 128 MB. First, convert the data to MB, since Spark works with partition sizes in MB by default: 50 GB × 1024 = 51,200 MB. Spark creates one task... Continue Reading →
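The sizing arithmetic the post walks through can be sketched in a few lines. This is a minimal sketch, assuming the common rules of thumb of 5 cores per executor and roughly 4× the partition size of memory per core; both figures are assumptions for illustration, not values from the post:

```python
import math

data_mb = 50 * 1024              # 50 GB expressed in MB -> 51,200 MB
partition_mb = 128               # Spark's default partition size

# One partition (and therefore one task) per 128 MB chunk
partitions = math.ceil(data_mb / partition_mb)

# Assumption: 5 cores per executor (a common sizing rule of thumb)
cores_per_executor = 5
# To process every partition in a single wave, allow one core per task
total_cores = partitions
executors = math.ceil(total_cores / cores_per_executor)

# Assumption: ~4x the partition size of memory per core
memory_per_core_mb = 4 * partition_mb
executor_memory_mb = cores_per_executor * memory_per_core_mb

print(partitions)          # 400 partitions / tasks
print(executors)           # 80 executors
print(executor_memory_mb)  # 2560 MB (~2.5 GB) per executor
```

With different per-executor core counts or memory multipliers, the executor count and memory change accordingly; the partition count of 400 follows directly from 51,200 MB / 128 MB.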
25 blogs, 25 data engineering concepts
👇 25 blogs to guide you through every important concept 👇
1. Data Lake vs Data Warehouse → https://lnkd.in/gEpmTyMS
2. Delta Lake Architecture → https://lnkd.in/gk5x5uqR
3. Medallion Architecture → https://lnkd.in/gmyMpVpT
4. ETL vs ELT → https://lnkd.in/gvg3hgqe
5. Apache Airflow Basics → https://lnkd.in/gGwkvCXd
6. DAG Design Patterns → https://lnkd.in/gHTKQWyR
7. dbt Core Explained → https://lnkd.in/g5mQi8-y
8. Incremental Models in dbt → https://lnkd.in/gS25HCez
9. Spark Transformations vs Actions → https://lnkd.in/g2RRCGMW
10. Partitioning in Spark → https://lnkd.in/g5fXjSJD
11. Window Functions... Continue Reading →
Azure devops intermediate level questions
Below is a curated list of intermediate-level Azure DevOps questions that focus on practical knowledge, technical understanding, and scenario-based problem-solving. These questions are designed to assess a candidate’s ability to implement and manage Azure DevOps tools and processes effectively, suitable for professionals with some experience in DevOps practices. Each question includes a brief explanation or... Continue Reading →
Big Data Engineering Interview series-1
**Top Big Data Interview Questions (2024) - Detailed Answers**
1. **What is Hadoop and how does it work?** Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It consists of two main components: the Hadoop Distributed File System (HDFS) for fault-tolerant storage, which splits data into blocks... Continue Reading →
Perfect ETL Pipeline on Azure Cloud
ETL Pipeline Implementation on Azure. This document outlines the creation of an end-to-end ETL pipeline on Microsoft Azure, utilizing Azure Data Factory for orchestration, Azure Databricks for transformation, Azure Data Lake Storage Gen2 for storage, Azure Synapse Analytics for data warehousing, and Power BI for visualization. The pipeline is designed to be scalable, secure, and efficient,... Continue Reading →
Data migration from DB2 to Azure Data Lake Storage
Below is an example PySpark script to load data from a DB2 table into an Azure Data Lake table. The script is optimized for handling high-volume data efficiently by leveraging Spark's distributed computing capabilities.
Prerequisites:
- Spark configuration: ensure Spark is configured with the necessary dependencies: the spark-sql-connector for Azure Data Lake Gen2, and the db2jcc driver for connecting to DB2.
- Azure authentication:... Continue Reading →
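Since the excerpt cuts off before the script itself, here is a minimal sketch of what such a DB2-to-ADLS load typically looks like. The hostname, storage account, table name, partition column, and bounds are all hypothetical placeholders; the JDBC partitioning options are the standard Spark mechanism for parallelizing a high-volume read:

```python
def build_jdbc_url(host: str, port: int, database: str) -> str:
    """Assemble a DB2 JDBC URL (pure string work, testable without Spark)."""
    return f"jdbc:db2://{host}:{port}/{database}"


def load_db2_to_adls():
    """Sketch: read a DB2 table over JDBC in parallel, write Parquet to ADLS Gen2.

    Requires pyspark, the db2jcc JDBC jar, and the hadoop-azure connector
    on the classpath; none of the identifiers below are real endpoints.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("db2-to-adls")
        # Hypothetical storage-account-key auth for ADLS Gen2
        .config("fs.azure.account.key.mystorageacct.dfs.core.windows.net",
                "<storage-key>")
        .getOrCreate()
    )

    df = (
        spark.read.format("jdbc")
        .option("url", build_jdbc_url("db2host.example.com", 50000, "SALESDB"))
        .option("dbtable", "SCHEMA1.ORDERS")          # hypothetical source table
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("numPartitions", 8)                   # parallel JDBC reads
        .option("partitionColumn", "ORDER_ID")        # hypothetical numeric key
        .option("lowerBound", 1)
        .option("upperBound", 10_000_000)
        .load()
    )

    df.write.mode("overwrite").parquet(
        "abfss://raw@mystorageacct.dfs.core.windows.net/orders/"
    )
```

The `numPartitions` / `partitionColumn` / bounds options are what make the JDBC read distributed rather than a single-threaded pull, which is the key to the "high-volume" efficiency the post mentions.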
Low Level System design articles
These articles will save you 50+ hours of hopping between resources and wasting time.
1) Scalability: https://lnkd.in/gq4hW9qx
2) Horizontal vs Vertical Scaling: https://lnkd.in/g8qcwRCy
3) Latency vs Throughput: https://lnkd.in/gDAx6QQd
4) Load Balancing: https://lnkd.in/gefSiXEJ
5) Caching: https://lnkd.in/gAp-9udf
6) ACID Transactions: https://lnkd.in/g-sjsMwX
7) SQL vs NoSQL: https://lnkd.in/gwCe58TU
8) Database Indexes: https://lnkd.in/gE_q5m_g
9) Database Sharding: https://lnkd.in/gFdNxDrU
10) Content Delivery... Continue Reading →
Google Cloud Dataprep vs Google Cloud Data Fusion
Google Cloud Dataprep and Google Cloud Data Fusion are two different data integration services offered by Google Cloud. Here are some key differences between the two:
Purpose: Google Cloud Dataprep is a visual data preparation service that allows users to clean, transform, and prepare data for analysis without writing code. Google Cloud Data Fusion, on the other... Continue Reading →
Data Engineering with Cloud Resources link
Learn here about data pipelines for FREE... A data pipeline consists of several stages that work together to ensure that data is processed efficiently and accurately. It involves:
1. Data ingestion
2. Data transformation
3. Data analysis
4. Data visualisation
5. Data storage
📌 The complete data pipeline diagram can be found here: https://lnkd.in/gdifVyHY
📌 FREE guide to data pipelines in AWS and Azure cloud: https://lnkd.in/gtq_8rd9
📌 Learn more... Continue Reading →
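The five stages above can be sketched as a toy pipeline. The stage functions and the sample records are illustrative assumptions (not from the post), just to show how data flows from ingestion through to storage:

```python
def ingest():
    # Stage 1: data ingestion - pull raw records (here, a hard-coded sample)
    return [{"city": "Pune", "temp_c": 31}, {"city": "Delhi", "temp_c": 39}]

def transform(records):
    # Stage 2: data transformation - derive a Fahrenheit field per record
    return [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in records]

def analyse(records):
    # Stage 3: data analysis - compute a simple aggregate
    return sum(r["temp_c"] for r in records) / len(records)

def visualise(avg_c):
    # Stage 4: data visualisation - a text stand-in for a real chart
    return f"Average temperature: {avg_c:.1f} C"

def store(records, sink):
    # Stage 5: data storage - append to an in-memory "sink" standing in for a table
    sink.extend(records)
    return sink

sink = []
clean = transform(ingest())
summary = visualise(analyse(clean))
store(clean, sink)
print(summary)  # Average temperature: 35.0 C
```

In a real pipeline each stage would be a separate service or job (e.g. an ingestion tool, a Spark job, a warehouse, a BI dashboard), but the hand-off pattern between stages is the same.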
GCP ZERO TO HERO
Do you have the knowledge and skills to design a mobile gaming analytics platform that collects, stores, and analyzes large amounts of bulk and real-time data? Well, after reading this article, you will. I aim to take you from zero to hero in Google Cloud Platform (GCP) in just one article. I will show you... Continue Reading →