Azure Data Engineer Journey Learning links

Start your Azure journey here.
1. Azure Data Factory: https://lnkd.in/gEmpbyrM (project: https://lnkd.in/gFG2aCgy)
2. Azure Databricks: https://lnkd.in/gvFwKxaN (project: https://lnkd.in/gFG2aCgy)
3. Azure Stream Analytics: https://lnkd.in/g35VbSTv
4. Azure Synapse Analytics: https://lnkd.in/gCufskNC
5. Azure Data Lake Storage: https://lnkd.in/gcEKjWsc
6. Azure SQL Database: https://lnkd.in/gmHxqxQX
7. Azure Database for PostgreSQL: https://lnkd.in/grHWJvWZ
8. Azure MariaDB: https://lnkd.in/gYSp7MZi
9. Azure Cosmos DB: https://lnkd.in/g6jPZA36
This is an excellent guide to becoming an Azure data engineer. There is no need to become an expert in every service, but learn how to work with... Continue Reading →

PySpark DataFrames Practice Questions with Answers

PySpark DataFrames provide a powerful and user-friendly API for working with structured and semi-structured data. In this article, we present a set of practice questions to help you reinforce your understanding of PySpark DataFrames and their operations.
Loading Data
Load the "sales_data.csv" file into a PySpark DataFrame. The CSV file contains the following columns: "transaction_id", "customer_id",... Continue Reading →

Data Scientist Roadmap

How I would relearn Data Science in 2024 to get a job:
Getting Started: ⬇️
- Data Science Intro: DataCamp
- Anaconda Setup: Anaconda Documentation
Programming:
- Python Basics: Real Python
- R Basics: R-bloggers
- SQL Fundamentals: SQLZoo
- Java for Data Science: Udemy - Java Programming and Software Engineering Fundamentals
Mathematics:... Continue Reading →

Partition Scenario with Pyspark

📕 How do you create partitions based on year and month? Data partitioning is critical to data-processing performance, especially when processing large volumes of data in Spark. Most traditional databases default to the DD-MM-YYYY date format, but cloud storage (Spark Delta Lake / Databricks tables) uses the YYYY-MM-DD format. So here we will see how to... Continue Reading →

Incremental Loading with CDC using Pyspark

⏫ Incremental Loading technique with Change Data Capture (CDC): ➡️ Incremental Load with Change Data Capture (CDC) is a strategy in data warehousing and ETL (Extract, Transform, Load) processes where only the changed or newly added data is loaded from source systems to the target system. CDC is particularly useful in scenarios where processing the... Continue Reading →

Dynamic Column handling in file

----------Spark Interview Questions-------------
📍 Important note: this scenario is a bit complex; I would suggest going through it multiple times. (The code implementation is in #databricks.)
📕 How do we handle, or read, a variable/dynamic number of columns?
id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
In this scenario we do not get complete columnar information; the number of columns varies from row to row.
PySpark code:
===============
dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv", """id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080""")
Now let's... Continue Reading →
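The post implements this in PySpark on Databricks; the core idea (pad every short row out to the header's width so all rows line up) can be sketched in plain Python with the article's sample data:

```python
import csv
import io

raw = """id,name,location,emaild,phone
1, aman
2,abhi,Delhi
3,john,chennai,sample123@gmail.com,688080
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)

# Pad each row with None until it matches the header's column count.
rows = [row + [None] * (len(header) - len(row)) for row in reader]

for r in rows:
    print(dict(zip(header, r)))
```

The same padding can be done in PySpark by reading the file as text, splitting on the delimiter, and selecting each position with a null default before applying the schema.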
