Write a solution in PySpark to find the average selling price for each product. average_price should be rounded to 2 decimal places. Solution: import datetime from pyspark.sql import SparkSession from pyspark.sql.functions import col, sum, round from pyspark.sql.types import StructType, StructField, IntegerType, DateType # Initialize Spark session spark = SparkSession.builder.appName("average_selling_price").getOrCreate() # Data for Prices and Units Sold prices_data = [(1, datetime.date(2019, 2, 17), datetime.date(2019,... Continue Reading →
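The full PySpark solution sits behind the link, but the core computation is a weighted average: sum(price × units) / sum(units) per product, rounded to 2 decimal places. A minimal plain-Python sketch of that logic (the data shapes and function name here are illustrative, not from the post, which joins a prices table with date ranges to a units-sold table):

```python
from collections import defaultdict

def average_selling_price(prices, units_sold):
    """Sketch of the weighted-average logic.

    prices: {(product_id, sale_date): unit_price}  -- simplified lookup;
            the original problem uses (start_date, end_date) price ranges.
    units_sold: list of (product_id, sale_date, units) tuples.
    Returns {product_id: average_price rounded to 2 decimals}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # product_id -> [revenue, units]
    for product_id, sale_date, units in units_sold:
        price = prices[(product_id, sale_date)]
        totals[product_id][0] += price * units
        totals[product_id][1] += units
    return {pid: round(revenue / qty, 2) for pid, (revenue, qty) in totals.items()}

prices = {(1, "2019-02-25"): 20, (1, "2019-03-01"): 5}
sales = [(1, "2019-02-25", 100), (1, "2019-03-01", 15)]
print(average_selling_price(prices, sales))  # {1: 18.04}
```

In PySpark the same idea would be a join, then a groupBy with round(sum(price * units) / sum(units), 2), matching the col/sum/round imports visible in the teaser.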
Pyspark Scenarios
Check out these 23 complete PySpark real-time scenario videos covering everything from partitioning data by month and year to handling complex JSON files and implementing multiprocessing in Azure Databricks. • Pyspark Scenarios 1: How to create partition by month and year in pyspark https://lnkd.in/dFfxYR_F • Pyspark Scenarios 2: How to read variable number of... Continue Reading →
Azure and Databricks Prep
Databricks and Pyspark are the most important skills in Data Engineering. Almost all companies are moving from Hadoop to Apache Spark. I have covered almost everything in my free YouTube playlist. There are 70 videos available for free. 0. Introduction: how to set up an account 1. How to read a CSV file in PySpark 2. How to... Continue Reading →
Partition Scenario with Pyspark
How to create partitions based on year and month? Data partitioning is critical to data processing performance, especially for large volumes of data in Spark. Most traditional databases default to the date format DD-MM-YYYY, but cloud storage (Spark Delta Lake / Databricks tables) uses the YYYY-MM-DD format. So here we will see how to... Continue Reading →
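The per-row transformation behind this scenario can be sketched in plain Python: parse the DD-MM-YYYY source date, normalize it, and derive the year/month values that a partitioned write would key on. Names here are illustrative; in PySpark the equivalent would be to_date(col(...), "dd-MM-yyyy") combined with the year() and month() functions, followed by a write with partitionBy("year", "month").

```python
from datetime import datetime

def partition_keys(date_str: str) -> dict:
    """Parse a DD-MM-YYYY date and derive year/month partition columns."""
    d = datetime.strptime(date_str, "%d-%m-%Y").date()
    return {
        "date": d.isoformat(),  # normalized to YYYY-MM-DD, as cloud tables expect
        "year": d.year,
        "month": d.month,
    }

print(partition_keys("17-02-2019"))
# {'date': '2019-02-17', 'year': 2019, 'month': 2}
```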
Incremental Loading with CDC using Pyspark
Incremental Loading technique with Change Data Capture (CDC): Incremental Load with Change Data Capture (CDC) is a strategy in data warehousing and ETL (Extract, Transform, Load) processes where only the changed or newly added data is loaded from source systems to the target system. CDC is particularly useful in scenarios where processing the... Continue Reading →
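The CDC upsert pattern this post describes can be sketched in a few lines of plain Python: apply only new or changed source rows to the target, keyed by id and ordered by a last-modified timestamp. The field names and the dict-based target are illustrative stand-ins for the warehouse table; a real PySpark implementation would typically use a Delta Lake MERGE.

```python
def apply_cdc(target: dict, changes: list) -> dict:
    """Toy CDC upsert. target: {id: row}; changes: rows with 'id' and 'updated_at'."""
    for row in changes:
        current = target.get(row["id"])
        # Upsert only when the incoming row is brand new or strictly newer;
        # stale change records are skipped.
        if current is None or row["updated_at"] > current["updated_at"]:
            target[row["id"]] = row
    return target

target = {1: {"id": 1, "updated_at": 1, "val": "a"}}
changes = [
    {"id": 1, "updated_at": 2, "val": "b"},  # update to an existing row
    {"id": 2, "updated_at": 1, "val": "c"},  # brand-new row
    {"id": 1, "updated_at": 0, "val": "z"},  # stale change, ignored
]
apply_cdc(target, changes)
```

Because only the changed rows are scanned, the incremental load stays proportional to the change volume rather than the full table size, which is the point of CDC.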
Dynamic Column handling in file
----------Spark Interview Questions---------- Important Note: this scenario is a bit complex; I would suggest going through it multiple times. (Code implementation is in #databricks.) How to handle, or how to read, a variable/dynamic number of columns? id,name,location,emaild,phone 1, aman 2,abhi,Delhi 3,john,chennai,sample123@gmail.com,688080 In this scenario we are getting incomplete columnar information that varies from row to row. pyspark code: =============== dbutils.fs.put("/dbfs/tmp/dynamic_columns.csv","""id,name,location,emaild,phone 1, aman 2,abhi,Delhi 3,john,chennai,sample123@gmail.com,688080""") now lets... Continue Reading →
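The padding idea behind this scenario can be shown in plain Python: rows with fewer fields than the header are filled with None so every record ends up with the same schema. This is a sketch of the concept using the stdlib csv module, not the post's Databricks implementation (which reads the file through Spark after writing it with dbutils.fs.put); the helper name is illustrative.

```python
import csv
import io

def read_ragged_csv(text: str) -> list:
    """Read CSV text where rows may have fewer columns than the header.

    Missing trailing fields are padded with None so every record
    matches the header schema.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return [
        dict(zip(header, row + [None] * (len(header) - len(row))))
        for row in body
    ]

sample = (
    "id,name,location,emaild,phone\n"
    "1,aman\n"
    "2,abhi,Delhi\n"
    "3,john,chennai,sample123@gmail.com,688080\n"
)
for record in read_ragged_csv(sample):
    print(record)
```

In Spark the analogous trick is to let the reader infer the widest schema (or define it explicitly with StructType) so short rows surface as nulls instead of failing the load.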
Spotify Cloud Project
Spotify Stream Analytics: built a synthetic data pipeline for real-time music insights, stunning dashboards, and actionable decisions. Project Overview: addresses limited access to Spotify stream data with a synthetic pipeline. Realistic events stream to Kafka, are processed by Spark, and are stored in Delta Lake. Airflow orchestrates the pipeline, and dbt transforms the data into captivating dashboards. Key Features: Streamlined Infrastructure: Scripts... Continue Reading →
Caching in Pyspark
Internals of Caching in PySpark: caching DataFrames in PySpark is a powerful technique to improve query performance. However, there's a subtle difference in how you can cache DataFrames in PySpark. cached_df = orders_df.cache() and orders_df.cache() are two common approaches, and they serve different purposes. The choice between these two depends on your specific use case and whether you... Continue Reading →
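One detail worth knowing up front: in PySpark, DataFrame.cache() returns the DataFrame itself, so both forms mark the same underlying plan for caching; the assignment mainly gives you a named handle. A toy stand-in class illustrates the "returns self" behaviour (this is a mock for illustration, not the real pyspark API, and orders_df is an assumed name from the post):

```python
class FakeDataFrame:
    """Toy stand-in that mimics how DataFrame.cache() returns the caller."""

    def __init__(self):
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self  # mirrors pyspark.sql.DataFrame.cache() returning the DataFrame

orders_df = FakeDataFrame()
cached_df = orders_df.cache()

# Both names point at the same object, and that one object is marked cached.
assert cached_df is orders_df
assert orders_df.is_cached
```

Note that in real PySpark, cache() is lazy either way: nothing is materialized until an action (count(), show(), ...) runs on the DataFrame.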
Big Data Learning Resources
Complete Plan to learn Big Data Step by Step (All Free resources Included) by Sumit Sir. 1. Learn SQL Basics - https://lnkd.in/g9NEJMVE SQL will be used in a lot of places - Hive/Spark SQL/RDBMS queries. Joins & windowing functions are very important. 2. Learn Programming/Python for Data Engineering - https://lnkd.in/gr6fFPdU Learn Python to the extent required for Data Engineers. 3. Learn... Continue Reading →
Interview Questions on Apache Spark and PySpark for Data Engineers
Set of 82 questions. 1. How is Apache Spark different from MapReduce? Spark processes data in batches as well as in real time, while MapReduce processes data in batches only. Spark runs almost 100 times faster than Hadoop MapReduce, which is slower when it comes to large-scale processing. Spark stores data in RAM, i.e. in-memory. So,... Continue Reading →