Data engineering is the backbone of the modern data-driven world. It’s the meticulous process of designing and building systems for collecting, storing, and analyzing data at scale. However, finding comprehensive projects and courses that are also free can be a challenge. To bridge this gap, I’ve created a list of five end-to-end data engineering courses... Continue Reading →
PySpark UDF
#PySpark_UDF_with_the_help_of_an_example 👉 One of the most important features of Spark SQL & DataFrame is the PySpark UDF (i.e., User Defined Function), which is used to extend PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in conventional databases. ✍ We write a Python function and wrap it in PySpark SQL udf() or register it as a udf and... Continue Reading →
Delete Duplicates in a PySpark DataFrame
#Scenario There are two ways to handle row duplication in PySpark DataFrames. The distinct() function in PySpark drops duplicate rows (considering all columns) from a DataFrame, while dropDuplicates() drops duplicates based on one or more specific columns. Here’s an example showing how to use the distinct() and dropDuplicates() methods. First, we need... Continue Reading →
Covid-19 Data Analysis | End-To-End Data Engineering Project
Description: In this project, I undertook a comprehensive data engineering journey focused on COVID-19 data, leveraging AWS services to create a powerful data infrastructure. My goal was to make the COVID-19 data accessible, understandable, and valuable for analysis. Key Steps: Data Collection and Storage: I started by downloading COVID-19 datasets from the Registry of Open Data on AWS and... Continue Reading →
Big Data Pro Resources
#Resources referred by me for Big Data technologies. These resources are available for free on YouTube; they helped me crack CISCO, and can help you crack product-based companies too.
1. Hadoop, Sqoop and Hive concepts by Saif Shaik: https://lnkd.in/ewyYweTJ
2. PySpark concepts in depth by Karunakar Goud: https://lnkd.in/eNtFkxmd
3. Another useful Spark playlist: Raja's Data Engineering channel: https://lnkd.in/eqiy7dBS
4. Hadoop and Kafka... Continue Reading →
Crack The Spark
🚀 Data Engineer Interview Experience 📢 Apache Spark ⌛
How "Executor Out Of Memory" can be explained in a step-by-step manner 👉🏽 https://lnkd.in/gPsrw9Wp
How "Salting" can be explained in a step-by-step manner 👉🏽 https://lnkd.in/gUQUPj8x
How "Data Locality in Spark" can be explained in a step-by-step manner 👉🏽 https://lnkd.in/gcQ_CJZs
How "Garbage Collection (GC) Tuning" can be explained in a step-by-step manner 👉🏽 https://lnkd.in/gY5CQM9c
How "Predicate... Continue Reading →
Top GitHub Repositories
Top GitHub repositories which would be really helpful for job preparation, upskilling and much more 💫
- Free Programming Books: https://lnkd.in/gbSk9NRr
- System Design: https://lnkd.in/graSZG3P, https://lnkd.in/gykTqH6k
- Project Based Learning: https://lnkd.in/gjewtywD
- Coding Interview: https://lnkd.in/ge7e7gyh
- Resources for Preparation of Placements: https://lnkd.in/d6zpHj4P
- Data Science: https://lnkd.in/gbnGnGRD
- Projects: https://lnkd.in/gNvjU9jr
- Roadmaps: https://lnkd.in/gYNSH-dc
- JavaScript: ... Continue Reading →
AWS Training
AWS just launched a new free training, along with a digital badge, "Migrations Foundations". This means you can now earn up to 12 AWS digital badges at no cost! To secure this badge, simply enroll in the free course and score 80% or higher on the final assessment. It's a fantastic opportunity to demonstrate your cloud... Continue Reading →
Types of Dimensions in DWH
Types of Dimensions in Dimensional Data Modelling 👉 There are 9 types of Dimensions to know when dealing with Dimensional Data Modelling. They are given below:
🔹 Conformed Dimension
🔹 Outrigger Dimension
🔹 Shrunken Dimension
🔹 Role-Playing Dimension
🔹 Dimension-to-Dimension Table
🔹 Junk Dimension
🔹 Degenerate Dimension
🔹 Swappable Dimension
🔹 Step Dimension
✅ Conformed Dimension: A Conformed Dimension is a type of Dimension that has the same meaning to all the Facts it relates to.... Continue Reading →
Spark & Hadoop Beginner to Advanced Questions
1. What are the different cluster managers provided by Apache Spark? Three different cluster managers are available in Apache Spark: Standalone Cluster Manager: a simple cluster manager responsible for allocating resources based on application requirements; it is resilient in that it can handle task failures.... Continue Reading →