GitHub Repos for Developers that will reveal thousands of free resources.
1. The Algorithms: https://lnkd.in/dpzAd_vE
2. freeCodeCamp: https://lnkd.in/diBh4dVy
3. Freely available programming books: https://lnkd.in/d2bwBmU9
4. 100 Days of ML Coding: https://lnkd.in/dz8dDr9U
5. Project-based tutorials: https://lnkd.in/dSiiKHXK
6. Public APIs: https://lnkd.in/dvGamaUM
7. Coding Interview University: https://lnkd.in/dhY5pCxH
8. Developer Roadmap: https://lnkd.in/dJ4wAG2B
9. Computer Science: https://lnkd.in/d2uFXzPz
10. 30 Seconds of Code: https://lnkd.in/dwDNk_VX
11. ... Continue Reading →
Learn Apache Spark Step by Step
Learn Apache Spark Step by Step (Follow the Sequence)
1. Getting started with Apache Spark: https://lnkd.in/gFRpe3-D
2. A quick introduction to the Spark API: https://lnkd.in/g8Y3tdhX
3. Overview of Spark - RDDs, accumulators, broadcast variables: https://lnkd.in/g7fepuFF
4. Spark SQL, Datasets, and DataFrames: https://lnkd.in/g3iZp7zk
5. PySpark - Processing data with Spark in Python: https://lnkd.in/gBnh6PAi
6. Processing data with SQL on the command line: https://lnkd.in/ggnxDaUu
7. Cluster Overview: https://lnkd.in/guCQnJnv
8. Packaging and deploying... Continue Reading →
Databricks Lakehouse Fundamentals
You can try the free Databricks Lakehouse Fundamentals recorded videos and certification. The link is below: https://lnkd.in/gXx2GUH8
#lakehouse #databricks
Basic to medium #Python (pandas) interview questions for an entry-level data analyst role
1. What are the differences between lists and tuples in Python, and how does this distinction relate to Pandas operations?
2. What is a DataFrame in Pandas, and how does it differ from a Series?
3. Can you explain how to handle missing data in Pandas, including the difference between 'fillna()' and 'dropna()'?
4. Describe the process of... Continue Reading →
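A quick sketch of questions 2 and 3, using a small made-up frame (the column names and values are purely illustrative):

```python
import pandas as pd

# A Series is a single labeled column; a DataFrame is a 2-D table
# whose columns are each a Series.
ages = pd.Series([20, None, 23], name="Age")
df = pd.DataFrame({"Name": ["S1", "S2", "S3"], "Age": [20, None, 23]})

# fillna() keeps every row and replaces missing values with a default.
filled = df["Age"].fillna(0)

# dropna() removes the rows containing missing values instead.
dropped = df.dropna(subset=["Age"])

print(len(filled), len(dropped))  # 3 2
```

The choice between the two is the usual interview follow-up: fillna() preserves row count at the cost of inventing values, while dropna() preserves fidelity at the cost of data loss.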
Data Engineering Blogs
75 engineering blogs worth reading to improve your system design:
High Scalability: https://lnkd.in/eQ4eDw4E
Engineering at Meta: https://lnkd.in/e8tiSkEv
AWS Architecture Blog: https://lnkd.in/eEchKJif
All Things Distributed: https://lnkd.in/emXaQDaS
The Netflix Tech Blog: https://lnkd.in/efPuR39b
LinkedIn Engineering Blog: https://lnkd.in/ehaePQth
Uber Engineering Blog: https://eng.uber.com/
Engineering at Quora: https://lnkd.in/em-WkhJd
Pinterest Engineering: https://lnkd.in/esBTntjq
Lyft Engineering Blog: https://eng.lyft.com/
Twitter Engineering Blog: https://lnkd.in/evMFNhEs
Dropbox Engineering Blog: https://dropbox.tech/
... Continue Reading →
Insert, Update and Delete in PySpark
Here's the scenario: we had two tables, Table_A and Table_B, each containing a "Name" and an "Age" column.

Table_A:
Name | Age
-----------
S1   | 20
S2   | 23

Table_B:
Name | Age
-----------
S1   | 22
S4   | 27

Our mission was to determine the differences between these tables and label each row with an action: Update, Delete, or Insert. And here's the solution we came up... Continue Reading →
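The excerpt cuts off before the PySpark code, but the underlying logic is a full outer join on Name: names in both tables with a changed Age become Update, names only in Table_A become Delete, and names only in Table_B become Insert. A minimal plain-Python sketch of that classification (a PySpark version would use DataFrame.join with how="full"):

```python
table_a = {"S1": 20, "S2": 23}  # Name -> Age in the existing table
table_b = {"S1": 22, "S4": 27}  # Name -> Age in the incoming table

actions = {}
for name in sorted(table_a.keys() | table_b.keys()):
    if name in table_a and name in table_b:
        if table_a[name] != table_b[name]:
            actions[name] = "Update"   # present in both, value changed
    elif name in table_a:
        actions[name] = "Delete"       # only in the existing table
    else:
        actions[name] = "Insert"       # only in the incoming table

print(actions)  # {'S1': 'Update', 'S2': 'Delete', 'S4': 'Insert'}
```

Matching the scenario above: S1 changed (20 → 22), S2 disappeared, and S4 is new.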
How to Build an Event-Driven Serverless ETL Pipeline on AWS
ETL => Extract | Transform | Load

An event-driven serverless ETL pipeline is a data processing architecture used to process large amounts of data in real time. Data is processed as soon as it is generated, rather than being stored and processed later, which allows for faster processing and more efficient use of resources. Here are the... Continue Reading →
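On AWS, the usual entry point for such a pipeline is a Lambda function triggered by an S3 ObjectCreated event. A minimal sketch of the handler (the bucket and key names are hypothetical, and a real version would fetch and transform the object with boto3):

```python
import json

def handler(event, context):
    """Triggered by an S3 event notification; extracts what to process."""
    jobs = []
    for record in event.get("Records", []):
        # The S3 notification nests bucket and object info per record.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Extract step: a real pipeline would fetch the object here, e.g.
        # boto3.client("s3").get_object(Bucket=bucket, Key=key)
        jobs.append({"bucket": bucket, "key": key})
    return {"statusCode": 200, "body": json.dumps(jobs)}

# A sample event shaped like the notification AWS delivers:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"},
                "object": {"key": "2024/orders.csv"}}}
    ]
}
print(handler(sample_event, None)["body"])
```

Because the function only runs when an object lands, there is no idle compute: the "serverless" and "event-driven" parts of the architecture come for free from the trigger.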
FREE DATA ENGINEERING COURSES ON CLOUD
Data engineering is the backbone of the modern data-driven world. It's the meticulous process of designing and building systems for collecting, storing, and analyzing data at scale. However, finding comprehensive projects and courses that are also free can be a challenge. To bridge this gap, I've created a list of five end-to-end data engineering courses... Continue Reading →
PySpark UDF
#PySpark_UDF_with_the_help_of_an_example

The most important extension point of Spark SQL & DataFrame is the PySpark UDF (User Defined Function), which is used to expand PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in conventional databases. We write a Python function and wrap it in PySpark SQL udf() or register it as a udf and... Continue Reading →
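The wrapping pattern described above, sketched without a live Spark session (the real API is pyspark.sql.functions.udf; convert_case and the sample column are made up for illustration):

```python
# The plain Python function — this is the part you write for a UDF.
def convert_case(s):
    return s.title() if s is not None else None

# In PySpark you would wrap it, declaring the return type:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   convert_case_udf = udf(convert_case, StringType())
#   df.withColumn("Name", convert_case_udf(df["Name"]))

# Conceptually, the wrapped UDF applies the function to each value
# of the column, one row at a time:
column = ["john doe", "jane smith", None]
print([convert_case(v) for v in column])  # ['John Doe', 'Jane Smith', None]
```

The None guard matters: Spark passes nulls through to the Python function, so an unguarded s.title() would raise on any null row.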
Delete Duplicates in a PySpark DataFrame
#Scenario

There are two ways to handle row duplication in PySpark DataFrames. The distinct() function in PySpark is used to drop/remove duplicate rows (all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or more columns. Here's an example showing how to utilize the distinct() and dropDuplicates() methods. First, we need... Continue Reading →
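The difference between the two calls can be seen without a Spark session, since pandas has direct analogues: drop_duplicates() with no arguments behaves like PySpark's distinct(), and drop_duplicates(subset=...) like dropDuplicates(["col"]). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["S1", "S1", "S2", "S2"],
    "Age":  [20,   20,   23,   25],
})

# distinct()-style: drop rows that are duplicates across ALL columns.
all_cols = df.drop_duplicates()

# dropDuplicates(["Name"])-style: keep the first row per Name,
# even when the other columns differ.
by_name = df.drop_duplicates(subset=["Name"])

print(len(all_cols), len(by_name))  # 3 2
```

The (S2, 23) and (S2, 25) rows survive distinct() because they differ in Age, but only the first survives deduplication by Name — which is exactly the distinction the PySpark post is drawing.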