Basic to Medium #Python (pandas) interview questions for an entry-level Data Analyst role

1. What are the differences between lists and tuples in Python, and how does this distinction relate to Pandas operations?
2. What is a DataFrame in Pandas, and how does it differ from a Series?
3. Can you explain how to handle missing data in Pandas, including the difference between 'fillna()' and 'dropna()'?
4. Describe the process of... Continue Reading →
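For the first question, a minimal plain-Python sketch of the list/tuple distinction may help. The variable names and values below are made up for illustration and are not taken from the post.

```python
# Lists are mutable: they can be changed in place.
scores = [70, 85]
scores.append(90)

# Tuples are immutable: item assignment raises TypeError.
point = (70, 85)
try:
    point[0] = 99
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# A tuple of hashable items is itself hashable, so it can be a dict key.
sales = {("2023", "Q4"): 1200}

# A list is never hashable, so it cannot be used as a key.
try:
    hash(["2023", "Q4"])
    list_hashable = True
except TypeError:
    list_hashable = False

print(scores, tuple_mutated, list_hashable, sales[("2023", "Q4")])
```

The link to Pandas is hashability: index and column labels must be hashable, which is why MultiIndex labels are tuples, while a list passed to `df[...]` is read as a selection of several columns.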

November 9, 2023

Data Engineering Blogs

75 Engineering blogs worth reading to improve your system design:
High Scalability https://lnkd.in/eQ4eDw4E
Engineering at Meta https://lnkd.in/e8tiSkEv
AWS Architecture Blog https://lnkd.in/eEchKJif
All Things Distributed https://lnkd.in/emXaQDaS
The Netflix Tech Blog https://lnkd.in/efPuR39b
LinkedIn Engineering Blog https://lnkd.in/ehaePQth
Uber Engineering Blog https://eng.uber.com/
Engineering at Quora https://lnkd.in/em-WkhJd
Pinterest Engineering https://lnkd.in/esBTntjq
Lyft Engineering Blog https://eng.lyft.com/
Twitter Engineering Blog https://lnkd.in/evMFNhEs
Dropbox Engineering Blog https://dropbox.tech/... Continue Reading →

November 8, 2023

Insert, Update and Delete in PySpark

Here's the scenario: We had two data tables, Table_A and Table_B, each containing a "Name" and an "Age" column. 📋💡

Table_A:
Name | Age
------------
S1 | 20
S2 | 23
------------

Table_B:
Name | Age
------------
S1 | 22
S4 | 27

Our mission was to determine the differences between these tables and generate an Action (Update, Delete, or Insert) 🚀 and here's the solution we came up... Continue Reading →
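The post's own solution is cut off in this excerpt, so the following is only a sketch of the comparison logic. It is written in plain Python rather than PySpark so that the decision rule is visible without a Spark session. Treating Table_A as the current state and Table_B as the desired state is an assumption, and `classify_changes` is a hypothetical helper name. In PySpark, a full outer join on Name could play the same role.

```python
# Rows from the scenario, keyed by Name.
# Assumption: Table_A is the current state, Table_B is the desired state.
table_a = {"S1": 20, "S2": 23}
table_b = {"S1": 22, "S4": 27}


def classify_changes(current, desired):
    """Return {name: action} for every name whose row differs between the tables."""
    actions = {}
    for name in sorted(current.keys() | desired.keys()):
        if name not in desired:
            actions[name] = "Delete"   # present only in the current table
        elif name not in current:
            actions[name] = "Insert"   # present only in the desired table
        elif current[name] != desired[name]:
            actions[name] = "Update"   # present in both, but Age differs
    return actions


actions = classify_changes(table_a, table_b)
print(actions)  # {'S1': 'Update', 'S2': 'Delete', 'S4': 'Insert'}
```

Swapping the two arguments reverses the Insert and Delete labels, so the direction of the comparison needs to be settled before applying the rule.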

November 8, 2023

PySpark UDF

#PySpark_UDF_with_the_help_of_an_example 👉 👉 👉 One of the most useful features of Spark SQL and the DataFrame API is the PySpark UDF (User Defined Function), which extends PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in conventional databases. ✍ We write a Python function and wrap it in PySpark SQL udf() or register it as udf and... Continue Reading →

November 6, 2023

Delete Duplicates in PySpark DataFrame

#Scenario There are two ways to handle duplicate rows in PySpark DataFrames. The distinct() function removes rows that are duplicated across all columns, while dropDuplicates() can drop duplicates based on one or more selected columns. Here's an example showing how to use the distinct() and dropDuplicates() methods. First, we need... Continue Reading →

November 6, 2023

Big Data Pro Resources

#Resources I referred to for Big Data technologies

These resources are available for free on YouTube. They helped me crack Cisco, and they can help you crack product-based companies as well.

1. Hadoop, Sqoop and Hive concepts by Saif Shaik: https://lnkd.in/ewyYweTJ
2. PySpark concepts in depth by Karunakar Goud: https://lnkd.in/eNtFkxmd
3. Another useful Spark playlist, from Raja's Data Engineering channel: https://lnkd.in/eqiy7dBS
4. Hadoop and Kafka... Continue Reading →

November 3, 2023

Crack The Spark

🚀 Data Engineer Interview Experience
📢 Apache Spark ⌛
How "Executor Out Of Memory" can be explained step by step 👉🏽 https://lnkd.in/gPsrw9Wp
How "Salting" can be explained step by step 👉🏽 https://lnkd.in/gUQUPj8x
How "Data Locality in Spark" can be explained step by step 👉🏽 https://lnkd.in/gcQ_CJZs
How "Garbage Collection (GC) Tuning" can be explained step by step 👉🏽 https://lnkd.in/gY5CQM9c
How "Predicate... Continue Reading →

November 2, 2023

Spark & Hadoop beginner to Advanced Questions

1. What are the different cluster managers provided by Apache Spark?
Three different cluster managers are available in Apache Spark. These are:
Standalone Cluster Manager: The Standalone Cluster Manager is a simple cluster manager that allocates resources based on application requirements. It is resilient in that it can handle task failures.... Continue Reading →

October 30, 2023

PySpark Basic Questions

Q1. What is PySpark?
PySpark is the Python API for Apache Spark, an open-source distributed computing system used for big data processing.
Q2. What is the difference between RDD, DataFrame, and Dataset in PySpark?
The Resilient Distributed Dataset (RDD) is the basic data structure in PySpark. It represents a distributed collection of objects. The Dataset is... Continue Reading →

October 29, 2023

Apache Spark Learning Resources

♐️ Apache Spark is to data engineers what SQL is to relational databases. Just as SQL is a standard language used to interact with and manipulate data in relational databases, Apache Spark provides a powerful framework for processing and analyzing data in a distributed computing environment. With Apache Spark, data engineers can perform complex data transformations,... Continue Reading →

October 28, 2023