PySpark Basic Questions

Q1. What is PySpark? PySpark is the Python API for Apache Spark, an open-source distributed computing system used for big data processing. Q2. What is the difference between RDD, DataFrame, and Dataset in PySpark? A Resilient Distributed Dataset (RDD) is the basic data structure in PySpark. It represents a distributed collection of objects. The Dataset is... Continue Reading →

October 29, 2023

ADVANCED GCP QUESTIONS AND ANSWERS

1. How can you create a new virtual machine instance on Google Cloud Platform using the gcloud command-line tool? Here are the steps to create a new virtual machine instance on Google Cloud Platform using the gcloud command-line tool. Open your terminal or command prompt. Make sure you have the gcloud command-line tool installed and configured on... Continue Reading →

October 28, 2023

INTERMEDIATE GCP QUESTIONS AND ANSWERS

How does Google Cloud Platform differ from other services? Google Cloud Platform (GCP) has a number of distinct characteristics and features that differentiate it from other cloud services: Google-grade Security: GCP uses the same robust architecture and security model Google uses for its own products like Gmail and Search. Advanced Data Analytics and Machine Learning:... Continue Reading →

October 28, 2023

Basic GCP questions and answers

What are the different layers of cloud architecture? The layers of cloud architecture are as follows: Physical layer: This layer contains the network, physical servers, and other components. Infrastructure layer: This layer includes virtualized storage levels, among other things. Platform layer: This layer consists of the applications, operating systems, and other components. Application layer: It is the... Continue Reading →

October 28, 2023

Data Warehouse, Data Lake, Data Mesh

Data is the lifeblood of any modern business. But with so much data available, it can be difficult to know how to store, manage, and analyze it effectively. That's where data warehouse, data lake, lakehouse, and data mesh come in. 1. **Data Warehouse:** - 📂 Structured Data: Designed primarily for structured data storage. - 📊 Analytical Focus: Optimized for... Continue Reading →

October 28, 2023

Delete Duplicates in SQL Data

1. Using rowid: SQL> delete from emp where rowid not in (select max(rowid) from emp group by empno); This technique can be applied to almost all scenarios. The GROUP BY should be on the columns that identify the duplicates. 2. Using self-join: SQL> delete from emp e1 where rowid not in (select max(rowid) from emp e2 where e1.empno = e2.empno); 3. Using row_number(): SQL> delete... Continue Reading →
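As a rough illustration of the first technique, here is a minimal Python sketch that runs the rowid-based delete against an in-memory SQLite database. The excerpt's `SQL>` prompt suggests Oracle; SQLite is swapped in here only because it also exposes an implicit `rowid` and needs no server. The `ename` column, the sample rows, and the helper names `delete_duplicates` and `demo` are assumptions made for the example, not part of the original post.

```python
import sqlite3


def delete_duplicates(conn, table="emp", key="empno"):
    """Keep the row with the highest rowid per key value; delete the rest.

    Hypothetical helper: table and key names are interpolated directly,
    so they must come from trusted code, never from user input.
    """
    conn.execute(
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MAX(rowid) FROM {table} GROUP BY {key})"
    )
    conn.commit()


def demo():
    # In-memory database with made-up sample data containing duplicates.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (empno INTEGER, ename TEXT)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?)",
        [(1, "A"), (1, "A"), (2, "B"), (3, "C"), (3, "C"), (3, "C")],
    )
    delete_duplicates(conn)
    rows = conn.execute(
        "SELECT empno, ename FROM emp ORDER BY empno"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

Grouping on `empno` treats rows with the same employee number as duplicates; if duplicates are defined by several columns, all of them would need to appear in the GROUP BY.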

October 28, 2023

Apache Spark Learning Resources

♐️ Apache Spark is to data engineers what SQL is to relational databases. Just as SQL is a standard language used to interact with and manipulate data in relational databases, Apache Spark provides a powerful framework for processing and analyzing data in a distributed computing environment. With Apache Spark, data engineers can perform complex data transformations,... Continue Reading →

October 28, 2023

Spark – BTS

Internal working of Apache Spark (don't forget to save it). Apache Spark works on the principle of in-memory computation, which can make it up to 100x faster than disk-based processing for some workloads and a highly performant distributed framework. Here is a detailed explanation of what happens internally when a Spark job is executed using the spark-submit command: 📋 Step 1: Client application initiates the execution... Continue Reading →

October 27, 2023

Cloud Resources

☁️ Cloud whispers secrets of data, and in the hands of engineers, it becomes a symphony of insights that reshape the world. 🔰 With big data among the most in-demand technologies today, demand for cloud platforms such as AWS, GCP, or Azure is high, and changing times call for multi-skilled professionals. ✔️ Talking about what a data engineer must... Continue Reading →

October 1, 2023