Read CSV File by Spark

Spark Interview Questions

How to read a CSV file in Spark?

Method 1: spark.read.csv("path")

    df = spark.read.csv("dbfs:/FileStore/small_zipcode.csv")
    df.show()

    +---+-------+--------+-------------------+-----+----------+
    |_c0|    _c1|     _c2|                _c3|  _c4|       _c5|
    +---+-------+--------+-------------------+-----+----------+
    | id|zipcode|    type|               city|state|population|
    |  1|    704|STANDARD|               null|   PR|     30100|
    |  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
    |  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
    |  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
    |  5|  76177|STANDARD|               null|   TX|      null|
    +---+-------+--------+-------------------+-----+----------+

Method 2: spark.read.format("csv") with options

    df = (spark.read.format("csv")
          .option("inferSchema", True)
          .option("header", True)
          .option("sep", ",")
          .load("dbfs:/FileStore/small_zipcode.csv"))
    df.show()

    +---+-------+--------+-------------------+-----+----------+
    |... Continue Reading →
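The Method 1 output shows the header line (id, zipcode, ...) coming back as an ordinary data row under the default column names _c0 to _c5, which is what the header option in Method 2 addresses. The sketch below illustrates the same distinction with Python's standard csv module rather than Spark, so it runs without a cluster. The inlined sample text is an assumption about what the first lines of small_zipcode.csv look like (empty fields where the table shows null); the real file is not shown in the post.

```python
import csv
import io

# Assumed content of the first lines of the file (not taken from the post itself).
sample = (
    "id,zipcode,type,city,state,population\n"
    "1,704,STANDARD,,PR,30100\n"
    "2,704,,PASEO COSTA DEL SUR,PR,\n"
)

# Header-unaware read: the header line is returned as the first data row,
# comparable to Method 1 where it appears under _c0 ... _c5.
rows_no_header = list(csv.reader(io.StringIO(sample)))

# Header-aware read: the first line supplies the column names,
# comparable to option("header", True) in Method 2.
rows_with_header = list(csv.DictReader(io.StringIO(sample)))

print(rows_no_header[0])
print(rows_with_header[0]["zipcode"])
```

Note that csv.DictReader keeps every value as a string. In Spark, columns are likewise read as strings unless inferSchema is enabled or an explicit schema is supplied.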

Free Spark Course

Don't pay for an Apache Spark course just because it is in demand. You can learn for free here.

1. Install Spark from here:
   https://lnkd.in/gx_Dc8ph
   https://lnkd.in/gg6-8xDz
2. Learn Spark basics from here:
   https://lnkd.in/g-gCpUyi
   https://lnkd.in/gkNhMnTZ
   https://lnkd.in/gkbVB6YX
2.1 Learn Spark with Scala from here:
   https://lnkd.in/gtrZAmn4
2.2 Learn Spark with Python from here:
   https://lnkd.in/gQaeSjbH
3. Learn PySpark from here:
   https://lnkd.in/g6kyihyW
4. Work on Spark projects from here:
   https://lnkd.in/gE8hsyZx
   https://lnkd.in/gwWytS-Q
   https://lnkd.in/gR7DR6_5
   https://lnkd.in/gzngHhrC
   https://lnkd.in/gACn6bK8
5. Finally, list your projects here:
   https://github.com/

I highly recommend... Continue Reading →

Git Guide

๐—” ๐—ฆ๐—ถ๐—บ๐—ฝ๐—น๐—ถ๐—ณ๐—ถ๐—ฒ๐—ฑ ๐—š๐˜‚๐—ถ๐—ฑ๐—ฒ ๐˜๐—ผ ๐—š๐—ถ๐˜ ๐˜„๐—ถ๐˜๐—ต ๐—™๐—ฅ๐—˜๐—˜ ๐—น๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐—ฟ๐—ฒ๐˜€๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ๐˜€ ๐Ÿ‘จ๐Ÿ’ปGit, the powerhouse of version control, transforms how developers manage code history and teamwork. Here's a quick breakdown:โ€ข ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ฒ๐—ฑ ๐—•๐—ฟ๐—ถ๐—น๐—น๐—ถ๐—ฎ๐—ป๐—ฐ๐—ฒ: Git repositories are self-contained, offering flexibility and stability. Every developer holds a complete project history, fostering autonomy.โ€ข ๐—ฆ๐—ป๐—ฎ๐—ฝ๐˜€๐—ต๐—ผ๐˜ ๐— ๐—ฎ๐˜€๐˜๐—ฒ๐—ฟ๐˜†: Git captures "snapshots" of files, enabling easy... Continue Reading →

SCD 2 with Pyspark

Implementing a slowly changing dimension (SCD Type 2) in PySpark. Earlier we saw it in SQL: https://lnkd.in/dH6j3MWE

    from pyspark.sql.types import (
        StructType, StructField, IntegerType, StringType, BooleanType
    )

    # Define the schema for the DataFrame
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("department", StringType(), True),
        StructField("active", BooleanType(), True),
        StructField("start", StringType(), True),
        StructField("end", StringType(), True)
    ])

    Employee_data = [
        (1, "John", 100, "HR", True, '2023-10-20', None),
        (2, "Alice", 200, "Finance", True, '2023-10-20', None),
        (3, "Bob", 300, "Engineering", True, '2023-10-20', None),
        (4, "Jane",... Continue Reading →
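The excerpt stops before the SCD logic itself, so the following is only a minimal plain-Python sketch of the general Type 2 idea, not the post's PySpark implementation. It reuses the column names from the schema above (id, name, salary, department, active, start, end). The function name apply_scd2, the choice of tracked columns, the updated salary, and the date 2023-11-01 are all hypothetical and chosen for illustration.

```python
# Columns whose change triggers a new version (an assumption for this sketch).
TRACKED = ("name", "salary", "department")

def apply_scd2(history, updates, effective_date):
    """Return a new history list with SCD Type 2 changes applied.

    history: list of row dicts using the schema's column names.
    updates: dict mapping id -> dict holding the TRACKED attributes.
    """
    result = []
    ids_with_active_row = set()
    for row in history:
        new = updates.get(row["id"])
        if row["active"] and new is not None:
            ids_with_active_row.add(row["id"])
            if any(row[k] != new[k] for k in TRACKED):
                # Expire the current version instead of overwriting it...
                result.append({**row, "active": False, "end": effective_date})
                # ...and append the new version as the active row.
                result.append({"id": row["id"], **{k: new[k] for k in TRACKED},
                               "active": True, "start": effective_date, "end": None})
                continue
        result.append(dict(row))
    # Ids without an active row are inserted as fresh active rows.
    for id_, new in updates.items():
        if id_ not in ids_with_active_row:
            result.append({"id": id_, **{k: new[k] for k in TRACKED},
                           "active": True, "start": effective_date, "end": None})
    return result

# First employee row from the post; the update itself is hypothetical.
history = [{"id": 1, "name": "John", "salary": 100, "department": "HR",
            "active": True, "start": "2023-10-20", "end": None}]
updated = apply_scd2(history,
                     {1: {"name": "John", "salary": 150, "department": "HR"}},
                     "2023-11-01")
for row in updated:
    print(row)
```

The point of the pattern is that the old row is kept and closed (active set to False, end filled in) rather than updated in place, so the full history stays queryable.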

Data Engineering Questions – 1

If your #dataengineering experience is more than 5 years, you can expect these questions in your interviews:

1. Explain the architecture of Spark.
2. How does job execution happen internally?
3. What happens when you fire a Spark job?
4. How did you tune your jobs?
5. Explain the optimizations you have used in your project.
6. How did you connect... Continue Reading →

Data Masking in Pyspark

Hide credit card number: Accept a 16-digit credit card number from the user and display only the last 4 characters of the card number.

input:  1234567891234567
output: ************4567

We can use PySpark or Python.

Code in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import substring

    # Create a SparkSession
    spark = SparkSession.builder.appName("HideCreditCard").getOrCreate()

    # Sample input credit card number
    input_cc_number = "1234567891234567"

    # Hide all characters except the last four... Continue Reading →
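The post notes that either PySpark or plain Python can be used. Since the PySpark code is cut off in this excerpt, here is a minimal plain-Python sketch of the same masking rule. The function name mask_card_number and its parameters are hypothetical, and the 16-digit check is an assumption based on the problem statement.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `visible` characters of a 16-digit card number."""
    if len(card_number) != 16 or not card_number.isdigit():
        raise ValueError("expected a 16-digit card number")
    hidden = len(card_number) - visible
    # Slice from `hidden` rather than using a negative index,
    # so that visible=0 masks the whole number.
    return mask_char * hidden + card_number[hidden:]

masked = mask_card_number("1234567891234567")
print(masked)  # ************4567
```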

PySpark: Cleansing Data with Regex

Delving into PySpark: Cleansing Data with Regex Magic!

Example: Transforming Names with Special Characters

Picture yourself in the realm of data, where you've stumbled upon a trove of Indian names. However, these names are shrouded in a layer of noise, with special characters cluttering them.

Step 1: The Challenge

Imagine a dataset of Indian... Continue Reading →
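The excerpt ends before the post's own code, so the sketch below only illustrates the kind of cleanup described, using Python's standard re module instead of PySpark. The sample names and the clean_name helper are hypothetical, and the pattern assumes the names are written in ASCII letters. In PySpark, the comparable column-level function is regexp_replace from pyspark.sql.functions.

```python
import re

# Hypothetical noisy names; the post's actual dataset is not shown in the excerpt.
raw_names = ["Rah@ul Sha#rma", "Pri!ya  Pa$tel", "  An*il Kum@ar "]

def clean_name(name: str) -> str:
    # Drop every character that is not an ASCII letter or whitespace.
    letters_only = re.sub(r"[^A-Za-z\s]", "", name)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", letters_only).strip()

cleaned = [clean_name(n) for n in raw_names]
print(cleaned)  # ['Rahul Sharma', 'Priya Patel', 'Anil Kumar']
```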

Cloud Data Engineering Road Map

Cloud Data Engineering Road Map

✅ Basic Version Control tool
   https://lnkd.in/gEqyhzZR
   https://lnkd.in/g_t2xKnG
   https://lnkd.in/gZT7QNjS
✅ Data Warehousing Concepts
   https://lnkd.in/gq99PDcp
✅ Core Python
   https://lnkd.in/gQpmSnM
✅ Spark SQL
   https://lnkd.in/gDcR5bwM
✅ Databricks
   https://lnkd.in/gSpKBWbJ
   https://lnkd.in/gpbMg9nU
✅ Spark
   https://lnkd.in/gtqRtTPv
   https://lnkd.in/gs2gkqRq
✅ Pyspark
   https://lnkd.in/gmkPpmAX
   https://lnkd.in/gh-_KzjE
✅ Delta Lake
   https://lnkd.in/gt6ggER6
✅ Cloud ETL Tool + Storage
   https://lnkd.in/gTs8y4Ai
✅ Cloud MPP Warehouse
   https://lnkd.in/gMTHCrNZ
✅ Databricks Unity Catalog
   https://lnkd.in/gH6Q2a5K

Learn, Lead and Make Leaders. Happy Learning.

Follow... Continue Reading →

PySpark: Sales Data Analysis

Exploring PySpark: Advanced Data Analysis

Scenario: Analyzing Multi-Dimensional Sales Data

Imagine being tasked with analyzing sales data that spans multiple dimensions, including time, regions, and product categories. To unlock insights from this complex dataset, PySpark's powerful capabilities come into play.

Step 1: Defining the Challenge

Your goal is to gain a comprehensive understanding of sales performance by... Continue Reading →

Learn Apache Spark Step by Step

Learn Apache Spark Step by Step (Follow the Sequence)

1. Getting started with Apache Spark
   https://lnkd.in/gFRpe3-D
2. A quick introduction to the Spark API
   https://lnkd.in/g8Y3tdhX
3. Overview of Spark - RDD, accumulators, broadcast variable
   https://lnkd.in/g7fepuFF
4. Spark SQL, Datasets, and DataFrames:
   https://lnkd.in/g3iZp7zk
5. PySpark - Processing data with Spark in Python
   https://lnkd.in/gBnh6PAi
6. Processing data with SQL on the command line
   https://lnkd.in/ggnxDaUu
7. Cluster Overview
   https://lnkd.in/guCQnJnv
8. Packaging and deploying... Continue Reading →
