Spark – BTS

Internal working of Apache Spark

๐€๐ฉ๐š๐œ๐ก๐ž ๐’๐ฉ๐š๐ซ๐ค works on the principle of in-memory computation making it 100x faster and a highly performant distributed framework.

Here is a detailed explanation of what happens internally when a Spark job is executed using the spark-submit command:

๐’๐ญ๐ž๐ฉ 1 : Client application initiates the execution of spark job using the ๐’๐ฉ๐š๐ซ๐ค-๐’๐ฎ๐›๐ฆ๐ข๐ญ command.

๐’๐ญ๐ž๐ฉ 2 : The request first hits the Cluster/Resource Manager which launches an Application Master with an AM container and the Spark Driver.

๐’๐ญ๐ž๐ฉ 3 : Driver program runs the main method. It creates a Spark Context/Session that would be active until the applicationโ€™s lifecycle.

The SparkContext helps build the Directed Acyclic Graph (DAG) from the transformations in the running program; the graph captures the RDD lineage.

Once an action is encountered, a Job is created and submitted to the DAG Scheduler.
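This lazy behaviour can be mimicked with a toy pipeline in plain Python. This is a simplified sketch of the idea, not Spark's actual RDD implementation: transformations only record steps in the lineage, and nothing runs until an action such as collect() is called.

```python
# Toy model of lazy evaluation: transformations record lineage,
# and only an action (collect) triggers computation.
# Plain-Python sketch, not Spark's real RDD code.
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = data
        self.lineage = lineage or []   # recorded transformations, not yet run

    def map(self, fn):                 # transformation: just records the step
        return ToyRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):            # transformation: just records the step
        return ToyRDD(self.data, self.lineage + [("filter", pred)])

    def collect(self):                 # action: replays the lineage now
        out = self.data
        for kind, fn in self.lineage:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# No computation has happened yet; only the lineage was recorded.
print(rdd.collect())  # [20, 30, 40]
```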

๐ƒ๐€๐† ๐’๐œ๐ก๐ž๐๐ฎ๐ฅ๐ž๐ซ ๐Ÿงฎ then divides the graph into different stages, which are further divided into Tasks.

Tasks are then submitted to the Task Scheduler, which launches them on the different worker nodes via the Cluster Manager, where the executors run them.

๐’๐ญ๐ž๐ฉ 4 : Spark Driver requests the resource manager for resources to be allocated for the execution of the tasks.

๐’๐ญ๐ž๐ฉ 5 : The Resource Manager then allocates the worker nodes with the requested number of executors for further processing.

๐’๐ญ๐ž๐ฉ 6 : The Spark Driver then submits the code and dependencies to the executors where the execution of tasks takes place.
