Internal working of Apache Spark
Apache Spark is a highly performant distributed framework built on the principle of in-memory computation, which can make it up to 100x faster than disk-based MapReduce for certain workloads.
Here is a detailed explanation of what happens internally when a Spark job is executed using the spark-submit command:
Step 1: The client application initiates execution of the Spark job using the spark-submit command.
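For illustration, a typical spark-submit invocation might look like the following. The application name, paths, and resource sizes here are placeholder values, not taken from the original post:

```shell
# Submit a Spark application to a YARN cluster in cluster deploy mode,
# where the driver runs inside the Application Master container.
# All names, paths, and resource sizes below are illustrative placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.MyApp \
  my-app.jar
```

With `--deploy-mode client` instead, the driver would run on the submitting machine rather than inside the cluster.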
Step 2: The request first hits the Cluster/Resource Manager, which launches an Application Master in an AM container along with the Spark Driver.
Step 3: The Driver program runs the main method. It creates a SparkContext/SparkSession that remains active for the application's lifecycle.
The Spark Context helps build the Directed Acyclic Graph (DAG) of RDD lineages based on the transformations in the running program.
Once an action is encountered, a Job is created and submitted to the DAG Scheduler.
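The lazy-evaluation behaviour described above can be sketched in plain Python. This is a toy model, not Spark's actual API or code: transformations merely record lineage, and nothing runs until an action is called.

```python
# Toy model of Spark's lazy evaluation (illustrative only, not the real API).
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = data                  # source data, only read when an action runs
        self.lineage = lineage or []      # recorded transformations, i.e. the lineage

    def map(self, fn):
        # Transformation: record the step, perform no work yet.
        return ToyRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just extends the lineage.
        return ToyRDD(self.data, self.lineage + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded lineage actually executed.
        result = self.data
        for kind, fn in self.lineage:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# Nothing has been computed yet; the lineage holds two recorded steps.
print(len(rdd.lineage))   # 2
print(rdd.collect())      # [12, 14, 16, 18]
```

In real Spark, `collect()` (and other actions such as `count()` or `saveAsTextFile()`) is the point where a Job is created and handed to the DAG Scheduler.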
The DAG Scheduler then divides the graph into stages, split at shuffle boundaries, which are further divided into tasks.
The tasks are then submitted to the Task Scheduler, which launches them via the Cluster Manager on the different worker nodes, where the executors run them.
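The stage-splitting idea can be sketched as a toy: narrow transformations are pipelined into one stage, and each wide (shuffle) operation closes the current stage. The `split_into_stages` helper and the operation lists are hypothetical, for illustration only, not Spark's actual scheduler code.

```python
# Toy illustration of stage boundaries: narrow ops are pipelined into one
# stage; each wide (shuffle) op ends the current stage. Hypothetical helper,
# not Spark's real scheduler.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE_OPS:          # shuffle boundary: close the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

job = ["map", "filter", "reduceByKey", "map", "join", "map"]
print(split_into_stages(job))
# [['map', 'filter', 'reduceByKey'], ['map', 'join'], ['map']]
```

Each resulting stage is then broken into tasks, roughly one per data partition.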
Step 4: The Spark Driver requests resources from the Resource Manager for executing the tasks.
Step 5: The Resource Manager then allocates worker nodes with the requested number of executors for further processing.
Step 6: The Spark Driver then ships the application code and dependencies to the executors, where the execution of tasks takes place.