What are…
The total number of cores and partitions?
The total number of executors?
The total memory required?
Let’s walk through how to estimate the resources needed when processing a 50GB dataset in Apache Spark, assuming the default partition size of 128MB.
Convert Data to MB
Since Spark works with partition sizes in MB by default:
50 GB × 1024 = 51,200 MB
Spark creates one task per partition. With a default partition size of 128MB:
51,200 MB ÷ 128 MB = 400 partitions
Processing all 400 partitions in parallel requires 400 cores.
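The partition math above can be sketched in a few lines of Python (the variable names are illustrative, not part of any Spark API):

```python
import math

# Spark's default partition size (spark.sql.files.maxPartitionBytes is 128MB)
dataset_gb = 50
partition_mb = 128

dataset_mb = dataset_gb * 1024                      # 51,200 MB
partitions = math.ceil(dataset_mb / partition_mb)   # one task per partition
print(partitions)  # 400
```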
Determine Number of Executors
A good rule of thumb is to allocate 2 to 5 cores per executor.
Let’s assume 4 cores per executor:
400 cores ÷ 4 cores/executor = 100 executors
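Continuing the same sketch, the executor count follows directly (again, plain arithmetic, not a Spark call):

```python
import math

total_cores = 400        # one core per partition, from the previous step
cores_per_executor = 4   # within the 2-5 cores rule of thumb

executors = math.ceil(total_cores / cores_per_executor)
print(executors)  # 100
```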
Four cores per executor is also in line with what Spark’s developers commonly recommend.
Each core processes one 128MB partition at a time, so the bare minimum is:
4 cores × 128 MB = 512 MB per executor (minimum memory required)
In practice an executor also needs headroom for shuffle, caching, and overhead, so a common rule of thumb is to provision about 4× that minimum:
4 × 512 MB = 2 GB of memory per executor
100 executors × 2 GB = 200 GB total memory
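All of the steps above can be wrapped into one small helper. This is a back-of-the-envelope sketch under the assumptions used in this walkthrough (128MB partitions, 4 cores per executor, a 4× memory multiplier); the function name is made up for illustration:

```python
import math

def estimate_resources(dataset_gb, partition_mb=128,
                       cores_per_executor=4, mem_multiplier=4):
    """Rough Spark sizing estimate; returns
    (partitions, executors, memory_per_executor_gb, total_memory_gb)."""
    dataset_mb = dataset_gb * 1024
    partitions = math.ceil(dataset_mb / partition_mb)
    cores = partitions  # one core per partition for full parallelism
    executors = math.ceil(cores / cores_per_executor)
    # minimum per executor = cores_per_executor * partition_mb;
    # multiply by the rule-of-thumb factor for shuffle/cache headroom
    mem_per_executor_gb = cores_per_executor * partition_mb * mem_multiplier / 1024
    total_mem_gb = executors * mem_per_executor_gb
    return partitions, executors, mem_per_executor_gb, total_mem_gb

print(estimate_resources(50))  # (400, 100, 2.0, 200.0)
```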
Summary:
Dataset Size: 50GB
Total Partitions / Cores: ~400
Executors: 100
Memory per Executor: ~2GB
Total Memory: ~200GB
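As a rough sketch, a spark-submit invocation matching this estimate could look like the following (the script name is hypothetical, and your cluster manager may cap executors below 100):

```shell
# Flags mirror the walkthrough: 100 executors x 4 cores x 2GB.
spark-submit \
  --num-executors 100 \
  --executor-cores 4 \
  --executor-memory 2G \
  my_job.py
```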
To handle the 50GB dataset efficiently, this is the configuration we need.