How do you handle a 50GB dataset in Spark?

What are the:
Total number of cores and partitions?
Total number of executors?
Total memory required?

Let’s walk through how to estimate the resources needed when processing a 50GB dataset in Apache Spark, assuming the default partition size of 128MB.

Convert the Data Size to MB
Since the partition size is expressed in MB:

50 GB × 1024 = 51,200 MB
Spark creates one task per partition. With the default partition size of 128MB:

51,200 MB ÷ 128 MB = 400 partitions

So ~400 cores are needed to run all tasks in parallel.
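The partition arithmetic above can be sketched as a quick back-of-the-envelope calculation (the 128MB figure corresponds to Spark's default `spark.sql.files.maxPartitionBytes`):

```python
# Estimate the number of partitions (and tasks) for a 50GB dataset
# with Spark's default partition size of 128MB.
dataset_gb = 50
partition_mb = 128  # default spark.sql.files.maxPartitionBytes (128MB)

dataset_mb = dataset_gb * 1024           # 51,200 MB
partitions = dataset_mb // partition_mb  # Spark creates one task per partition

print(partitions)  # 400
```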

Determine Number of Executors
A good rule of thumb is to allocate 2 to 5 cores per executor.
Let’s assume 4 cores per executor:


400 cores ÷ 4 cores/executor = 100 executors

Estimate Memory per Executor
A rule of thumb often attributed to Spark's developers: each core needs at least enough memory for the partition it processes.

4 cores × 128MB = 512MB per executor (minimum memory required)

Allowing a ~4× factor on top of that minimum for shuffle, caching, and JVM overhead:

512MB × 4 = 2GB memory required per executor

100 executors × 2GB = 200GB total memory
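Continuing the estimate in code, assuming 4 cores per executor and a ~4× memory headroom factor as above:

```python
# Derive executor count and memory from the partition count.
partitions = 400          # from the previous step; one core per task
cores_per_executor = 4    # rule of thumb: 2-5 cores per executor
partition_mb = 128        # default partition size
overhead_factor = 4       # headroom for shuffle, caching, JVM overhead

executors = partitions // cores_per_executor             # 100 executors
min_mem_mb = cores_per_executor * partition_mb           # 512 MB minimum
executor_mem_gb = min_mem_mb * overhead_factor / 1024    # 2.0 GB per executor
total_mem_gb = executors * executor_mem_gb               # 200.0 GB total

print(executors, executor_mem_gb, total_mem_gb)  # 100 2.0 200.0
```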


Summary:
Dataset Size: 50GB
Total Partitions / Cores: ~400
Executors: 100
Memory per Executor: ~2GB
Total Memory: ~200GB

To handle the 50GB of data efficiently, this is the configuration we need.
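As a sketch, these numbers could be passed to Spark like this (assumes PySpark and a cluster manager that honors `spark.executor.instances`, such as YARN; the app name is hypothetical and this is not a definitive production config):

```python
# Sketch: applying the estimated configuration via PySpark.
# Assumes a cluster manager (e.g. YARN) that honors spark.executor.instances.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("process-50gb-dataset")        # hypothetical app name
    .config("spark.executor.instances", "100")  # 100 executors
    .config("spark.executor.cores", "4")        # 4 cores per executor
    .config("spark.executor.memory", "2g")      # 2GB per executor
    .getOrCreate()
)
```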
