Cluster configuration

Data Engineer Interview – Thinking with Numbers 🧮

Interviewer:
You need to process 1 TB of data in Spark. How do you decide the cluster size?

Candidate:
I don’t guess. I calculate.



🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required:
1,024 GB × 1,024 MB/GB ÷ 128 MB ≈ 8,192 partitions

This sets the minimum parallelism needed.
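The arithmetic above can be sketched in a few lines (128 MB is also Spark’s default for `spark.sql.files.maxPartitionBytes`):

```python
# Back-of-the-envelope partition count for 1 TB at a 128 MB target partition size.
total_mb = 1 * 1024 * 1024        # 1 TB expressed in MB
target_partition_mb = 128         # target (and Spark default) partition size
partitions = total_mb // target_partition_mb
print(partitions)                 # 8192 partitions = minimum useful parallelism
```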



⚙️ Step 2 | Decide Parallel Execution Capacity
Assume:
• 12 worker nodes
• 16 cores per node

Total parallel tasks at a time:
12 × 16 = 192 tasks

Number of execution waves:
8,192 ÷ 192 ≈ 43 waves

This gives me a rough runtime estimate before execution even starts.
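Under the same assumptions (12 nodes × 16 cores), the wave count is a one-liner; rounding up matters because a partial final wave still costs a full wave of wall-clock time:

```python
# Execution waves: total partitions divided by cluster-wide task slots, rounded up.
nodes, cores_per_node = 12, 16
parallel_tasks = nodes * cores_per_node            # 192 tasks run at a time
partitions = 8192
waves = -(-partitions // parallel_tasks)           # ceiling division -> 43 waves
print(parallel_tasks, waves)                       # 192 43
```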



🧠 Step 3 | Memory Math (Not Guesswork)
Assume:
• 64 GB RAM per node
• Usable Spark memory ≈ 70% → 45 GB

Per executor (4 executors/node):
• Memory per executor ≈ 11 GB

Rule of thumb:
• One Spark task should not exceed ~1–1.5 GB
• So max safe concurrent tasks per executor = 7–8

This avoids OOM during shuffles and joins.
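The memory math, using the assumed 64 GB nodes, 70% usable fraction, and 4 executors per node (the 1–1.5 GB/task ceiling is the rule of thumb above, not a Spark setting):

```python
# Per-executor memory budget and a safe concurrency ceiling per executor.
node_ram_gb = 64
usable_gb = node_ram_gb * 0.70                     # ~45 GB usable for Spark
executors_per_node = 4
mem_per_executor_gb = usable_gb / executors_per_node   # ~11 GB per executor
max_task_gb = 1.5                                  # rule-of-thumb ceiling per task
safe_concurrent_tasks = int(mem_per_executor_gb // max_task_gb)  # ~7 tasks
print(round(mem_per_executor_gb), safe_concurrent_tasks)
```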



🔄 Step 4 | Shuffle Cost Estimation
If:
• Join causes 3× data expansion
• Shuffle size ≈ 3 TB

With disk throughput ~500 MB/s (SSD):
• Minimum shuffle time ≈
3,000 GB ÷ 0.5 GB/s = 6,000 s ≈ 100 minutes (worst case)

This tells me whether to repartition, broadcast, or redesign the join logic.
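The same worst-case estimate as a sketch, with the shuffle rounded to 3,000 GB as above (a single-disk lower bound; real shuffles also pay network and serialization cost):

```python
# Worst-case shuffle time: bytes to move divided by disk throughput.
shuffle_gb = 3000                  # ~3x expansion on a 1 TB input
throughput_gb_s = 0.5              # ~500 MB/s, a typical SSD
shuffle_minutes = shuffle_gb / throughput_gb_s / 60
print(shuffle_minutes)             # 100.0 minutes, lower bound
```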



📈 Step 5 | Scale Decision
If SLA requires < 30 minutes:
• Required parallelism increase ≈ 3–4×

Options:
• Increase nodes
• Increase cores
• Reduce shuffle volume

This becomes a math optimization problem, not a Spark config problem.
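Putting Steps 4 and 5 together, the required scale-up falls out of a single division:

```python
# How much more parallelism is needed to bring a 100-minute job under a 30-minute SLA.
estimated_minutes = 100            # worst-case shuffle time from Step 4
sla_minutes = 30
scale_factor = -(-estimated_minutes // sla_minutes)   # ceiling -> 4x
print(scale_factor)                # 4, i.e. ~3-4x more nodes/cores or less shuffle
```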
