Data Engineer Interview – Thinking with Numbers 🧮
Interviewer:
You need to process 1 TB of data in Spark. How do you decide the cluster size?
Candidate:
I don’t guess. I calculate.
🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required:
1,024 × 1024 / 128 ≈ 8,192 partitions
This sets the minimum parallelism needed.
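The partition arithmetic above can be sketched in a few lines (numbers taken from this step; 128 MB mirrors Spark's default `spark.sql.files.maxPartitionBytes`):

```python
# Partition count from data volume (sizing sketch, not a Spark API call)
total_data_gb = 1024          # 1 TB ≈ 1,024 GB
target_partition_mb = 128     # target partition size

total_data_mb = total_data_gb * 1024
partitions = total_data_mb // target_partition_mb
print(partitions)  # 8192 partitions → minimum parallelism
```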
⚙️ Step 2 | Decide Parallel Execution Capacity
Assume:
• 12 worker nodes
• 16 cores per node
Total parallel tasks at a time:
12 × 16 = 192 tasks
Number of execution waves:
8,192 ÷ 192 ≈ 43 waves
This tells me how long the job will run before execution even starts.
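Waves round up, since a final partial wave still occupies the cluster. A minimal sketch of this step:

```python
import math

nodes = 12
cores_per_node = 16
partitions = 8192

# Tasks that can run simultaneously across the cluster
parallel_tasks = nodes * cores_per_node      # 192

# Execution waves: total partitions divided by concurrent task slots
waves = math.ceil(partitions / parallel_tasks)
print(parallel_tasks, waves)  # 192, 43
```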
🧠 Step 3 | Memory Math (Not Guesswork)
Assume:
• 64 GB RAM per node
• Usable Spark memory ≈ 70% → 45 GB
Per executor (4 executors/node):
• Memory per executor ≈ 11 GB
Rule of thumb:
• One Spark task should not exceed ~1–1.5 GB
• So max safe concurrent tasks per executor ≈ 7–8 (≈ 11 GB ÷ 1.5 GB)
This avoids OOM during shuffles and joins.
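The memory budget works out as follows (70% usable and 1.5 GB per task are the rules of thumb from this step, not fixed Spark limits):

```python
node_ram_gb = 64
usable_fraction = 0.70                # rough share left for Spark after OS/overhead
executors_per_node = 4
max_task_gb = 1.5                     # rule-of-thumb ceiling per task

usable_gb = node_ram_gb * usable_fraction          # ~45 GB per node
mem_per_executor = usable_gb / executors_per_node  # ~11 GB per executor

# Concurrent tasks an executor can hold without risking OOM
safe_tasks = int(mem_per_executor // max_task_gb)  # 7
print(round(mem_per_executor, 1), safe_tasks)
```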
🔄 Step 4 | Shuffle Cost Estimation
If:
• Join causes 3× data expansion
• Shuffle size ≈ 3 TB
With disk throughput ~500 MB/s (SSD), worst case with the shuffle bottlenecked on a single disk:
• Minimum shuffle time ≈
3,000 GB ÷ 0.5 GB/s = 6,000 s ≈ 100 minutes
This tells me whether to repartition, broadcast, or redesign the join logic.
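The same back-of-the-envelope shuffle estimate as code (3× expansion and 500 MB/s are the assumptions from this step):

```python
input_gb = 1000                  # ~1 TB of input
expansion_factor = 3             # join causes 3x data expansion
shuffle_gb = input_gb * expansion_factor   # ~3,000 GB shuffled

disk_throughput_gb_s = 0.5       # ~500 MB/s SSD, single-disk worst case

# Lower bound on shuffle time if one disk is the bottleneck
shuffle_minutes = shuffle_gb / disk_throughput_gb_s / 60
print(shuffle_minutes)  # 100.0 minutes
```

If that bound blows the SLA, that is the signal to repartition, broadcast the smaller side, or redesign the join before touching cluster size.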
📈 Step 5 | Scale Decision
If SLA requires < 30 minutes:
• Required parallelism increase ≈ 3–4×
Options:
• Increase nodes
• Increase cores
• Reduce shuffle volume
This becomes a math optimization problem, not a Spark config problem.
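Once the math settles on a shape, it translates into launch flags. A hypothetical `spark-submit` matching the numbers above (12 nodes × 4 executors = 48 executors, 4 cores each, ~10 GB heap leaving room for overhead; `my_job.py` is a placeholder, and the values are illustrative, not a recommendation):

```shell
# Sizing sketch: 48 executors x 4 cores = 192 concurrent tasks,
# shuffle partitions matched to the 8,192 computed in Step 1
spark-submit \
  --num-executors 48 \
  --executor-cores 4 \
  --executor-memory 10g \
  --conf spark.sql.shuffle.partitions=8192 \
  my_job.py
```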