Data Engineer Interview – Thinking with Numbers 🧮
Interviewer:
You need to process 1 TB of data in Spark. How do you decide the cluster size?
Candidate:
I don’t guess. I calculate.
🔢 Step 1 | Understand the Data Volume
• Total data = 1 TB ≈ 1,024 GB
• Target partition size = 128 MB
• Total partitions required:
1,024 × 1024 / 128 ≈ 8,192 partitions
This sets the minimum parallelism needed.
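The partition arithmetic above can be sketched in a few lines (numbers taken from this step; 128 MB mirrors Spark's default `spark.sql.files.maxPartitionBytes`):

```python
# Partition count from data volume (sizing sketch, not a Spark API call)
total_data_gb = 1024          # 1 TB ≈ 1,024 GB
target_partition_mb = 128     # target partition size

total_data_mb = total_data_gb * 1024
partitions = total_data_mb // target_partition_mb
print(partitions)  # 8192 partitions → minimum parallelism
```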
⚙️ Step 2 | Decide Parallel Execution Capacity
Assume:
• 12 worker nodes
• 16 cores per node
Total parallel tasks at a time:
12 × 16 = 192 tasks
Number of execution waves:
8,192 ÷ 192 ≈ 43 waves
This tells me how long the job will run before execution even starts.
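Waves round up, since a final partial wave still occupies the cluster. A minimal sketch of this step:

```python
import math

nodes = 12
cores_per_node = 16
partitions = 8192

# Tasks that can run simultaneously across the cluster
parallel_tasks = nodes * cores_per_node      # 192

# Execution waves: total partitions divided by concurrent task slots
waves = math.ceil(partitions / parallel_tasks)
print(parallel_tasks, waves)  # 192, 43
```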
🧠 Step 3 | Memory Math (Not Guesswork)
Assume:
• 64 GB RAM per node
• Usable Spark memory ≈ 70% → 45 GB
Per executor (4 executors/node):
• Memory per executor ≈ 11 GB
Rule of thumb:
• One Spark task should not exceed ~1–1.5 GB
• So max safe concurrent tasks per executor ≈ 7–8 (≈ 11 GB ÷ 1.5 GB)
This avoids OOM during shuffles and joins.
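The memory budget works out as follows (70% usable and 1.5 GB per task are the rules of thumb from this step, not fixed Spark limits):

```python
node_ram_gb = 64
usable_fraction = 0.70                # rough share left for Spark after OS/overhead
executors_per_node = 4
max_task_gb = 1.5                     # rule-of-thumb ceiling per task

usable_gb = node_ram_gb * usable_fraction          # ~45 GB per node
mem_per_executor = usable_gb / executors_per_node  # ~11 GB per executor

# Concurrent tasks an executor can hold without risking OOM
safe_tasks = int(mem_per_executor // max_task_gb)  # 7
print(round(mem_per_executor, 1), safe_tasks)
```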
🔄 Step 4 | Shuffle Cost Estimation
If:
• Join causes 3× data expansion
• Shuffle size ≈ 3 TB
With disk throughput ~500 MB/s (SSD), worst case with the shuffle bottlenecked on a single disk:
• Minimum shuffle time ≈
3,000 GB ÷ 0.5 GB/s = 6,000 s ≈ 100 minutes
This tells me whether to repartition, broadcast, or redesign the join logic.
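The same back-of-the-envelope shuffle estimate as code (3× expansion and 500 MB/s are the assumptions from this step):

```python
input_gb = 1000                  # ~1 TB of input
expansion_factor = 3             # join causes 3x data expansion
shuffle_gb = input_gb * expansion_factor   # ~3,000 GB shuffled

disk_throughput_gb_s = 0.5       # ~500 MB/s SSD, single-disk worst case

# Lower bound on shuffle time if one disk is the bottleneck
shuffle_minutes = shuffle_gb / disk_throughput_gb_s / 60
print(shuffle_minutes)  # 100.0 minutes
```

If that bound blows the SLA, that is the signal to repartition, broadcast the smaller side, or redesign the join before touching cluster size.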
📈 Step 5 | Scale Decision
If SLA requires < 30 minutes:
• Required parallelism increase ≈ 3–4×
Options:
• Increase nodes
• Increase cores
• Reduce shuffle volume
This becomes a math optimization problem, not a Spark config problem.
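Once the math settles on a shape, it translates into launch flags. A hypothetical `spark-submit` matching the numbers above (12 nodes × 4 executors = 48 executors, 4 cores each, ~10 GB heap leaving room for overhead; `my_job.py` is a placeholder, and the values are illustrative, not a recommendation):

```shell
# Sizing sketch: 48 executors x 4 cores = 192 concurrent tasks,
# shuffle partitions matched to the 8,192 computed in Step 1
spark-submit \
  --num-executors 48 \
  --executor-cores 4 \
  --executor-memory 10g \
  --conf spark.sql.shuffle.partitions=8192 \
  my_job.py
```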