The Setup (Where I Thought I Was Smart)
Four years ago, I was fresh at my job. Still figuring out Kubernetes. Still learning DevOps. But I was eager to prove I knew what I was doing.
So I volunteered to set up the lower environment cluster on AWS. You know, the cluster developers use for testing before pushing to production.
I built it from scratch:
# Infrastructure Stack for EKS Migration:
- EKS cluster with eksctl YAML
- Private subnets for worker nodes
- NAT gateway for egress
- Microservices deployed (being tested by developers)
- IPv4-only VPC
- NGINX Ingress Controller
- AWS ACM for SSL/TLS
- Network Load Balancer
- CloudFront distribution in front
- Route 53 A record pointing to CloudFront
And it worked. Everything worked perfectly. I was feeling like a real DevOps engineer. Then the messages started.
The Incident
We were migrating from EC2 to Kubernetes. I'd set up the EKS cluster and asked developers to test the microservices while I finished deploying the rest.
They reported the domain wouldn't resolve. On certain networks, browsers showed ERR_NAME_NOT_RESOLVED or DNS_PROBE_FINISHED_NXDOMAIN. But it worked fine from my machine and office network.
Some networks: broken. Others: fine.
I was sweating.
The Panic Phase
Everything checked out:
Application logs → Fine
Kubernetes events → Normal
Cluster health → Good
Load balancer → Healthy
Route 53 DNS records → All there
CloudFront settings → Correct
But developers were still getting resolution failures. Works from some networks (IPv4), fails from others (IPv6-only).
The pattern was screaming at me. I just wasn't listening.
The Rabbit Hole
I debugged the symptom, not the problem.
First 15 minutes: checked DNS. Records were there. CNAME was correct. CloudFront alias configured. But I kept digging into DNS because that's where the error pointed.
Next 15 minutes: tried different DNS providers. Changed to Google Public DNS. Flushed cache. Still broken from some networks, fine from others.
Next 15 minutes: blamed CloudFront. Checked everything. Tried cache invalidation. Tried recreating the distribution.
Then I started spiraling. Route 53 routing policy? Load balancer misconfigured? Rebuild the cluster?
One hour in. No progress. Just deeper into the wrong layer.
The Moment I Stopped (And Actually Thought)
Then it hit me: Why does it work from network A but not network B?
That's not a DNS question. That's not a CloudFront question. That's a network connectivity question.
I'd been debugging the wrong layer the whole time.
So I did what I should have done 45 minutes earlier. I checked what the failing network could actually reach:
# From the problematic network:
$ curl https://ipv4.icanhazip.com
# (no response, timeout)
$ curl https://ipv6.icanhazip.com
# 2409:XXXX:XXXX::1 ✓
There it was. That network gave me IPv6 only. No IPv4. And my entire setup was IPv4-only.
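If you'd rather script that check than remember two curl commands, here is a minimal Python sketch. The names `has_route` and `describe_network` are made up for illustration, and the probe addresses (Google Public DNS) are my assumption; any public IPv4 and IPv6 address should do. It only asks the operating system whether it has a route for each IP family, so treat the result as a quick hint, not proof of end-to-end connectivity.

```python
import socket

# Probe targets are assumptions: Google Public DNS addresses, used only as
# well-known public destinations. Any public IPv4/IPv6 address would do.
PROBES = {
    socket.AF_INET: ("8.8.8.8", 53),
    socket.AF_INET6: ("2001:4860:4860::8888", 53),
}


def has_route(family):
    """Return True if the OS can pick a route for this address family.

    Connecting a UDP socket sends no packets. It only asks the kernel to
    choose a route, so it fails fast when the family is unavailable.
    """
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.connect(PROBES[family])
        return True
    except OSError:
        return False


def describe_network():
    """Summarize which IP families this machine appears able to route."""
    return {
        "ipv4": has_route(socket.AF_INET),
        "ipv6": has_route(socket.AF_INET6),
    }
```

Calling `describe_network()` from the failing network and from a working one gives you two small dictionaries to compare side by side.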
The Realization
I had only created an A record (IPv4) in Route 53. Developers on IPv6-only networks got nothing they could use: the name existed, but it had no AAAA record, so their lookups came back without an address.
Simple:
IPv6-only developer
↓
Looks up mydomain.com
↓
Route 53 returns only A record
↓
"I don't have IPv4, can't use this"
↓
ERR_NAME_NOT_RESOLVED
The answer was staring at me. I was debugging the wrong layer.
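The flow above can be written down as a tiny model. This is a deliberately simplified Python sketch, not how a real resolver works: it ignores things like DNS64/NAT64, which some IPv6-only networks use to reach IPv4 hosts. The function names and the "ipv4"/"ipv6" labels are my own, chosen for illustration.

```python
# Which DNS record type each IP family needs in order to connect.
RECORD_FOR_FAMILY = {"ipv4": "A", "ipv6": "AAAA"}


def usable_record_types(published_records, client_families):
    """Return the published record types this client can actually use."""
    needed = {RECORD_FOR_FAMILY[family] for family in client_families}
    return needed & set(published_records)


def outcome(published_records, client_families):
    """Describe what the client experiences in this simplified model."""
    if usable_record_types(published_records, client_families):
        return "connects"
    return "no usable address"
```

With only an A record published, `outcome({"A"}, {"ipv6"})` gives "no usable address", while the same call for an IPv4 client gives "connects". Publishing both record types makes both clients connect.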
The Fix
Add an AAAA record (IPv6) to Route 53 pointing to CloudFront.
mydomain.com
A record → CloudFront
AAAA record → CloudFront
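Now IPv6-only clients resolve the domain to CloudFront's IPv6 addresses. CloudFront accepts IPv6 connections from viewers (as long as IPv6 is enabled on the distribution) and talks to the IPv4-only origin on their behalf. Took 5 minutes to fix. Took 55 minutes to find.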
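For reference, here is roughly what that change looks like as a Route 53 change batch, sketched in Python. A few assumptions: I'm treating both records as alias records pointing at the distribution, `alias_change_batch` is a hypothetical helper name, and `d111111abcdef8.cloudfront.net` is a placeholder, not a real distribution. `Z2FDTNDATAQYW2` is the hosted zone ID AWS documents for CloudFront alias targets.

```python
# Hosted zone ID that AWS documents for alias records targeting CloudFront.
CLOUDFRONT_HOSTED_ZONE_ID = "Z2FDTNDATAQYW2"


def alias_change_batch(domain, cloudfront_domain):
    """Build a change batch that upserts A and AAAA alias records."""
    changes = []
    for record_type in ("A", "AAAA"):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": record_type,
                "AliasTarget": {
                    "HostedZoneId": CLOUDFRONT_HOSTED_ZONE_ID,
                    "DNSName": cloudfront_domain,
                    # Route 53 does not evaluate health for CloudFront targets.
                    "EvaluateTargetHealth": False,
                },
            },
        })
    return {"Comment": "Dual-stack alias records", "Changes": changes}


# Placeholder values, for illustration only.
batch = alias_change_batch("mydomain.com", "d111111abcdef8.cloudfront.net")
```

The resulting dictionary has the shape boto3's Route 53 client expects as the `ChangeBatch` argument of `change_resource_record_sets`, alongside the `HostedZoneId` of your own zone. I've left the actual API call out so the sketch runs without credentials.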
The Lesson
When something breaks unevenly, don't debug the error. Debug the difference.
I focused on the symptom (DNS resolution failed). I should have asked: why does it work from network A but not network B?
The difference tells you the problem:
- Works: Networks with IPv4
- Fails: Networks with IPv6 only
- Problem: Only IPv4 DNS records
One question. That's all it took.
Most engineers debug vertically: deeper into the same layer. Good engineers debug horizontally: they find what's different between working and broken.
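"Debug the difference" can be as mechanical as writing down what you know about a working environment and a broken one, then keeping only what differs. A minimal Python sketch, with the caveat that the observations below are illustrative values I made up to resemble this incident, not data I recorded at the time:

```python
def diff_observations(working, broken):
    """Return only the facts that differ: {fact: (working_value, broken_value)}."""
    differences = {}
    for fact in sorted(set(working) | set(broken)):
        if working.get(fact) != broken.get(fact):
            differences[fact] = (working.get(fact), broken.get(fact))
    return differences


# Hypothetical observations, for illustration only.
working_network = {
    "browser": "Chrome",
    "dns_resolver": "Google Public DNS",
    "has_ipv4": True,
}
broken_network = {
    "browser": "Chrome",
    "dns_resolver": "Google Public DNS",
    "has_ipv4": False,
}

difference = diff_observations(working_network, broken_network)
```

Everything that matches drops out, and `difference` is left holding the one fact worth investigating: `has_ipv4`.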
What You Should Know
When something breaks:
- Don't panic. Seriously. Panic makes you stupid.
- Check the obvious (logs, configs, health).
- If that's fine, zoom out. What's different between working and broken?
- Debug from the affected user's perspective, not your machine.
Wrong vs Right:
- WRONG: "The error says DNS_PROBE_FINISHED_NXDOMAIN, so I'll debug DNS"
- RIGHT: "Works from IPv4 networks, fails from IPv6 networks. Why?"
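To turn the RIGHT question into a check, you can ask what the name resolves to for each address family separately. Below is a minimal Python sketch using the standard library's `socket.getaddrinfo`. The helper name `families_resolved` and its `resolver` parameter are my own additions (the parameter exists so the function can be exercised without a network), and on networks that use DNS64 the resolver may hand back synthesized AAAA answers, so read the output with that in mind.

```python
import socket


def families_resolved(hostname, resolver=socket.getaddrinfo):
    """Return the addresses a hostname resolves to, split by record type."""
    results = {}
    for label, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        try:
            infos = resolver(hostname, 443, family=family, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # the address string is the first item of sockaddr.
            results[label] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Resolution failed for this family: treat it as no addresses.
            results[label] = []
    return results
```

An empty list under "AAAA" next to a populated "A" is the exact shape of this incident: the name resolves, just not for clients that only speak IPv6.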
The Timeline
14:30 - Error report
14:35 - Check logs (fine)
14:40 - Check Kubernetes (fine)
14:50 - Blame DNS (waste 15 min)
15:05 - Try different DNS provider (waste 15 min)
15:20 - Blame CloudFront (waste 15 min)
15:35 - Finally ask: "Why does it work here but not there?"
15:40 - Realize: only A record, no AAAA record
15:45 - Add AAAA record to Route 53
15:50 - Test: Works
Total: 1 hour 20 minutes
Time wasted debugging wrong layer: 45 minutes
Time to actually fix: 5 minutes
The Real Lesson
You'll have incidents. You'll panic. That's normal.
The engineers who get ahead stop in the middle of the panic and ask: "Am I looking at the problem or the symptom?"
Symptom: DNS resolution failed
Problem: Missing AAAA record (IPv6 DNS)
I spent an hour on the symptom. Five minutes on the problem would have fixed it.
TL;DR
Problem: The domain in front of the EKS cluster resolved from IPv4 networks but failed from IPv6-only networks.
My Mistake: Debugged DNS, CloudFront, Kubernetes. All the wrong layers.
Root Cause: Only created A record (IPv4). No AAAA record (IPv6).
Fix: Add AAAA record to Route 53.
Pattern: When something breaks unevenly, debug the difference, not the error.