I Panicked Over Nothing (And It Took Me An Hour To Realize)

The Setup (Where I Thought I Was Smart)

Four years ago, I was fresh at my job. Still figuring out Kubernetes. Still learning DevOps. But I was eager to prove I knew what I was doing.

So I volunteered to set up the lower environment cluster on AWS. You know, the cluster developers use for testing before pushing to production.

I built it from scratch:

# Infrastructure Stack for EKS Migration:
- EKS cluster with eksctl YAML
- Private subnets for worker nodes
- NAT gateway for egress
- Microservices deployed (being tested by developers)
- IPv4-only VPC
- NGINX Ingress Controller
- AWS ACM for SSL/TLS
- Network Load Balancer
- CloudFront distribution in front
- Route 53 A record pointing to CloudFront

And it worked. Everything worked perfectly. I was feeling like a real DevOps engineer. Then the messages started.

The Incident

We were migrating from EC2 to Kubernetes. I'd set up the EKS cluster and asked developers to test the microservices while I finished deploying the rest.

They reported the domain wouldn't resolve. On certain networks, browsers showed ERR_NAME_NOT_RESOLVED or DNS_PROBE_FINISHED_NXDOMAIN. But it worked fine from my machine and office network.

Some networks: broken. Others: fine.

I was sweating.

The Panic Phase

Everything checked out:

Application logs       → Fine
Kubernetes events      → Normal
Cluster health         → Good
Load balancer          → Healthy
Route 53 DNS records   → All there
CloudFront settings    → Correct

But developers were still getting resolution failures. Works from some networks (IPv4), fails from others (IPv6-only).

The pattern was screaming at me. I just wasn't listening.

The Rabbit Hole

I debugged the symptom, not the problem.

First 15 minutes: checked DNS. Records were there. CNAME was correct. CloudFront alias configured. But I kept debugging because that's where the error came from.

Next 15 minutes: tried different DNS providers. Changed to Google Public DNS. Flushed cache. Still broken from some networks, fine from others.

Next 15 minutes: blamed CloudFront. Checked everything. Tried cache invalidation. Tried recreating the distribution.

Then I started spiraling. Route 53 routing policy? Load balancer misconfigured? Rebuild the cluster?

One hour in. No progress. Just deeper into the wrong layer.

The Moment I Stopped (And Actually Thought)

Then it hit me: Why does it work from network A but not network B?

That's not a DNS question. That's not a CloudFront question. That's a network connectivity question.

I'd been debugging the wrong layer the whole time.

So I did what I should have done 45 minutes earlier. I checked connectivity from the network that was failing:

# From the problematic network:
$ curl https://ipv4.icanhazip.com
# (no response, timeout)

$ curl https://ipv6.icanhazip.com
# 2409:XXXX:XXXX::1 ✓

There it was. That network only gave me IPv6. No IPv4. But my entire cluster was IPv4-only.
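The two curl calls above can be wrapped into a small helper that names the kind of network you are sitting on. This is a minimal sketch, assuming curl is installed; `probe_family` and `classify_stack` are names made up for illustration, and the icanhazip.com hosts are the ones used above.

```shell
# Hypothetical helpers, not part of any existing tool.

# probe_family FAMILY URL
# Prints "yes" if URL answers over the given address family (4 or 6)
# within 5 seconds, "no" otherwise.
probe_family() {
  if curl "-$1" --silent --max-time 5 --output /dev/null "$2"; then
    echo yes
  else
    echo no
  fi
}

# classify_stack V4 V6
# Names the network type from two yes/no answers.
classify_stack() {
  case "$1/$2" in
    yes/yes) echo "dual-stack" ;;
    yes/no)  echo "ipv4-only" ;;
    no/yes)  echo "ipv6-only" ;;
    *)       echo "no connectivity" ;;
  esac
}

# Usage (makes real network requests):
#   classify_stack "$(probe_family 4 https://ipv4.icanhazip.com)" \
#                  "$(probe_family 6 https://ipv6.icanhazip.com)"
```

Run from the affected network, not from your own desk. The answer only means something from where the failure happens.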

The Realization

I had only created an A record (IPv4) in Route 53. With no AAAA record, clients on IPv6-only networks got back no address they could use, so to them the domain looked unresolvable.

Simple:

IPv6-only developer
    ↓
Looks up mydomain.com
    ↓
Route 53 returns only A record
    ↓
"I don't have IPv4, can't use this"
    ↓
ERR_NAME_NOT_RESOLVED
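The chain above can be checked directly by asking DNS for each record type. A minimal sketch, assuming `dig` is available; `mydomain.com` is the placeholder from above, and `explain_records` is a hypothetical helper that only interprets the two answers.

```shell
# explain_records A_ANSWER AAAA_ANSWER
# Takes the output of the two dig queries below and reports which
# address families the name can be reached over.
explain_records() {
  if [ -n "$1" ] && [ -n "$2" ]; then
    echo "A and AAAA present: reachable from IPv4 and IPv6 clients"
  elif [ -n "$1" ]; then
    echo "A only: IPv6-only clients get no usable address"
  elif [ -n "$2" ]; then
    echo "AAAA only: IPv4-only clients get no usable address"
  else
    echo "no address records: nobody can resolve the name"
  fi
}

# Usage (queries real DNS):
#   explain_records "$(dig +short A mydomain.com)" \
#                   "$(dig +short AAAA mydomain.com)"
```

An empty AAAA answer next to a populated A answer is exactly the situation in the diagram.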

The answer was staring at me. I was debugging the wrong layer.

The Fix

Add an AAAA alias record (IPv6) in Route 53, pointing to the same CloudFront distribution as the A record.

mydomain.com
  A record    → CloudFront
  AAAA record → CloudFront

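Now IPv6-only clients can resolve the domain. CloudFront accepts their IPv6 connections at the edge and reaches the IPv4-only origin on their behalf, provided IPv6 is enabled on the distribution (the default for new distributions). Took 5 minutes to fix. Took 55 minutes to find.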
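For reference, the record can also be created from the AWS CLI instead of the console. This is a sketch under assumptions: `d111111abcdef8.cloudfront.net` and `YOUR_ZONE_ID` are placeholders, and `aaaa_change_batch` is a hypothetical helper. `Z2FDTNDATAQYW2` is the fixed hosted zone ID Route 53 uses for CloudFront alias targets.

```shell
# aaaa_change_batch RECORD_NAME DISTRIBUTION_DOMAIN
# Prints a Route 53 change batch that upserts an AAAA alias record
# pointing RECORD_NAME at a CloudFront distribution.
aaaa_change_batch() {
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "AAAA",
        "AliasTarget": {
          "HostedZoneId": "Z2FDTNDATAQYW2",
          "DNSName": "$2",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
EOF
}

# Usage (changes real DNS; both values below are placeholders):
#   aaaa_change_batch mydomain.com d111111abcdef8.cloudfront.net > aaaa.json
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id YOUR_ZONE_ID --change-batch file://aaaa.json
```

Generating the JSON separately from applying it lets you read the change before it touches DNS.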

The Lesson

When something breaks unevenly, don't debug the error. Debug the difference.

I focused on the symptom (DNS resolution failed). I should have asked: why does it work from network A but not network B?

The difference tells you the problem:

Works: Networks with IPv4
Fails: Networks with IPv6 only
Problem: Only IPv4 DNS records

One question. That's all it took.
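That question can be made almost mechanical: write down what you know about one working client and one failing client, then look only at the facts that differ. A minimal sketch; the file names and the `key=value` fact format are made up for illustration.

```shell
# only_in FILE_A FILE_B
# Prints the lines of FILE_A that do not appear in FILE_B.
only_in() {
  grep -Fxv -f "$2" "$1"
}

# show_difference WORKING_FILE FAILING_FILE
# Prints the facts unique to each side and hides everything they share.
show_difference() {
  echo "only on working client:"
  only_in "$1" "$2" | sed 's/^/  /'
  echo "only on failing client:"
  only_in "$2" "$1" | sed 's/^/  /'
}

# Usage:
#   printf 'dns=8.8.8.8\nipv4=yes\nipv6=yes\n' > working.txt
#   printf 'dns=8.8.8.8\nipv4=no\nipv6=yes\n'  > failing.txt
#   show_difference working.txt failing.txt
```

With those two files, the shared DNS server and IPv6 lines drop out and only `ipv4=yes` versus `ipv4=no` is left, which is where the debugging should have started.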

Most engineers debug vertically, going deeper into the same layer. Good engineers debug horizontally: they find what's different between working and broken.

What You Should Know

When something breaks:

- Don't panic. Seriously. Panic makes you stupid.
- Check the obvious (logs, configs, health).
- If that's fine, zoom out. What's different between working and broken?
- Debug from the affected user's perspective, not your machine.

Wrong vs Right:

WRONG: "The error says DNS_PROBE_FINISHED_NXDOMAIN, so I'll debug DNS"
RIGHT: "Works from IPv4 networks, fails from IPv6 networks. Why?"

The Timeline

14:30 - Error report
14:35 - Check logs (fine)
14:40 - Check Kubernetes (fine)
14:50 - Blame DNS (waste 15 min)
15:05 - Try different DNS provider (waste 15 min)
15:20 - Blame CloudFront (waste 15 min)
15:35 - Finally ask: "Why does it work here but not there?"
15:40 - Realize: only A record, no AAAA record
15:45 - Add AAAA record to Route 53
15:50 - Test: Works

Total: 1 hour 20 minutes
Time wasted debugging wrong layer: 45 minutes
Time to actually fix: 5 minutes

The Real Lesson

You'll have incidents. You'll panic. That's normal.

The engineers who get ahead stop in the middle of the panic and ask: "Am I looking at the problem or the symptom?"

Symptom: DNS resolution failed
Problem: Missing AAAA record (IPv6 DNS)

I spent an hour on the symptom. Five minutes on the problem would have fixed it.

TL;DR

Problem: The domain in front of my EKS cluster resolved from IPv4 networks but failed from IPv6-only networks.
My Mistake: Debugged DNS, CloudFront, Kubernetes. All the wrong layers.
Root Cause: Only created A record (IPv4). No AAAA record (IPv6).
Fix: Add AAAA record to Route 53.
Pattern: When something breaks unevenly, debug the difference, not the error.

Tags: DevOps · Kubernetes · AWS · EKS · Debugging · IPv6 · IPv4 · CloudFront · Network · Lessons Learned · Infrastructure