Get rid of NAT64 setup

2026-06-16 00:29:18 +02:00
parent b993115b41
commit 679ebb3465
13 changed files with 316 additions and 419 deletions
@@ -0,0 +1,136 @@
+# Postmortem: NAT64 / IPv6-mostly attempt
+
+A record of an architecture that was built, run for ~2 days, and removed. Kept
+so the reasoning isn't re-discovered the hard way. For the current DNS setup see
+[coredns.md](./coredns.md); for the network overview see [network.md](./network.md).
+
+## The original problem
+
+The ISP provides no native IPv6; the only IPv6 connectivity is a Hurricane Electric
+(HE) 6in4 tunnel (`2001:470:61a3::/48`). HE address ranges are widely classified as
+datacenter/hosting space, so some sites (Google, Cloudflare-fronted services,
+various login flows) treat IPv6 traffic from them as bot/VPN traffic: endless
+CAPTCHAs, "unusual traffic" interstitials, or outright blocks. IPv4 egress
+(the ISP's residential PPPoE address) is unaffected.
+
+The goal: keep using the network normally without IPv6 triggering these flags,
+while still keeping some IPv6 (e.g. inbound to self-hosted services).
+
+## What was built
+
+An **IPv6-mostly** network (RFC 8925) with **DNS64 + NAT64**, intended to push
+egress onto IPv4 while presenting IPv6 to clients:
+
+- **CoreDNS container** with the `dns64` plugin (`translate_all`): synthesized
+  `64:ff9b::/96` AAAA records from A records for *all* names, so even dual-stack
+  destinations resolved to a NAT64 address.
+- **Tayga container** (`ghcr.io/apalrd/tayga-nat64`): stateless NAT64 translator.
+  IPv6 traffic to `64:ff9b::/96` was routed to it, translated to IPv4, and
+  masqueraded out the GPON PPPoE interface. So all "IPv6" egress actually left
+  as IPv4 on the residential address — bypassing the HE tunnel and its flagging.
+- **RouterOS RA + DHCP**: DHCP option 108 (IPv6-only preferred) to make capable
+  clients drop IPv4, PREF64 (RFC 8781) to advertise the NAT64 prefix for CLAT,
+  RDNSS (RFC 8106) to hand IPv6-only clients a resolver.
+- Dedicated `nat64` bridge, `fc64::/126` link, `192.168.240.0/20` Tayga pool,
+  static routes, and firewall rules (including NAT64-mapped RFC1918 blocks to
+  prevent the translator being used as a policy bypass).
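
The DNS64 step in the list above is plain address arithmetic, which is why the firewall later had to reason about it. A minimal sketch, assuming the RFC 6052 layout for a /96 prefix (the IPv4 address occupies the low 32 bits); the function names are illustrative and are not taken from CoreDNS or Tayga:

```python
import ipaddress

# Well-known NAT64 prefix (RFC 6052), as used in the setup described above.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the /96 prefix (DNS64 side)."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(NAT64_PREFIX.network_address) | int(v4))


def extract_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 destination from a NAT64 address (translator side)."""
    v6 = ipaddress.IPv6Address(ipv6)
    if v6 not in NAT64_PREFIX:
        raise ValueError(f"{v6} is not inside {NAT64_PREFIX}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)
```

For example, `synthesize_aaaa("192.0.2.1")` yields `64:ff9b::c000:201`, and `extract_ipv4` reverses it. Because the mapping is mechanical, any IPv4 address, public or private, has a NAT64 form.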
+
+## Why it was removed
+
+### 1. Performance — the dealbreaker
+
+Throughput collapsed from line rate (~1 Gbps) to **~200-300 Mbps**, saturating
+the router CPU. Causes, all structural:
+
+- Tayga is a **userspace** translator. Every translated packet leaves the kernel
+  fastpath, is copied to userspace, translated, and re-injected.
+- Translated traffic crosses RouterOS **twice** — once as IPv6 (LAN → Tayga),
+  once as IPv4 (Tayga → WAN, with masquerade) — doubling firewall/conntrack work.
+- No hardware offload or fasttrack applies to either leg.
+
+With `translate_all`, *nearly all* internet traffic went through this path, so
+the penalty hit everything, not just IPv4-only destinations.
+
+### 2. Single point of failure
+
+DNS (CoreDNS) and most of the datapath (Tayga) became two containers in the
+critical path on a router whose built-in forwarder had been completely reliable.
+Container restarts, image pulls, or a crash now took down connectivity.
+
+### 3. Architectural inversion
+
+NAT64 exists to let **IPv6-only** clients reach the **IPv4** internet. The actual
+goal here was the opposite — *avoid* IPv6 egress entirely. Building an IPv6-only
+client environment (option 108, CLAT, PREF64) and then translating all of it back
+to IPv4 was solving the problem backwards. The complexity existed only to route
+around a property of the HE tunnel.
+
+### 4. Firewall complexity and a translation bypass hole
+
+NAT64 punched a hole in the firewall model. RouterOS filters IPv4 and IPv6
+independently, but NAT64 traffic enters as IPv6 and *leaves* as IPv4 after
+translation — so the carefully-built IPv4 forward policy (inter-VLAN isolation,
+RFC1918-to-WAN blocks) was simply bypassed for anything arriving via the
+translator. A client could reach a private IPv4 range by encoding it in the
+NAT64 prefix (`64:ff9b::c0a8:xxyy` = `192.168.x.y`), and the IPv4 rules would
+never see it because the packet was IPv6 until Tayga rewrote it.
+
+Plugging this required mirroring the IPv4 policy in the IPv6 chain: explicit
+`reject` rules for every NAT64-mapped RFC1918 block (`64:ff9b::a00:0/104`,
+`64:ff9b::ac10:0/108`, `64:ff9b::c0a8:0/112`), per-VLAN accept rules toward the
+`nat64` interface, plus a separate masquerade and LB hairpin-accept for the
+Tayga pool. That is a parallel, easy-to-get-wrong copy of the existing ruleset,
+whose correctness depended on getting CIDR-to-prefix arithmetic right. Removing
+NAT64 deleted all of it.
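
The CIDR-to-prefix arithmetic mentioned above can be checked mechanically instead of by hand. A small sketch, assuming the `64:ff9b::/96` prefix; `nat64_block` and `maps_to_private` are hypothetical helper names, not part of any RouterOS tooling:

```python
import ipaddress

NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")
RFC1918 = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")


def nat64_block(cidr: str) -> ipaddress.IPv6Network:
    """Map an IPv4 CIDR to the IPv6 prefix covering its NAT64-mapped addresses."""
    v4 = ipaddress.IPv4Network(cidr)
    base = int(NAT64_PREFIX.network_address) | int(v4.network_address)
    # The prefix length is the NAT64 prefix length (96) plus the IPv4 length.
    return ipaddress.IPv6Network((base, NAT64_PREFIX.prefixlen + v4.prefixlen))


def maps_to_private(dst: str) -> bool:
    """True if an IPv6 destination would be translated into an RFC1918 range."""
    addr = ipaddress.IPv6Address(dst)
    return any(addr in nat64_block(cidr) for cidr in RFC1918)
```

Running `nat64_block` over the three RFC1918 ranges reproduces the `/104`, `/108` and `/112` prefixes listed above, and `maps_to_private("64:ff9b::c0a8:101")` (that is, `192.168.1.1`) returns `True`.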
+
+### 5. Operational fragility (see coredns.md for detail)
+
+The setup had a long tail of subtle failure modes, each presenting identically
+as "client can't connect":
+
+- RouterOS static `FWD` entries return `NOERROR`/empty instead of relaying
+  `NXDOMAIN`, which broke `getaddrinfo` search-domain handling in Kubernetes
+  pods (`ENOTFOUND` for valid names).
+- `translate_all` discarded real AAAA for IPv6-only internal services, and
+  returned empty answers for names with no A record.
+- Per-interface RouterOS `ipv6 nd` entries default to `advertise-dns=no` and must
+  be *created* (not modified), so RDNSS/PREF64 were silently never advertised.
+- Dynamic `from-pool` VLAN addressing made advertised RDNSS addresses point at
+  nonexistent router addresses.
+- Clients honoured option 108 before the NAT64 path was verified working, which
+  left them stuck at "obtaining IP address".
+
+Each was individually fixable, but the aggregate was a brittle system whose
+benefit didn't justify the surface area.
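
The `translate_all` failure mode from the list above is easiest to see as a decision table. The following is a simplified model of the behaviour described in this document, not CoreDNS code; the non-`translate_all` branch assumes the usual DNS64 rule of synthesizing only when no real AAAA exists:

```python
import ipaddress
from typing import List

NAT64_BASE = int(ipaddress.IPv6Address("64:ff9b::"))


def synth(ipv4: str) -> str:
    """Embed an IPv4 address in 64:ff9b::/96 and return it as text."""
    return str(ipaddress.IPv6Address(NAT64_BASE | int(ipaddress.IPv4Address(ipv4))))


def dns64_answer(a: List[str], aaaa: List[str], translate_all: bool) -> List[str]:
    """Return the AAAA answer a client would see for a name (modelled)."""
    if translate_all:
        # Real AAAA records are ignored; only A records are translated.
        return [synth(ip) for ip in a]
    # Assumed default: keep real AAAA, synthesize only as a fallback.
    return list(aaaa) if aaaa else [synth(ip) for ip in a]
```

In this model an IPv6-only internal service (`a=[]`, one real AAAA) gets an empty answer under `translate_all=True`, which matches the symptom above, while the fallback mode would have returned the real record.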
+
+## What replaced it
+
+Plain CoreDNS forwarder with **AAAA suppression by default** plus a whitelist for
+domains that should keep IPv6 (our own zone over the HE prefix, and any explicitly
+trusted domain). Clients prefer IPv4 because they simply don't receive AAAA for
+most names — no translation, no extra datapath hop, packet forwarding stays on the
+RouterOS fastpath at line rate. DNS is the only thing in the path. See
+[coredns.md](./coredns.md).
+
+Tradeoff accepted: a non-whitelisted IPv6-only destination (no A record) is
+unreachable. In practice essentially everything on the public internet still has
+an A record. The intended future refinement is a CoreDNS plugin that suppresses
+AAAA only when an A record also exists, removing the need for the whitelist; no
+in-tree plugin does this today.
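
The difference between the current policy and the intended refinement is small enough to sketch as two pure functions. This is an illustration only, not a CoreDNS plugin; the whitelist entry `home.example.` is a placeholder, since the real zone names are not given here:

```python
from typing import Iterable, List

# Hypothetical whitelist; stands in for "our own zone" and trusted domains.
IPV6_WHITELIST = ("home.example.",)


def is_whitelisted(qname: str, whitelist: Iterable[str] = IPV6_WHITELIST) -> bool:
    """True if qname equals a whitelisted zone or is a subdomain of one."""
    name = qname.lower().rstrip(".") + "."
    return any(name == zone or name.endswith("." + zone) for zone in whitelist)


def filter_aaaa_current(qname: str, aaaa: List[str]) -> List[str]:
    """Current policy: suppress AAAA unless the name is whitelisted."""
    return list(aaaa) if is_whitelisted(qname) else []


def filter_aaaa_future(a: List[str], aaaa: List[str]) -> List[str]:
    """Proposed refinement: suppress AAAA only when an A record also exists."""
    return [] if a else list(aaaa)
```

Under the refinement, a dual-stack name still loses its AAAA (so clients use IPv4), while an IPv6-only name keeps it and stays reachable without a whitelist entry. The cost is that the resolver needs both the A and AAAA results before answering an AAAA query.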
+
+## Lessons
+
+- **Measure throughput before committing to an in-path translator on SOHO-class
+  hardware.** Userspace NAT64 (Tayga in a container) on a MikroTik CPU topped out
+  at ~200-300 Mbps here, well short of saturating a gigabit line.
+- **Match the mechanism to the actual goal.** The goal was "prefer IPv4 egress",
+  which is a one-line DNS policy, not a transition technology.
+- **Prefer solutions that stay on the fastpath.** Anything that pulls bulk
+  traffic into userspace or doubles the forwarding work will dominate the CPU.
+- **Fewer moving parts in the critical path.** Two containers carrying all DNS
+  and most traffic is a worse availability story than the stock forwarder, for a
+  cosmetic benefit (avoiding CAPTCHAs on some sites).
+- **Protocol translation breaks the firewall model.** When traffic changes L3
+  protocol mid-path, the two firewall policies must be kept in sync by hand, and
+  any gap is a silent bypass. A solution that doesn't translate keeps a single
+  coherent policy.