Get rid of NAT64 setup
ci/woodpecker/push/flux-reconcile-source Pipeline was successful
ci/woodpecker/push/coredns-build Pipeline was successful

2026-06-16 00:29:18 +02:00
parent b993115b41
commit 679ebb3465
13 changed files with 316 additions and 419 deletions
@@ -0,0 +1,136 @@
# Postmortem: NAT64 / IPv6-mostly attempt
A record of an architecture that was built, run for ~2 days, and removed. Kept
so the reasoning isn't re-discovered the hard way. For the current DNS setup see
[coredns.md](./coredns.md); for network overview see [network.md](./network.md).
## The original problem
The ISP provides no native IPv6 — only a Hurricane Electric (HE) 6in4 tunnel
(`2001:470:61a3::/48`). HE address ranges are widely classified as
datacenter/hosting space, so some sites (Google, Cloudflare-fronted services,
various login flows) treat IPv6 traffic from them as bot/VPN traffic: endless
CAPTCHAs, "unusual traffic" interstitials, or outright blocks. IPv4 egress
(the ISP's residential PPPoE address) is unaffected.
The goal: keep using the network normally without IPv6 egress triggering these
flags, while retaining some IPv6 (e.g. inbound to self-hosted services).
## What was built
An **IPv6-mostly** network (RFC 8925) with **DNS64 + NAT64**, intended to push
egress onto IPv4 while presenting IPv6 to clients:
- **CoreDNS container** with the `dns64` plugin (`translate_all`): synthesized
`64:ff9b::/96` AAAA records from A records for *all* names, so even dual-stack
destinations resolved to a NAT64 address.
- **Tayga container** (`ghcr.io/apalrd/tayga-nat64`): stateless NAT64 translator.
IPv6 traffic to `64:ff9b::/96` was routed to it, translated to IPv4, and
masqueraded out the GPON PPPoE interface. So all "IPv6" egress actually left
as IPv4 on the residential address — bypassing the HE tunnel and its flagging.
- **RouterOS RA + DHCP**: DHCP option 108 (IPv6-only preferred) to make capable
clients drop IPv4, PREF64 (RFC 8781) to advertise the NAT64 prefix for CLAT,
RDNSS (RFC 8106) to hand IPv6-only clients a resolver.
- Dedicated `nat64` bridge, `fc64::/126` link, `192.168.240.0/20` Tayga pool,
static routes, and firewall rules (including NAT64-mapped RFC1918 blocks to
prevent the translator being used as a policy bypass).
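
To make the DNS64 step concrete, here is a minimal Python sketch of how an AAAA record is synthesized from an A record under the `64:ff9b::/96` prefix. It is an illustration of the address embedding only, not the CoreDNS `dns64` plugin's code; the function name `synthesize_aaaa` and the documentation address `192.0.2.33` are assumptions chosen for the example.

```python
import ipaddress

# Well-known NAT64 prefix, as used by the setup described above.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the NAT64 /96 prefix.

    Hypothetical helper for illustration; a real DNS64 resolver does this
    per A record when it builds the synthetic AAAA answer.
    """
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(NAT64_PREFIX.network_address) | int(v4))


if __name__ == "__main__":
    # 192.0.2.33 is a documentation address, used here as a stand-in.
    print(synthesize_aaaa("192.0.2.33"))  # 64:ff9b::c000:221
```

With `translate_all`, this mapping was applied to every name with an A record, which is why even dual-stack destinations ended up behind the translator.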
## Why it was removed
### 1. Performance — the dealbreaker
Throughput collapsed from line rate (~1 Gbps) to **~200-300 Mbps**, with the
router CPU saturated. The causes were all structural:
- Tayga is a **userspace** translator. Every translated packet leaves the kernel
fastpath, is copied to userspace, translated, and re-injected.
- Translated traffic crosses RouterOS **twice** — once as IPv6 (LAN → Tayga),
once as IPv4 (Tayga → WAN, with masquerade) — doubling firewall/conntrack work.
- No hardware offload or fasttrack applies to either leg.
With `translate_all`, *nearly all* internet traffic went through this path, so
the penalty hit everything, not just IPv4-only destinations.
### 2. Single point of failure
DNS (CoreDNS) and most of the datapath (Tayga) now depended on two containers in
the critical path, on a router whose built-in forwarder had been completely
reliable. A container restart, image pull, or crash now took down connectivity.
### 3. Architectural inversion
NAT64 exists to let **IPv6-only** clients reach the **IPv4** internet. The actual
goal here was the opposite — *avoid* IPv6 egress entirely. Building an IPv6-only
client environment (option 108, CLAT, PREF64) and then translating all of it back
to IPv4 was solving the problem backwards. The complexity existed only to route
around a property of the HE tunnel.
### 4. Firewall complexity and a translation bypass hole
NAT64 punched a hole in the firewall model. RouterOS filters IPv4 and IPv6
independently, but NAT64 traffic enters as IPv6 and *leaves* as IPv4 after
translation — so the carefully-built IPv4 forward policy (inter-VLAN isolation,
RFC1918-to-WAN blocks) was simply bypassed for anything arriving via the
translator. A client could reach a private IPv4 range by encoding it in the
NAT64 prefix (`64:ff9b::c0a8:xxyy` = `192.168.x.y`), and the IPv4 rules would
never see it because the packet was IPv6 until Tayga rewrote it.
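
The bypass can be illustrated by reversing the embedding. The sketch below is a hypothetical Python helper (`embedded_ipv4` is not part of any tool mentioned here) that extracts the IPv4 destination Tayga would produce from a NAT64 address; the example address is made up.

```python
import ipaddress

NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def embedded_ipv4(addr: str):
    """Return the IPv4 address embedded in a NAT64 address, or None.

    Illustrative only: this mirrors what the translator does to the
    destination address, which the IPv6 firewall never sees as IPv4.
    """
    v6 = ipaddress.IPv6Address(addr)
    if v6 not in NAT64_PREFIX:
        return None
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


if __name__ == "__main__":
    target = embedded_ipv4("64:ff9b::c0a8:10a")
    # To the IPv6 chain this is just a 64:ff9b:: destination; after
    # translation it is a private IPv4 host.
    print(target, target.is_private)  # 192.168.1.10 True
```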
Plugging this required mirroring the IPv4 policy in the IPv6 chain: explicit
`reject` rules for every NAT64-mapped RFC1918 block (`64:ff9b::a00:0/104`,
`64:ff9b::ac10:0/108`, `64:ff9b::c0a8:0/112`), per-VLAN accept rules toward the
`nat64` interface, plus a separate masquerade and LB hairpin-accept for the
Tayga pool. That is a parallel, easy-to-get-wrong copy of the existing ruleset,
whose correctness depended on getting CIDR-to-prefix arithmetic right. Removing
NAT64 deleted all of it.
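
The CIDR-to-prefix arithmetic is mechanical: the IPv4 network bits go into the low 32 bits of the `/96`, and the prefix length becomes 96 plus the IPv4 prefix length. A minimal Python sketch, assuming the `64:ff9b::/96` prefix from above (`nat64_mapped` is a hypothetical helper name):

```python
import ipaddress

NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def nat64_mapped(cidr: str) -> ipaddress.IPv6Network:
    """Map an IPv4 network to the IPv6 network that covers it under NAT64."""
    v4 = ipaddress.IPv4Network(cidr)
    base = int(NAT64_PREFIX.network_address) | int(v4.network_address)
    # New prefix length: 96 bits of NAT64 prefix + the IPv4 prefix length.
    return ipaddress.IPv6Network((base, NAT64_PREFIX.prefixlen + v4.prefixlen))


if __name__ == "__main__":
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"):
        print(cidr, "->", nat64_mapped(cidr))
```

Run against the three RFC1918 blocks, this reproduces the prefixes used in the `reject` rules. Generating them rather than computing them by hand would have removed one source of error, but not the need to maintain the duplicate ruleset.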
### 5. Operational fragility (see coredns.md for detail)
The setup had a long tail of subtle failure modes, each presenting identically
as "client can't connect":
- RouterOS static `FWD` entries return `NOERROR`/empty instead of relaying
`NXDOMAIN`, which broke `getaddrinfo` search-domain handling in Kubernetes
pods (`ENOTFOUND` for valid names).
- `translate_all` discarded real AAAA for IPv6-only internal services, and
returned empty answers for names with no A record.
- Per-interface RouterOS `ipv6 nd` entries default to `advertise-dns=no` and must
be *created* (not modified), so RDNSS/PREF64 were silently never advertised.
- Dynamic `from-pool` VLAN addressing made advertised RDNSS addresses point at
nonexistent router addresses.
- Clients that honoured option 108 before the NAT64 path was verified working
dropped IPv4 and were left stuck at "obtaining IP address".
Each was individually fixable, but the aggregate was a brittle system whose
benefit didn't justify the surface area.
## What replaced it
Plain CoreDNS forwarder with **AAAA suppression by default** plus a whitelist for
domains that should keep IPv6 (our own zone over the HE prefix, and any explicitly
trusted domain). Clients prefer IPv4 because they simply don't receive AAAA for
most names — no translation, no extra datapath hop, packet forwarding stays on the
RouterOS fastpath at line rate. DNS is the only thing in the path. See
[coredns.md](./coredns.md).
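
The policy itself is small. The following Python sketch models the decision only; it is not the CoreDNS configuration (see coredns.md for that), and the whitelist entries, function names, and answer representation are all assumptions for illustration.

```python
AAAA = 28  # DNS RR type number for AAAA
A = 1      # DNS RR type number for A

# Hypothetical whitelist: zones that should keep their AAAA records.
WHITELIST = frozenset({"home.example", "trusted.example"})


def keeps_aaaa(qname: str, whitelist=WHITELIST) -> bool:
    """True if qname equals, or is a subdomain of, a whitelisted zone."""
    labels = qname.rstrip(".").lower().split(".")
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


def filter_answer(qname: str, qtype: int, answers: list, whitelist=WHITELIST) -> list:
    """Suppress AAAA answers by default; pass everything else through.

    An empty list models a NOERROR response with no records, so the
    client falls back to the A record.
    """
    if qtype == AAAA and not keeps_aaaa(qname, whitelist):
        return []
    return answers


if __name__ == "__main__":
    print(filter_answer("www.example.com.", AAAA, ["2001:db8::1"]))   # []
    print(filter_answer("svc.home.example.", AAAA, ["2001:db8::2"]))  # kept
    print(filter_answer("www.example.com.", A, ["192.0.2.1"]))        # kept
```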
Tradeoff accepted: a non-whitelisted IPv6-only destination (no A record) is
unreachable. In practice essentially everything on the public internet still has
an A record. The intended future refinement is a CoreDNS plugin that suppresses
AAAA only when an A record also exists, removing the need for the whitelist; no
in-tree plugin does this today.
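
The intended refinement can be sketched as follows. This is a hypothetical Python model of the logic, not an existing plugin; `lookup` stands in for whatever upstream query mechanism such a plugin would use, and the names in the example table are invented.

```python
def resolve_aaaa(name: str, lookup) -> list:
    """Return AAAA records only when the name has no A record.

    `lookup(name, rtype)` is an assumed callable returning a list of
    records (empty if none exist).
    """
    aaaa = lookup(name, "AAAA")
    if aaaa and lookup(name, "A"):
        # Dual-stack name: suppress AAAA so clients use IPv4.
        return []
    # IPv6-only name (or no AAAA at all): pass the answer through.
    return aaaa


if __name__ == "__main__":
    # Fake upstream data for illustration.
    records = {
        ("dual.example", "A"): ["192.0.2.10"],
        ("dual.example", "AAAA"): ["2001:db8::10"],
        ("v6only.example", "AAAA"): ["2001:db8::20"],
    }
    fake_lookup = lambda name, rtype: records.get((name, rtype), [])

    print(resolve_aaaa("dual.example", fake_lookup))    # []
    print(resolve_aaaa("v6only.example", fake_lookup))  # ['2001:db8::20']
```

The cost of this approach is a second upstream query for each AAAA request, which is one reason it needs a purpose-built plugin rather than a static rule.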
## Lessons
- **Measure throughput before committing to an in-path translator on SOHO-class
hardware.** Userspace NAT64 (Tayga in a container) on a MikroTik CPU is fine
for a few hundred Mbps, not for saturating a gigabit line.
- **Match the mechanism to the actual goal.** The goal was "prefer IPv4 egress",
which is a one-line DNS policy, not a transition technology.
- **Prefer solutions that stay on the fastpath.** Anything that pulls bulk
traffic into userspace or doubles the forwarding work will dominate the CPU.
- **Fewer moving parts in the critical path.** Two containers carrying all DNS
and most traffic is a worse availability story than the stock forwarder, for a
cosmetic benefit (avoiding CAPTCHAs on some sites).
- **Protocol translation breaks the firewall model.** When traffic changes L3
protocol mid-path, the two firewall policies must be kept in sync by hand, and
any gap is a silent bypass. A solution that doesn't translate keeps a single
coherent policy.