diff --git a/docs/DESIGN.md b/docs/DESIGN.md index 5c0b2e3..1b97f09 100644 --- a/docs/DESIGN.md +++ b/docs/DESIGN.md @@ -15,22 +15,26 @@ Measured flattened rootfs for the arm64 image: | Component | On-disk size | |---|---| -| `tailscale.combined` (UPX-compressed) | ~2.98 MB | +| `tailscale.combined` (UPX-compressed) | ~3.47 MB | | custom static busybox (UPX, ~100 applets) | ~218 kB | | CA certificates | ~213 kB | -| **Total extracted rootfs** | **~3.4 MB** | +| **Total extracted rootfs** | **~3.9 MB** | -(The compressed image / transfer tarball is ~3.3–4.3 MB depending on arch.) +The `tailscale.combined` figure includes `netstack` (gVisor), which adds +~0.5 MB on disk over a netstack-omitted build — a deliberate inclusion, see +[Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun). + +(The compressed image / transfer tarball is ~3.8–4.3 MB depending on arch.) | Arch | Image (compressed) | |---|---| -| amd64 | ~4.2 MB | -| arm64 | ~3.5 MB | -| arm/v7 | ~3.5 MB | +| amd64 | ~4.3 MB | +| arm64 | ~4.0 MB | +| arm/v7 | ~4.0 MB | -On a deployed RouterOS device the container consumes **~3.7 MiB of flash** +On a deployed RouterOS device the container consumes **~4.2 MiB of flash** (measured by `free-hdd-space` delta). Note that `du` *inside* the container -reports roughly double that (~7 MB) — that is RouterOS block-allocation +reports roughly double that (~8 MB) — that is RouterOS block-allocation rounding, **not** real usage or duplication; see [Avoiding overlayfs layer duplication](#avoiding-overlayfs-layer-duplication) for how to measure correctly. @@ -118,13 +122,13 @@ delta**, not `du`: /system/resource/print # note free-hdd-space before and after adding the container ``` -The container should consume **~3.7 MiB** of flash (e.g. 94.6 → 90.9 MiB free). +The container should consume **~4.2 MiB** of flash (e.g. 94.6 → 90.4 MiB free). Do **not** trust `du` inside the container for this. Busybox `du` reports -*allocated blocks*, and RouterOS's container store rounds a ~3 MB file up to -~6 MB of blocks — so `du -sx /` reports ~7 MB even though real flash use is -~3.7 MB. `ls -la /usr/local/bin` confirms the binary's true content size -(~3.1 MB) and that it is a single file with two symlinks (no duplication). +*allocated blocks*, and RouterOS's container store rounds the ~3.5 MB binary up +to ~7 MB of blocks — so `du -sx /` reports ~8 MB even though real flash use is +~4.2 MB. `ls -la /usr/local/bin` confirms the binary's true content size +(~3.5 MB) and that it is a single file with two symlinks (no duplication). The image itself carries the binary in exactly one layer (verified at the blob level); the inflation is purely the filesystem's block accounting. @@ -149,7 +153,8 @@ that's a separate build, not just a `--platform` change. | `advertise-routes` | Expose LAN subnets to the tailnet | | `use-exit-node` | Route the router's own traffic via a remote exit node | | `accept-routes` | Receive subnet routes from other tailnet nodes | -| DNS / MagicDNS | Resolve `*.ts.net` names | +| DNS / MagicDNS | Resolve `*.ts.net` names (resolver + resolv.conf manager). **Note:** serving `100.100.100.100` also requires `netstack` — see [Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun) | +| `netstack` + `gro` | gVisor userspace stack. Counter-intuitively **required** to serve MagicDNS on `100.100.100.100`, even though the router uses a real kernel TUN — see [Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun) | | portmapper (NAT-PMP/PCP/UPnP) | Punch through upstream NAT | | listenrawdisco | Raw socket disco for better NAT traversal | | health | Powers `tailscale status` output | @@ -166,7 +171,6 @@ that's a separate build, not just a `--platform` change. | `cachenetmap` | **Deliberately removed** — see [Why netmap disk-caching is removed](#why-netmap-disk-caching-is-removed) | | `logtail` | Would attempt persistent log writes; wear flash. Removing it also removes stderr verbosity filtering — restored by an injected filter, see [Log verbosity filtering](#log-verbosity-filtering) | | `netlog` | Network flow logging; separate concern | -| `netstack` + `gro` | Userspace/gVisor networking; router uses kernel TUN | | `ssh` | Access via MikroTik SSH + `tailscale` CLI instead | | `linuxdnsfight` | inotify on `/etc/resolv.conf`; no systemd in container | | `networkmanager` / `resolved` / `dbus` / `sdnotify` | No systemd stack in container | @@ -226,6 +230,84 @@ the in-memory resilience (the common case) while eliminating per-netmap flash writes. Only `tailscaled.state` (written on auth / key rotation) ever touches flash. +### Why netstack is required (even with a kernel TUN) + +This is the least obvious inclusion in the build, so it is documented in full. + +`netstack` is Tailscale's embedded **gVisor userspace TCP/IP stack**. The +natural assumption — and what earlier versions of this build acted on — is that +a router which owns a **real kernel TUN device** (it is *not* run with +`--tun=userspace-networking`) has no use for a userspace stack, so `netstack` +(and its dependent `gro`) can be omitted to save space. That assumption is +**wrong for one specific, important path: MagicDNS.** + +**MagicDNS on `100.100.100.100` is served only by netstack.** In Tailscale +v1.98.5 the in-process listener for the Tailscale service IP +(`100.100.100.100:53`, UDP) is installed exclusively by netstack's +`handleLocalPackets`, wired into the TUN wrapper as +`PreFilterPacketOutboundToWireGuardNetstackIntercept` +(`wgengine/netstack/netstack.go`). When a packet leaves the host toward +`100.100.100.100`, this hook absorbs it into the gVisor stack, whose UDP-53 +acceptor runs the MagicDNS resolver. + +**The "engine fallback" does not actually exist.** The TUN wrapper consults a +second hook, `PreFilterPacketOutboundToWireGuardEngineIntercept`, and a comment +in `net/tstun/wrap.go` claims it "primarily handles quad-100 if netstack is not +installed." In v1.98.5 that comment is **false on Linux**: the engine +`handleLocalPackets` (`wgengine/userspace.go`) only reflects loopback on +darwin/ios/plan9 and otherwise returns `Accept` — it never touches +`100.100.100.100`. So with `ts_omit_netstack` there is **no** code that absorbs +quad-100 packets at all. + +**`dns` and `netstack` are independent tags.** The `dns` feature (which this +build opts in) links the resolver and the `/etc/resolv.conf` manager, but it has +no dependency on `netstack` and does **not** install any quad-100 transport. +The net result of `dns` on + `netstack` off is a resolver that is correctly +wired up but that **never receives any packets** — the worst kind of silent +breakage. Symptoms observed on the device: + +- `/etc/resolv.conf` correctly points at `100.100.100.100` (the manager works), +- but `dig anything @100.100.100.100` from inside the container **times out** + ("no servers could be reached"), +- and even tailnet-internal names fail: `ping host..ts.net` → + `bad address` (a name that needs **no** upstream forwarding still can't + resolve, proving the listener itself is dead, not an upstream-resolver issue), +- while `ping 1.1.1.1` (a raw IP needing no DNS) works fine over the kernel data + path — confirming forwarding/exit-node connectivity is unaffected and isolating + the fault to DNS serving. + +**It also fixed a crash.** Omitting `netstack` set `buildfeatures.HasNetstack` +to a compile-time `false`, which turned the guard in +`net/tstun.invertGSOChecksum` (`if !HasNetstack { panic("unreachable") }`) into +an always-panic. That function is called on the packet-injection path used when +enabling exit-node mode, producing `panic: unreachable` and a daemon restart +loop. Enabling `netstack` makes `HasNetstack` a const `true`, so the guard +becomes dead code and the crash disappears as a side effect — fixed at the root +cause rather than patched around. + +**Cost.** Measured on arm64, a netstack-enabled build versus a netstack-omitted +one: + +| Metric | netstack omitted | netstack enabled | Delta | +|---|---|---|---| +| Extracted rootfs (flash) | ~3.42 MB | ~3.91 MB | **+0.49 MB** | +| `tailscale.combined` on disk (UPX) | ~2.99 MB | ~3.47 MB | +0.48 MB | +| Resident RAM after UPX decompress | ~12.25 MB | ~14.56 MB | **+2.31 MB** | + +The flash cost (~0.5 MB) is negligible on a 16 MB-class device. The RAM cost +(~2.3 MB resident) is the real consideration on low-memory models, but is +acceptable given that without it MagicDNS is entirely non-functional. The +trade is: **half a megabyte of flash to make MagicDNS work at all.** `gro` +(Generic Receive Offload) depends on `netstack` and is pulled in alongside it; +it is small and improves throughput on the netstack path. + +**Caveat for future Tailscale bumps.** This coupling (quad-100 serving living +only in netstack) is an upstream implementation detail, not a stable contract. +If a future release adds a genuine non-netstack quad-100 path — or the daemon +itself is refactored — re-test whether `netstack` can be dropped again. The +canary is simple: from inside the container, `dig google.com @100.100.100.100` +must return answers and `ping ..ts.net` must resolve. + ### Log verbosity filtering Upstream `tailscaled` embeds verbosity tags (`[v1]`, `[v2]`, …) inside its log