diff --git a/Dockerfile b/Dockerfile index f8a9b6a..98b2ccb 100644 --- a/Dockerfile +++ b/Dockerfile @@ -70,49 +70,6 @@ WORKDIR /src/tailscale # disables the filter at runtime for debugging — no rebuild needed. COPY patches/stderr_verbosity_filter.go cmd/tailscaled/ -# Patch net/tstun/wrap.go: fix panic("unreachable") in invertGSOChecksum for -# ts_omit_netstack builds. -# -# invertGSOChecksum is a gVisor/GSO helper that inverts a transport-layer -# checksum before/after SNAT when gVisor hands us a segment with a partial -# checksum (NeedsCsum=true). It is only meaningful when netstack (gVisor) is -# compiled in (HasNetstack=true). -# -# The function correctly guards its body with: -# if !buildfeatures.HasNetstack { panic("unreachable") } -# -# When built with ts_omit_netstack, HasNetstack is a const false, so that guard -# evaluates to `if true { panic(...) }` — the function always panics. -# -# The problem: invertGSOChecksum is called unconditionally from injectedRead() -# (twice, around pc.snat()), even for the res.data path where res.packet==nil -# and gso is a zero-value netstack_GSO (NeedsCsum=false). The HasNetstack -# guard in the res.packet branch does NOT protect these calls. -# -# As a result, any code path that injects an outbound packet via InjectOutbound() -# — which happens when enabling exit-node use (Tailscale sends TSMP messages -# and synthesizes packets through the TUN injection path) — hits injectedRead -# with res.data!=nil, calls invertGSOChecksum, and crashes with: -# panic: unreachable -# tailscale.com/net/tstun.invertGSOChecksum(...) -# tailscale.com/net/tstun.(*Wrapper).injectedRead(...) wrap.go:1077 -# -# Fix: replace the `panic("unreachable")` with a `return` in invertGSOChecksum. -# When HasNetstack=false (ts_omit_netstack), a zero-value netstack_GSO always -# has NeedsCsum=false, so the function is correctly a no-op anyway. This matches -# what the function would do if the rest of its body ran: NeedsCsum=false → return. 
-# -# The sed expression targets the function precisely: it matches the three-line -# sequence that opens invertGSOChecksum's HasNetstack guard, and replaces only -# the panic line with return. The pattern is stable across minor reformats -# because it anchors on the literal function comment and the specific panic string. -# -# See tailscale/tailscale issue for context (no upstream fix as of v1.98.5): -# panic happens when using exit-node via a ts_omit_netstack build. -RUN sed -i \ - -e '/func invertGSOChecksum/,/^}/ s/\t\tpanic("unreachable")/\t\treturn/' \ - net/tstun/wrap.go - # Build a minimal combined binary (tailscale CLI + tailscaled daemon in one file). # # Tag strategy — ALLOWLIST, not blocklist: @@ -148,6 +105,34 @@ RUN sed -i \ # waiting for completion") WITHOUT printing the auth URL # or confirming success. Including it makes interactive # 'up' behave normally (blocks, prints login URL). +# netstack — gVisor userspace network stack. Counter-intuitively +# REQUIRED even though the router uses a real kernel TUN +# (NOT --tun=userspace-networking). In v1.98.5 the +# 100.100.100.100:53 MagicDNS listener is served ONLY by +# netstack's handleLocalPackets, installed via +# PreFilterPacketOutboundToWireGuardNetstackIntercept. +# The non-netstack "engine" interceptor that the wrap.go +# comments claim handles quad-100 "if netstack is not +# installed" does NOT actually do so on Linux (its body +# only reflects loopback on darwin/ios/plan9, else +# Accept). So with ts_omit_netstack, NOTHING absorbs +# packets to 100.100.100.100: queries fall through to +# WireGuard, no peer owns that IP, and even tailnet-name +# resolution (and 'ping host.tailnet.ts.net') times out. +# The 'dns' tag links the resolver but nothing routes +# packets to it without netstack — the two tags are +# independent (dns has no Dep on netstack). 
Omitting +# netstack ALSO triggered a panic("unreachable") in +# net/tstun.invertGSOChecksum on the exit-node inject +# path (HasNetstack=const false made the guard always +# panic); enabling netstack makes that guard dead code, +# fixing the crash as a side effect. Cost (arm64, vs a +# netstack-omitted build): ~+0.5 MB extracted on flash +# and ~+2.3 MB resident RAM after UPX decompression — +# measured, acceptable for a 16 MB-flash router. +# gro — Generic Receive Offload (perf). Depends on netstack; +# pulled in with it. Small, and improves throughput on +# the netstack DNS/inject path. # # Everything else remains omitted, including (rationale): # clientupdate — DELIBERATELY removed. The built-in updater would download @@ -172,9 +157,11 @@ RUN sed -i \ # which is exactly the flash wear we want to avoid. # logtail — no persistent log writes to flash; also pass # --no-logs-no-support at runtime -# netstack+gro — userspace networking; router uses kernel TUN # ssh — not needed; access via MikroTik SSH + tailscale CLI # all GUI/desktop/cloud/k8s features — irrelevant for a headless router +# +# NOTE: netstack/gro are NOT in this omit list — see the opted-in section above +# for why MagicDNS quad-100 serving structurally requires them in v1.98.5. 
RUN mkdir -p /out && \ ALL_OMIT=$(GOOS= GOARCH= go run ./cmd/featuretags --min --add=osrouter) && \ @@ -191,6 +178,8 @@ RUN mkdir -p /out && \ -e 's/ts_omit_iptables,\{0,1\}//g' \ -e 's/ts_omit_unixsocketidentity,\{0,1\}//g' \ -e 's/ts_omit_ipnbus,\{0,1\}//g' \ + -e 's/ts_omit_netstack,\{0,1\}//g' \ + -e 's/ts_omit_gro,\{0,1\}//g' \ -e 's/,$//' \ ) && \ echo "Build tags: ${TAGS}" && \ diff --git a/docs/DESIGN.md b/docs/DESIGN.md index 5c0b2e3..1b97f09 100644 --- a/docs/DESIGN.md +++ b/docs/DESIGN.md @@ -15,22 +15,26 @@ Measured flattened rootfs for the arm64 image: | Component | On-disk size | |---|---| -| `tailscale.combined` (UPX-compressed) | ~2.98 MB | +| `tailscale.combined` (UPX-compressed) | ~3.47 MB | | custom static busybox (UPX, ~100 applets) | ~218 kB | | CA certificates | ~213 kB | -| **Total extracted rootfs** | **~3.4 MB** | +| **Total extracted rootfs** | **~3.9 MB** | -(The compressed image / transfer tarball is ~3.3–4.3 MB depending on arch.) +The `tailscale.combined` figure includes `netstack` (gVisor), which adds +~0.5 MB on disk over a netstack-omitted build — a deliberate inclusion, see +[Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun). + +(The compressed image / transfer tarball is ~3.8–4.3 MB depending on arch.) | Arch | Image (compressed) | |---|---| -| amd64 | ~4.2 MB | -| arm64 | ~3.5 MB | -| arm/v7 | ~3.5 MB | +| amd64 | ~4.3 MB | +| arm64 | ~4.0 MB | +| arm/v7 | ~4.0 MB | -On a deployed RouterOS device the container consumes **~3.7 MiB of flash** +On a deployed RouterOS device the container consumes **~4.2 MiB of flash** (measured by `free-hdd-space` delta). 
Note that `du` *inside* the container -reports roughly double that (~7 MB) — that is RouterOS block-allocation +reports roughly double that (~8 MB) — that is RouterOS block-allocation rounding, **not** real usage or duplication; see [Avoiding overlayfs layer duplication](#avoiding-overlayfs-layer-duplication) for how to measure correctly. @@ -118,13 +122,13 @@ delta**, not `du`: /system/resource/print # note free-hdd-space before and after adding the container ``` -The container should consume **~3.7 MiB** of flash (e.g. 94.6 → 90.9 MiB free). +The container should consume **~4.2 MiB** of flash (e.g. 94.6 → 90.4 MiB free). Do **not** trust `du` inside the container for this. Busybox `du` reports -*allocated blocks*, and RouterOS's container store rounds a ~3 MB file up to -~6 MB of blocks — so `du -sx /` reports ~7 MB even though real flash use is -~3.7 MB. `ls -la /usr/local/bin` confirms the binary's true content size -(~3.1 MB) and that it is a single file with two symlinks (no duplication). +*allocated blocks*, and RouterOS's container store rounds the ~3.5 MB binary up +to ~7 MB of blocks — so `du -sx /` reports ~8 MB even though real flash use is +~4.2 MB. `ls -la /usr/local/bin` confirms the binary's true content size +(~3.5 MB) and that it is a single file with two symlinks (no duplication). The image itself carries the binary in exactly one layer (verified at the blob level); the inflation is purely the filesystem's block accounting. @@ -149,7 +153,8 @@ that's a separate build, not just a `--platform` change. | `advertise-routes` | Expose LAN subnets to the tailnet | | `use-exit-node` | Route the router's own traffic via a remote exit node | | `accept-routes` | Receive subnet routes from other tailnet nodes | -| DNS / MagicDNS | Resolve `*.ts.net` names | +| DNS / MagicDNS | Resolve `*.ts.net` names (resolver + resolv.conf manager). 
**Note:** serving `100.100.100.100` also requires `netstack` — see [Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun) | +| `netstack` + `gro` | gVisor userspace stack. Counter-intuitively **required** to serve MagicDNS on `100.100.100.100`, even though the router uses a real kernel TUN — see [Why netstack is required (even with a kernel TUN)](#why-netstack-is-required-even-with-a-kernel-tun) | | portmapper (NAT-PMP/PCP/UPnP) | Punch through upstream NAT | | listenrawdisco | Raw socket disco for better NAT traversal | | health | Powers `tailscale status` output | @@ -166,7 +171,6 @@ that's a separate build, not just a `--platform` change. | `cachenetmap` | **Deliberately removed** — see [Why netmap disk-caching is removed](#why-netmap-disk-caching-is-removed) | | `logtail` | Would attempt persistent log writes; wear flash. Removing it also removes stderr verbosity filtering — restored by an injected filter, see [Log verbosity filtering](#log-verbosity-filtering) | | `netlog` | Network flow logging; separate concern | -| `netstack` + `gro` | Userspace/gVisor networking; router uses kernel TUN | | `ssh` | Access via MikroTik SSH + `tailscale` CLI instead | | `linuxdnsfight` | inotify on `/etc/resolv.conf`; no systemd in container | | `networkmanager` / `resolved` / `dbus` / `sdnotify` | No systemd stack in container | @@ -226,6 +230,84 @@ the in-memory resilience (the common case) while eliminating per-netmap flash writes. Only `tailscaled.state` (written on auth / key rotation) ever touches flash. +### Why netstack is required (even with a kernel TUN) + +This is the least obvious inclusion in the build, so it is documented in full. + +`netstack` is Tailscale's embedded **gVisor userspace TCP/IP stack**. 
The +natural assumption — and what earlier versions of this build acted on — is that +a router which owns a **real kernel TUN device** (it is *not* run with +`--tun=userspace-networking`) has no use for a userspace stack, so `netstack` +(and its dependent `gro`) can be omitted to save space. That assumption is +**wrong for one specific, important path: MagicDNS.** + +**MagicDNS on `100.100.100.100` is served only by netstack.** In Tailscale +v1.98.5 the in-process listener for the Tailscale service IP +(`100.100.100.100:53`, UDP) is installed exclusively by netstack's +`handleLocalPackets`, wired into the TUN wrapper as +`PreFilterPacketOutboundToWireGuardNetstackIntercept` +(`wgengine/netstack/netstack.go`). When a packet leaves the host toward +`100.100.100.100`, this hook absorbs it into the gVisor stack, whose UDP-53 +acceptor runs the MagicDNS resolver. + +**The "engine fallback" does not actually exist.** The TUN wrapper consults a +second hook, `PreFilterPacketOutboundToWireGuardEngineIntercept`, and a comment +in `net/tstun/wrap.go` claims it "primarily handles quad-100 if netstack is not +installed." In v1.98.5 that comment is **false on Linux**: the engine +`handleLocalPackets` (`wgengine/userspace.go`) only reflects loopback on +darwin/ios/plan9 and otherwise returns `Accept` — it never touches +`100.100.100.100`. So with `ts_omit_netstack` there is **no** code that absorbs +quad-100 packets at all. + +**`dns` and `netstack` are independent tags.** The `dns` feature (which this +build opts in) links the resolver and the `/etc/resolv.conf` manager, but it has +no dependency on `netstack` and does **not** install any quad-100 transport. +The net result of `dns` on + `netstack` off is a resolver that is correctly +wired up but that **never receives any packets** — the worst kind of silent +breakage. 
Symptoms observed on the device: + +- `/etc/resolv.conf` correctly points at `100.100.100.100` (the manager works), +- but `dig anything @100.100.100.100` from inside the container **times out** + ("no servers could be reached"), +- and even tailnet-internal names fail: `ping host.tailnet.ts.net` → + `bad address` (a name that needs **no** upstream forwarding still can't + resolve, proving the listener itself is dead, not an upstream-resolver issue), +- while `ping 1.1.1.1` (a raw IP needing no DNS) works fine over the kernel data + path — confirming forwarding/exit-node connectivity is unaffected and isolating + the fault to DNS serving. + +**It also fixed a crash.** Omitting `netstack` set `buildfeatures.HasNetstack` +to a compile-time `false`, which turned the guard in +`net/tstun.invertGSOChecksum` (`if !HasNetstack { panic("unreachable") }`) into +an always-panic. That function is called on the packet-injection path used when +enabling exit-node mode, producing `panic: unreachable` and a daemon restart +loop. Enabling `netstack` makes `HasNetstack` a const `true`, so the guard +becomes dead code and the crash disappears as a side effect — fixed at the root +cause rather than patched around. + +**Cost.** Measured on arm64, a netstack-enabled build versus a netstack-omitted +one: + +| Metric | netstack omitted | netstack enabled | Delta | +|---|---|---|---| +| Extracted rootfs (flash) | ~3.42 MB | ~3.91 MB | **+0.49 MB** | +| `tailscale.combined` on disk (UPX) | ~2.99 MB | ~3.47 MB | +0.48 MB | +| Resident RAM after UPX decompress | ~12.25 MB | ~14.56 MB | **+2.31 MB** | + +The flash cost (~0.5 MB) is negligible on a 16 MB-class device. The RAM cost +(~2.3 MB resident) is the real consideration on low-memory models, but is +acceptable given that without it MagicDNS is entirely non-functional. 
The +trade is: **half a megabyte of flash to make MagicDNS work at all.** `gro` +(Generic Receive Offload) depends on `netstack` and is pulled in alongside it; +it is small and improves throughput on the netstack path. + +**Caveat for future Tailscale bumps.** This coupling (quad-100 serving living +only in netstack) is an upstream implementation detail, not a stable contract. +If a future release adds a genuine non-netstack quad-100 path — or the daemon +itself is refactored — re-test whether `netstack` can be dropped again. The +canary is simple: from inside the container, `dig google.com @100.100.100.100` +must return answers and `ping host.tailnet.ts.net` must resolve. + ### Log verbosity filtering Upstream `tailscaled` embeds verbosity tags (`[v1]`, `[v2]`, …) inside its log