Debugging Surface
Most misdiagnoses of production failures come from searching too narrow an area. This is the surface I usually inspect before accepting the first explanation.
| Layer | What commonly fails | First checks |
|---|---|---|
| App code | regressions, config assumptions, feature flags | recent changes, config diffs, feature flag state, minimal reproduction |
| Runtime | process env, permissions, linked libraries, system packages | environment variables, service startup, linked-library checks, package baseline |
| Toolchain | CLI/SDK versions, API behavior, model/tool compatibility | installed versions, expected behavior, minimal command, compatibility notes |
| Network | IP conflicts, routing, DNS, tunnel path, firewall rules | local reachability, IP/MAC consistency, ARP ownership, route/DNS checks |
| Device | identity, local state, power, storage, clock, peripherals | device identity, disk/power state, local logs, clock drift, peripheral status |
| Customer site | DHCP, unmanaged changes, access limits, environment drift | DHCP range, access policy, recent site changes, environment baseline |
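As one concrete instance of the Network and Customer site rows, the check below tests whether a device's static address falls inside the site's DHCP pool, which is one way an IP conflict can arise. It is a minimal sketch using Python's standard `ipaddress` module. The function name `static_ip_in_dhcp_pool` and the example addresses are hypothetical and not taken from any real site.

```python
import ipaddress


def static_ip_in_dhcp_pool(static_ip: str, pool_start: str, pool_end: str) -> bool:
    """Return True if static_ip lies inside the inclusive DHCP pool range."""
    ip = ipaddress.ip_address(static_ip)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    if start > end:
        raise ValueError("pool_start must not be greater than pool_end")
    return start <= ip <= end


# Hypothetical values for illustration only.
print(static_ip_in_dhcp_pool("192.168.1.150", "192.168.1.100", "192.168.1.200"))
print(static_ip_in_dhcp_pool("192.168.1.20", "192.168.1.100", "192.168.1.200"))
```

A `True` result does not prove a conflict; it only says the static address is eligible to be handed out by DHCP, which is worth ruling out before blaming another layer.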
The point is not to check every layer every time. The point is to avoid getting trapped in the wrong layer.
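One way to keep that discipline is to write the surface down as data and note which layers have been looked at before an explanation is accepted. The sketch below is illustrative only: the `SURFACE` dictionary mirrors the first-checks column of the table, and `unexamined_layers` is a hypothetical helper, not part of any existing tool.

```python
# First checks per layer, copied from the table in this section.
SURFACE = {
    "App code": ["recent changes", "config diffs", "feature flag state", "minimal reproduction"],
    "Runtime": ["environment variables", "service startup", "linked-library checks", "package baseline"],
    "Toolchain": ["installed versions", "expected behavior", "minimal command", "compatibility notes"],
    "Network": ["local reachability", "IP/MAC consistency", "ARP ownership", "route/DNS checks"],
    "Device": ["device identity", "disk/power state", "local logs", "clock drift", "peripheral status"],
    "Customer site": ["DHCP range", "access policy", "recent site changes", "environment baseline"],
}


def unexamined_layers(examined):
    """Return the layers, in table order, that have not been looked at yet."""
    unknown = set(examined) - set(SURFACE)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return [layer for layer in SURFACE if layer not in examined]


# Example: only the app and runtime layers have been inspected so far.
for layer in unexamined_layers({"App code", "Runtime"}):
    print(f"{layer}: {', '.join(SURFACE[layer])}")
```

The helper does not require every layer to be checked. It only makes the unexamined ones visible, so skipping a layer becomes a decision rather than an oversight.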