Systems debugging for production reality.

I help AI, IoT, and edge teams debug failures that sit between application code, runtime dependencies, networks, devices, and customer environments.

Failure boundary: App code / Runtime / Toolchain / Network / Device / Customer site

Discuss a production issue

Read case notes

Premise

Production failures rarely stay in one layer.

The most recent code change is not always the cause. What looks like tunnel instability may be a LAN problem. A database startup failure may be a runtime dependency problem. An agent failure may be a toolchain compatibility problem. The work starts by locating the failure boundary that everyone assumed was already known.

Work

How I usually help

Most work starts with a focused diagnostic: define the failure boundary, test the highest-signal hypotheses, and turn the result into something the team can reuse.

Failure Boundary Review

A focused diagnostic for production or customer-site failures that do not fit the usual checklist.

Demo-to-Production Readiness

A review for AI, IoT, and edge teams preparing to move a working demo into a real customer environment.

See how engagements work

Case Notes

Anonymized failure patterns

These notes are anonymized, but the failure patterns are real: a toolchain mismatch, a replaced runtime dependency, a LAN-level IP conflict, and an SSH ACL rollout mismatch.

Agent failure was not a Git change

An agent workflow stopped working, but recent repository changes did not explain the failure.

Looked like: Code regression
Actually: Outdated toolchain version
Layer: Agent runtime / Toolchain

Read the full note

Database startup failure was not database configuration

A customer-site MySQL service failed to start, and normal database-level checks did not explain the issue.

Looked like: MySQL configuration issue
Actually: Replaced shared library
Layer: Linux runtime dependencies

Read the full note

Intermittent connectivity was not tunnel instability

An edge device appeared unstable when reached through remote access tooling, but the actual failure was inside the customer's local network.

Looked like: VPN / tunnel instability
Actually: LAN IP conflict
Layer: Network / Customer site

Read the full note

Tailscale SSH failure was not node reachability

An edge node was reachable over the tailnet, but Tailscale SSH was still denied: SSH access was controlled by a separate ACL path from network reachability.

Looked like: SSH or network reachability failure
Actually: Tailscale SSH ACL mismatch
Layer: Tailnet access policy

Read the full note

Read all case notes

Debugging surface

Where assumptions usually break

App code / Runtime / Toolchain / Network / Device / Customer site

View debugging surface

Contact

Have a production issue that does not fit the usual checklist?

Send the symptom, environment, and what has already been tried.

Start with a diagnostic note