Troubleshooting

Fast incident triage order

Run checks in this order to narrow faults quickly:

  1. cellmgr cell list --view merged
  2. cellmgr apply --dry-run --all
  3. cellctl list -T and cellctl stats -T
  4. host syslog for supervise output and restart patterns
  5. cellmgr cell shell <name> for in-cell diagnostics
  6. compare desired and rendered configuration files

If you are new to Unix operations, do not jump straight into in-cell shell debugging. Run the list, dry-run, and runtime checks (steps 1 to 3) first.
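
The host-side checks (steps 1 to 3) can be wrapped in a small helper so they always run in the same order. This is a minimal sketch, not part of cellmgr: the function name triage is hypothetical, and it assumes each command exits non-zero when it detects a problem, which this section does not confirm. Steps 4 to 6 are left manual.

```shell
# Hypothetical helper: run the host-side triage commands in the documented order.
# It keeps going after a failure so one pass captures every check.
triage() {
    triage_rc=0
    for step in \
        "cellmgr cell list --view merged" \
        "cellmgr apply --dry-run --all" \
        "cellctl list -T" \
        "cellctl stats -T"
    do
        printf '== %s\n' "$step"
        # Unquoted on purpose: each entry is a fixed command line to be word-split.
        $step || { printf '!! failed: %s\n' "$step" >&2; triage_rc=1; }
    done
    return "$triage_rc"
}
```

Calling triage 2>&1 | tee triage.log keeps a record of the pass. Continuing past a failed step is a design choice; stop at the first failure instead if later output would only add noise.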

Compare desired vs runtime config

Inspect these paths when drift is suspected:

  • desired: /etc/cellmgr/<name>.cell
  • runtime: /var/cellmgr/cells/<name>/cell.conf

If only policy values differ, apply may report a warning without restarting anything. Pass --restart-changed to apply when the change should trigger a restart.
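
A quick way to compare the two files is a unified diff. The sketch below assumes both files are line-oriented text that can be compared directly; if the rendered file is reordered or normalized, expect noise in the output. The function name cell_drift and the two override variables (CELLMGR_DESIRED_DIR, CELLMGR_RUNTIME_DIR) are hypothetical and exist only so the helper can be pointed at test directories.

```shell
# Hypothetical helper: diff the desired manifest against the rendered runtime config.
# Exit status: 0 = identical, 1 = files differ, 2 = a file could not be read.
cell_drift() {
    name=$1
    # Defaults follow the paths listed above; the override variables are assumptions.
    desired="${CELLMGR_DESIRED_DIR:-/etc/cellmgr}/$name.cell"
    runtime="${CELLMGR_RUNTIME_DIR:-/var/cellmgr/cells}/$name/cell.conf"
    for f in "$desired" "$runtime"; do
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 2; }
    done
    # Lines prefixed with '-' come from the desired file, '+' from the runtime file.
    diff -u "$desired" "$runtime"
}
```

For example, cell_drift web prints nothing and returns 0 when the two files match.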

Common failure patterns

  • dependency cycle or missing dependency in CELL_DEPENDS_ON
  • invalid volume mount target or overlapping mount paths
  • strict TSV schema mismatch in machine-output consumers
  • restore blocked because a volume is still mounted or the cell is still running
  • permission errors because the command was run from a non-root shell without doas
  • typo in resource name or wrong scope (desired vs runtime)
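
For the first pattern, a dependency cycle can be checked outside cellmgr with the standard tsort utility. This is a sketch under assumptions: the function name dep_order is hypothetical, and it expects "cell dependency" pairs on stdin, one pair per line. How to extract those pairs from CELL_DEPENDS_ON is left out because the manifest syntax is not shown in this section. It does not detect missing dependencies.

```shell
# Hypothetical helper: read "cell dependency" pairs on stdin and print a
# start order with dependencies first, or report a suspected cycle.
dep_order() {
    err=$(mktemp) || return 2
    # tsort prints the left item of each pair before the right one,
    # so each pair is flipped to "dependency cell" first.
    order=$(awk 'NF == 2 { print $2, $1 }' | tsort 2>"$err")
    # tsort implementations differ in exit status on cycles, so any
    # diagnostic on stderr is treated as a failure.
    if [ -s "$err" ]; then
        echo "dependency cycle suspected:" >&2
        cat "$err" >&2
        rm -f "$err"
        return 1
    fi
    rm -f "$err"
    printf '%s\n' "$order"
}
```

For example, printf 'web db\ndb store\n' | dep_order lists store, then db, then web. Cells without any dependency do not appear in the output because they contribute no pair.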

Recovery playbooks

  • Config drift: fix manifests, then run cellmgr apply --all
  • Apply plan issues: test with cellmgr cell plan run <name>
  • Storage recovery: stop the affected workloads, then perform the guarded restore with --yes
  • IPC/UI errors: verify cellmgr ipc serve --stdio and reconnect from CellUI