# Research Findings: Consolidated homelab secrets and access management

## Phase Contract

Allowed reads:
- `tickets/artifacts/2026-05-18-consolidated-homelab-secrets-management/01-questions.md`
- current repo docs/runbooks/systems/tickets relevant to access model
- built-in non-web knowledge for option comparison, clearly marked

Do not read:
- secret values under `.ssh/` or `.tokens/`
- future design/plan artifacts

Will write:
- `tickets/artifacts/2026-05-18-consolidated-homelab-secrets-management/02-research.md`

## Q1: What current secrets/access mechanisms are already represented in this repo?

### Findings

- `.pi/ssh/hosts.json` defines named SSH aliases and blocks raw hosts by policy. Current aliases include `nextcloud-vm` for `piagent@192.168.0.110` and `amp-gameserver` for `piagent@192.168.0.90`, both with write/bash allowed and destructive confirmation enabled.
- `.pi/ssh/README.md` documents the repo-local SSH model: stable aliases, policy per host, raw hosts disabled unless deliberately enabled, and risky-command confirmation.
- `runbooks/configure-assistant-ssh-access.md` defines current SSH safety rules: never paste private keys, only install public keys, prefer dedicated `piagent`, prefer guest VM access before Proxmox host access, and require confirmation for destructive Proxmox actions.
- `runbooks/configure-assistant-ssh-access.md` documents a Debian/Ubuntu grant pattern: create `piagent`, lock the password, install public key into `authorized_keys`, and optionally add sudo/passwordless sudo.
- `scripts/bootstrap-assistant-ssh-user.sh` automates assistant user bootstrap: creates/updates user, locks password, installs public key, can optionally add passwordless sudo, and writes SSH hardening drop-in where supported.
- `systems/inventory.md` documents the Nextcloud VM at `192.168.0.110` and says SSH access works as `piagent` using `.ssh/piagent_homelab`.
- `runbooks/nextcloud-operations.md` documents Nextcloud accounts `ncadmin`, `deeso`, and `piagent`; passwords are not stored in repo and are stored on the VM at `/root/nextcloud-credentials.env`.
- `runbooks/nextcloud-talk-assistant-bridge.md` documents the Talk bridge app password storage: `NEXTCLOUD_APP_PASSWORD` lives only on the VM in `/etc/nextcloud-talk-assistant/env`, not in the repo.
- `scripts/nextcloud_pi_request_queue.py` uses local SSH key/known-host files and reads remote Nextcloud config/env over SSH so Nextcloud secrets stay on the VM.
- `docs/server-change-log.md` records prior secret/access decisions, including storing Nextcloud app password only on the VM and rollback by revoking the app password.
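
Illustrative sketch (not from repo): the alias-only policy described above could be enforced roughly as follows. The JSON layout and the field names (`allowRawHosts`, `hosts`, `target`, `allowWrite`, `allowBash`, `confirmDestructive`) are hypothetical; the real `.pi/ssh/hosts.json` schema was not inspected for this example.

```python
import json

# Hypothetical schema for illustration only; not the actual hosts.json layout.
EXAMPLE_CONFIG = json.loads("""
{
  "allowRawHosts": false,
  "hosts": {
    "nextcloud-vm": {"target": "piagent@192.168.0.110", "allowWrite": true,
                     "allowBash": true, "confirmDestructive": true},
    "amp-gameserver": {"target": "piagent@192.168.0.90", "allowWrite": true,
                       "allowBash": true, "confirmDestructive": true}
  }
}
""")


def resolve_target(config: dict, requested: str) -> dict:
    """Return the policy entry for a named alias; refuse raw hosts by default."""
    hosts = config.get("hosts", {})
    if requested in hosts:
        return hosts[requested]
    if config.get("allowRawHosts", False):
        # In this sketch, a deliberately enabled raw host gets read-only policy.
        return {"target": requested, "allowWrite": False,
                "allowBash": False, "confirmDestructive": True}
    raise PermissionError(
        f"'{requested}' is not a named alias and raw hosts are disabled"
    )
```

With this shape, `resolve_target(EXAMPLE_CONFIG, "nextcloud-vm")` returns the alias policy, while a raw `user@ip` string raises `PermissionError`.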

## Q2: What access-management needs recur across known systems?

### Findings

- `AGENTS.md` requires remote administration capabilities: SSH command execution, remote file read/write, upload/download, service management, and controlled Proxmox management.
- `AGENTS.md` infrastructure guidance prefers dedicated VMs, snapshots/backups before major changes, SSH key auth, restricted/auditable admin access, and avoiding public exposure of management interfaces.
- Existing Nextcloud and AMP work both require named SSH aliases, non-interactive key auth, service/container management, backups before changes, verification, and rollback notes.
- `tickets/active/2026-05-17-nextcloud-talk-controlled-pi-backend.md` requires importing approved Talk requests without local Nextcloud secrets, operator-controlled execution, and auditability through tickets/docs/logs.
- `systems/network-plan.md` expects future segmentation and stricter access boundaries for Proxmox, OPNsense, Home Assistant, IoT, servers, guest networks, and management interfaces.
- Older Heimdall notes in `info/agents/heimdall/MEMORY.md` mention expected OPNsense API credentials via BWS pending setup, but this appears aspirational/untested rather than current implementation.
- Home Assistant and network/security monitoring roles in `docs/assistant-role-architecture.md` will eventually need distinct permissions, careful secret handling, and safety boundaries.

## Q3: What practical roles could candidate tools play?

### Findings

#### Bitwarden / Vaultwarden

Non-repo knowledge:
- Best fit: human-friendly credential vault for passwords, app passwords, recovery codes, API tokens, secure notes, and possibly SSH key passphrases/private keys if policy allows.
- Bitwarden cloud reduces self-hosting burden; Vaultwarden is a lightweight self-hosted Bitwarden-compatible server commonly used in homelabs.
- Not ideal as the sole dynamic machine-to-machine secrets broker; it is primarily a password manager.

Practical role here:
- Primary human/admin vault.
- Store Nextcloud app passwords, Proxmox/OPNsense/Home Assistant token values or metadata, break-glass credentials, recovery codes, and service-account notes.
- Initial assistant access can be user-mediated rather than direct vault access.

#### SOPS / age

Non-repo knowledge:
- Good for Git-tracked encrypted structured secrets such as YAML/JSON/env files.
- Good for reproducible deployment configs.
- Not a human password manager and does not provide leases, web UI, audit trails, or dynamic revocation.

Practical role here:
- Optional later complement for encrypted repo-managed deployment secrets.
- Age private keys become high-value secrets and need their own recovery plan.

#### OpenBao / HashiCorp Vault

Non-repo knowledge:
- Best fit: machine-to-machine secrets, dynamic credentials, leases, audit logs, PKI, and SSH certificate authority workflows.
- More operationally complex than Vaultwarden: unseal/recovery keys, storage backend, TLS, backups, audit logs, and lifecycle management.

Practical role here:
- Later-stage option if short-lived SSH credentials, dynamic service tokens, or central audit logs justify the burden.
- Likely overbuilt for the first pilot.

#### SSH certificates

Non-repo knowledge:
- Hosts trust a CA public key; users receive signed certs with principals, validity periods, and restrictions.
- Strong fit for temporary/expiring SSH access.
- Revocation can rely on short expiry for many cases; urgent revocation needs KRLs or trust removal.

Practical role here:
- Strong future candidate for assistant SSH lifecycle.
- Requires host configuration and CA key protection.
- Could be piloted on one non-critical host after current key-based workflow is documented.
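
Illustrative sketch (non-repo knowledge): issuing a short-lived user certificate with OpenSSH comes down to one `ssh-keygen -s` call. The helper below only builds the argument list and runs nothing. The function name, file paths, identity string, and one-hour default validity are assumptions for illustration. The target host would also need `TrustedUserCAKeys` set in its `sshd_config`, which is not shown.

```python
def build_sign_command(ca_key_path, user_pubkey_path, identity, principals,
                       validity="+1h"):
    """Build an ssh-keygen argv that signs a user public key with a CA key.

    -s  CA private key used for signing
    -I  key identity recorded in the certificate and in sshd logs
    -n  comma-separated principals (login names) the certificate is valid for
    -V  validity interval; "+1h" means valid from now for one hour
    """
    if not principals:
        raise ValueError("at least one principal is required")
    return [
        "ssh-keygen",
        "-s", ca_key_path,
        "-I", identity,
        "-n", ",".join(principals),
        "-V", validity,
        user_pubkey_path,
    ]
```

Returning an argv list rather than a shell string keeps quoting out of the picture and lets the operator review the exact command before running it.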

#### Tailscale ACLs

Non-repo knowledge:
- Controls network reachability by identity/device; not a secrets vault.
- Useful to ensure assistant/user devices can reach only approved hosts/ports.

Practical role here:
- Network boundary layer complementing secrets management.
- Helps keep vault and management interfaces LAN/VPN-only.

#### Per-service API tokens/app passwords

Non-repo knowledge:
- Service-scoped tokens are generally better than shared admin passwords.
- Should be named, scoped, documented, rotated, and revocable independently.

Practical role here:
- One token per integration/assistant role.
- Store values in vault, not git.
- Document owner, scope, creation date, rotation date, storage location, and revocation path.
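
Illustrative sketch (not from repo): the documentation fields listed above map onto a small metadata record that never holds the secret value. The class name, field names, and the 14-day warning window are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TokenRecord:
    """Metadata about a token or app password; the secret value is never stored."""
    name: str
    owner: str
    scope: str
    created: date
    rotate_by: date
    storage_location: str   # where the value lives, e.g. a vault item or env file path
    revocation_path: str    # how to revoke, e.g. a runbook section


def rotation_due(record: TokenRecord, today: date, warn_days: int = 14) -> bool:
    """Return True once today falls inside the warning window before rotate_by."""
    return today >= record.rotate_by - timedelta(days=warn_days)
```

Such records could live in git next to the runbooks, because they describe where a secret is and who owns it, not what it is.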

## Q4: What are operational requirements for hosting a secrets manager in a Proxmox LXC or VM?

### Findings

- Repo constraints prefer dedicated VMs/LXCs, snapshots/backups before risky changes, no public management exposure, and server change logging.
- Non-repo knowledge: a small dedicated VM gives a stronger isolation boundary than an LXC and is often the safer default for a high-value vault service.
- Non-repo knowledge: an unprivileged LXC is lighter and can be acceptable for a low-risk homelab Vaultwarden pilot if the user accepts shared-kernel/container risk.
- Backups are mandatory: encrypted backups, off-host/offline copy, retention policy, snapshot before upgrades, and regular restore tests.
- Vaultwarden-style hosting requires backing up database, config, attachments/sends/icons if used, admin token material, and SMTP/2FA settings.
- Vault/OpenBao-style hosting additionally requires protecting unseal/recovery keys separately; storage backup is insufficient if recovery material is lost.
- TLS is required for any login UI. Practical options include LAN/VPN-only reverse proxy, internal DNS name such as `vault.home.arpa`, or DNS-challenge certificates without public exposure.
- Network exposure should default to LAN/Tailscale only. Public port forwarding should require separate security review.
- Monitoring should include service health, failed login spikes if available, backup success, certificate expiry, disk usage, update availability, and restore-test status.
- Admin recovery must be documented outside the vault: recovery codes, backup decryption keys, unseal keys if any, 2FA recovery, and restore steps.
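
Illustrative sketch (not from repo): the backup and restore-test monitoring items above could be reduced to a freshness check. The function name and the thresholds (one day for backups, 90 days for restore tests) are placeholder assumptions, not agreed policy.

```python
from datetime import datetime, timedelta


def backup_findings(last_backup, last_restore_test, now,
                    max_backup_age=timedelta(days=1),
                    max_restore_test_age=timedelta(days=90)):
    """Return a list of human-readable problems; an empty list means healthy."""
    findings = []
    if last_backup is None or now - last_backup > max_backup_age:
        findings.append("backup is missing or stale")
    if last_restore_test is None or now - last_restore_test > max_restore_test_age:
        findings.append("restore test is missing or overdue")
    return findings
```

Treating a missing timestamp as a finding means a vault that has never been backed up or restore-tested fails the check instead of passing silently.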

## Q5: What grant/revoke/rotate workflows should exist?

### Findings

#### Assistant SSH

Current grant pattern from repo:
1. Generate/select assistant key; never expose private key.
2. Install only public key.
3. Create/update dedicated `piagent`.
4. Lock password.
5. Add least required sudo/admin permissions.
6. Add `.pi/ssh/hosts.json` alias with destructive confirmation.
7. Verify non-interactive SSH.
8. Log change in `docs/server-change-log.md`.

Needed revoke pattern:
1. Remove assistant public key from `authorized_keys`.
2. Remove/disable sudoers entry if present.
3. Lock/expire/remove `piagent` account if appropriate.
4. Remove/disable `.pi/ssh/hosts.json` alias.
5. Verify login fails.
6. Log result.
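
Illustrative sketch (not from repo): the public-key install and removal steps are simple text transformations on `authorized_keys` content. The helpers below match entries by key body so that a changed comment does not hide a key. Function names are assumptions, and option prefixes containing quoted spaces are not handled.

```python
def _key_blob(line):
    """Return the base64 key body of an authorized_keys line, or None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split()
    for i, part in enumerate(parts):
        # The key type token is followed by the base64-encoded key body.
        if part.startswith(("ssh-", "ecdsa-", "sk-")) and i + 1 < len(parts):
            return parts[i + 1]
    return None


def add_key(content, public_key):
    """Append public_key unless an entry with the same key body already exists."""
    blob = _key_blob(public_key)
    if blob is None:
        raise ValueError("not a recognizable public key line")
    if any(_key_blob(line) == blob for line in content.splitlines()):
        return content
    if content and not content.endswith("\n"):
        content += "\n"
    return content + public_key.strip() + "\n"


def remove_key(content, public_key):
    """Drop every entry whose key body matches public_key; keep all other lines."""
    blob = _key_blob(public_key)
    if blob is None:
        raise ValueError("not a recognizable public key line")
    kept = [line for line in content.splitlines() if _key_blob(line) != blob]
    return "".join(line + "\n" for line in kept)
```

Editing the file is only part of revocation; the workflow still has to confirm afterwards that login with the removed key fails.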

Needed rotate pattern:
1. Generate new keypair.
2. Add new public key to target.
3. Verify new key works.
4. Remove old key.
5. Update local alias/config if needed.
6. Verify old key fails.
7. Log rotation.
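
Illustrative sketch (not from repo): the ordering in the rotate pattern matters most, since the old key is removed only after the new key is proven to work. The function and its injected callables are hypothetical; each callable stands in for a real step such as an SSH login test.

```python
def rotate_key(install_new, verify_new, remove_old, verify_old_rejected, log):
    """Run a key rotation in a safe order and report whether it completed.

    All arguments are callables supplied by the caller:
    install_new() and remove_old() perform changes on the target;
    verify_new() and verify_old_rejected() return True on success;
    log(message) records the outcome.
    """
    install_new()
    if not verify_new():
        # Stop before touching the old key so access is not lost.
        log("rotation aborted: new key did not authenticate; old key left in place")
        return False
    remove_old()
    if not verify_old_rejected():
        log("rotation incomplete: old key still authenticates")
        return False
    log("rotation complete: new key works, old key rejected")
    return True
```

Aborting before `remove_old()` when the new key fails avoids locking the assistant out of the host mid-rotation.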

#### App passwords/API tokens

Current Nextcloud pattern:
- App passwords are not stored in repo.
- Secrets live in remote service env files such as `/etc/nextcloud-talk-assistant/env`.
- Rollback includes stopping services and revoking the app password.

Needed grant/revoke/rotate pattern:
1. Prefer dedicated least-privilege service account.
2. Create per-purpose token/app password.
3. Store only in approved secret location/vault, not git.
4. Restrict file permissions.
5. Restart dependent service.
6. Verify behavior.
7. Log token purpose/account/storage path/revocation owner, not secret value.
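
Illustrative sketch (not from repo): the "restrict file permissions" step could be done by creating the env file with mode `0600` from the start rather than tightening it afterwards. The function name and the `.tmp` suffix are assumptions, and root ownership and service restart are left to the runbook.

```python
import os


def write_secret_env(path, values):
    """Write KEY=value lines to path with owner-only permissions (0600)."""
    for key, value in values.items():
        if "\n" in key or "\n" in value:
            raise ValueError("keys and values must not contain newlines")
    body = "".join(f"{key}={value}\n" for key, value in values.items())

    tmp_path = f"{path}.tmp"
    # Create the temp file with 0600 so the secret is never world-readable.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)
    # Enforce the mode even if a stale temp file with looser bits already existed.
    os.chmod(tmp_path, 0o600)
    # Replace the target in one step so readers never see a half-written file.
    os.replace(tmp_path, path)
```

The function deliberately logs nothing, in line with recording only purpose, account, and storage path rather than the secret value.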

#### Emergency break-glass

- Current ticket identifies break-glass as needed but no runbook exists.
- Needed: user-controlled emergency admin credentials stored outside the assistant’s normal access path, vault recovery material backed up offline or separately, documented steps, explicit user action for use, and post-use rotation.

## Q6: What risks and failure modes should design address?

### Findings

- Vault lockout: lost master password/2FA/recovery keys can block all secrets.
- Vault compromise: centralization creates a high-value target.
- Untested backup/restore: especially important because Nextcloud docs already identify a backup gap.
- Stale assistant SSH keys: current key-based model needs age tracking and removal.
- Overbroad `piagent` privileges: passwordless sudo or broad write/bash can become root-equivalent.
- Leaked repo config: `.pi/ssh/hosts.json` exposes host/user/IP metadata even without secrets.
- Secret values committed by mistake: existing runbooks repeatedly forbid private keys/app passwords/passwords in repo.
- Public exposure: repo guidance prefers LAN/Tailscale and avoiding management-interface exposure.
- Loss of 2FA/admin access for vault, Nextcloud, Tailscale, Proxmox, OPNsense.
- Assistant direct vault access: useful later but expands blast radius.
- Revocation not verified: removing a token/key is incomplete unless failure is tested.
- Shared service credentials: reuse increases blast radius and complicates rotation.
- Hypervisor access risk: docs prefer guest VM administration and cautious Proxmox permissions.
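
Illustrative sketch (not from repo): the "secret values committed by mistake" risk could be partly caught by a pre-commit style text scan. The two patterns below are rough heuristics chosen for this example; they will miss many secret formats and are no substitute for a dedicated scanner.

```python
import re

# Matches PEM/OpenSSH private key headers such as "-----BEGIN OPENSSH PRIVATE KEY-----".
PRIVATE_KEY_MARKER = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")
# Matches env-style assignments whose name contains PASSWORD, TOKEN, or SECRET
# and whose value is non-empty.
SECRET_ASSIGNMENT = re.compile(
    r"^\s*[A-Z0-9_]*(?:PASSWORD|TOKEN|SECRET)[A-Z0-9_]*\s*=\s*\S+"
)


def find_secret_indicators(text):
    """Return (line_number, reason) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PRIVATE_KEY_MARKER.search(line):
            findings.append((lineno, "private key block"))
        elif SECRET_ASSIGNMENT.match(line):
            findings.append((lineno, "secret-looking assignment"))
    return findings
```

Assignments with an empty value are ignored on purpose, so documentation that names a variable such as `NEXTCLOUD_APP_PASSWORD=` without a value does not trigger a finding.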

## Q7: What is the smallest safe pilot?

### Findings

Option A — Disposable Debian/Ubuntu VM or LXC:
- Proves SSH grant/revoke/rotate lifecycle with no production service impact.
- Can snapshot/destroy after test.
- Does not prove app-password/API-token lifecycle.

Option B — AMP game server SSH lifecycle:
- Uses a real known host already in `.pi/ssh/hosts.json`.
- Proves assistant key rotation/revocation on a real host.
- Risk: can interrupt active assistant access to AMP.

Option C — Nextcloud app-password rotation:
- Proves service app-password lifecycle.
- Risk: can interrupt the active Talk assistant bridge.

Smallest safe pilot shape:
1. One non-critical/disposable host.
2. One assistant SSH key.
3. One documented grant/revoke/rotate workflow.
4. One repo alias with raw hosts disabled.
5. One change-log entry.
6. No secret values committed or read.

## Cross-Cutting Observations

- Current access is mostly key-based SSH plus remote service env files.
- Secrets are documented by location and purpose, not value.
- Vaultwarden/Bitwarden looks like the practical first central vault for human/admin secrets.
- SSH certificates/OpenBao look like stronger future mechanisms for short-lived automated access but are probably not MVP.
- A Proxmox-hosted Vaultwarden LXC/VM is plausible, but backup/restore and recovery design should come before deployment.

## Proxmox API Discovery Notes — 2026-06-05

A read-only API probe using `.tokens/proxmox.env` authenticated successfully once the token was corrected, and discovery succeeded after the token's permissions were adjusted.

Facts discovered:
- Proxmox host: `192.168.0.88`
- Version: `9.1.6`, release `9.1`
- Node: `buntbox01`, online
- Node hardware visible through API:
  - CPU: Intel Core i5-8500, 6 cores
  - RAM: ~33.4 GB total, ~19.4 GB used at discovery time
  - Root filesystem: ~100.9 GB total, ~10.5 GB used
- Existing LXC containers:
  - `100` — `yams`, running, 2 CPUs, 4 GB max memory, ~30 GB disk
  - `101` — `AMP`, running, 4 CPUs, ~32 GB max memory, ~210 GB disk
- Existing QEMU VMs: none visible.
- Pool: `pooly`.
- Network:
  - Physical interface `eno1`
  - Bridge `vmbr0`, static `192.168.0.88/24`, gateway `192.168.0.1`, bridge port `eno1`
- Storage:
  - `local-lvm`: content `rootdir,images`, ~374.5 GB total, ~223.6 GB used, ~150.9 GB available
  - `local`: content `iso,vztmpl,backup`, ~100.9 GB total, ~10.5 GB used, ~85.1 GB available
  - `m1`: content `images`, ~2.95 TB total, ~812 GB used, ~2.14 TB available
  - `m2`: content `images`, ~100.9 GB total, ~10.5 GB used, ~85.1 GB available
  - `m3`: content `images`, ~100.9 GB total, ~10.5 GB used, ~85.1 GB available
- Available local templates/ISO:
  - `local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst`
  - `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst`
  - `local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst`
  - `local:iso/openmediavault_7.4.17-amd64.iso`

Permission note:
- The current token now has broad privileges visible at `/`, including VM allocation/config/power/snapshot, datastore allocation, sys modify/power, and permissions/user modification.
- This is sufficient for implementation, but broader than ideal long-term. After bootstrap, permissions should be reduced or the token revoked/replaced with narrower operational tokens.

Implication:
- Discovery now supports planning a disposable LXC pilot using the available Debian/Ubuntu templates, `vmbr0`, and a storage that accepts container root disks; of the storages listed above, only `local-lvm` reports `rootdir` content.
- A dedicated Vaultwarden VM still needs either an ISO/cloud-init image strategy or agreement to deploy as LXC despite the safer-VM preference; currently visible VM ISO inventory does not include a Debian/Ubuntu server ISO.
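
Illustrative sketch (non-repo knowledge applied to the facts above): Proxmox places container root filesystems on storages that allow the `rootdir` content type and VM disks on storages that allow `images`. The table below copies the content types and approximate free space from the discovery notes; the helper name and the minimum-free-space values are assumptions.

```python
# Content types and approximate free space (GB) copied from the discovery notes.
STORAGES = {
    "local-lvm": {"content": {"rootdir", "images"}, "avail_gb": 150.9},
    "local": {"content": {"iso", "vztmpl", "backup"}, "avail_gb": 85.1},
    "m1": {"content": {"images"}, "avail_gb": 2140.0},
    "m2": {"content": {"images"}, "avail_gb": 85.1},
    "m3": {"content": {"images"}, "avail_gb": 85.1},
}


def storages_for(content_type, min_free_gb, storages=STORAGES):
    """Return sorted storage names that allow content_type and have enough free space."""
    return sorted(
        name
        for name, info in storages.items()
        if content_type in info["content"] and info["avail_gb"] >= min_free_gb
    )
```

Under these figures, `storages_for("rootdir", 8)` yields only `local-lvm`, while `storages_for("images", 20)` yields `local-lvm`, `m1`, `m2`, and `m3`. An LXC pilot is therefore limited to `local-lvm` unless another storage is reconfigured.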

## Open Areas

- Storage pools, bridges, templates/ISOs, and existing VMID/CTID usage were not discoverable with the original token permissions; the 2026-06-05 Proxmox API discovery notes above now cover them.
- Whether Proxmox backup storage is already configured and reliable remains unknown.
- The first pilot target is chosen as a disposable Debian/Ubuntu guest, but whether it runs as a VM or an LXC, and the concrete Proxmox allocation details, are still open.
- Assistant direct vault access is deferred to a later authorization/security design.
