# Design: Consolidated Homelab Secrets and Access Management

## Current State

Homelab access is currently handled with a mix of repo-local SSH aliases, dedicated `piagent` users, local SSH keys, service app passwords, remote environment files, and manually documented runbooks.

Known current patterns:

- `.pi/ssh/hosts.json` defines named SSH targets such as `nextcloud-vm` and `amp-gameserver`, with raw hosts disabled and destructive-command confirmation enabled.
- Assistant SSH access uses dedicated `piagent` accounts, public-key authentication, locked passwords, and optional sudo privileges.
- Secret values are intentionally not stored in git.
- Nextcloud credentials and app passwords are stored on the relevant VM or service host, not in this repository.
- Existing runbooks document how to create assistant SSH users and avoid exposing private keys.
- Server-side operational changes are expected to be logged in `docs/server-change-log.md`.

This model is workable for a small number of hosts, but it is increasingly ad hoc as access expands across Nextcloud, AMP/game servers, Proxmox, OPNsense, Home Assistant, client machines, API tokens, and future services.

## Desired End State

The homelab should have a centralized, recoverable, auditable secrets and access-management model that supports:

- Human-friendly secure storage for passwords, recovery codes, app passwords, API tokens, and credential metadata.
- Controlled assistant access with least privilege, clear grant/revoke/rotate procedures, and user approval for sensitive secret release.
- LAN and Tailscale access to the vault, without public internet exposure by default.
- A safe bootstrap path that does not require migrating every credential immediately.
- Backup and restore procedures sufficient to avoid vault lockout or permanent credential loss.
- A first low-risk pilot proving SSH access lifecycle on a disposable Debian/Ubuntu LXC or VM.

## Design Decisions

1. **Primary vault: Vaultwarden/Bitwarden-compatible vault**
   - Use Vaultwarden (a Bitwarden-compatible server) with Bitwarden clients as the primary human/admin secrets vault.
   - Store passwords, app passwords, API tokens, recovery codes, break-glass notes, SSH credential metadata, and operational secure notes there.
   - Do not make Vaultwarden the sole long-term broker for machine-to-machine or dynamic secrets; it is built around human-managed credentials. OpenBao/Vault, SSH certificates, SOPS/age, or Tailscale ACLs may be added later if that need arises.

2. **Hosting model: dedicated Proxmox VM**
   - Deploy Vaultwarden in a dedicated Proxmox VM rather than an LXC.
   - Rationale: the vault is a high-value service, and a dedicated VM provides a stronger isolation boundary than a shared-kernel container.
   - The VM should be treated as critical infrastructure, with snapshots before major changes and documented backup/restore procedures.

3. **Network exposure: LAN and Tailscale only**
   - The vault should be reachable from the LAN and over Tailscale.
   - It should not be publicly exposed via port forwarding or public reverse proxy without a separate security review.
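   - TLS is still required even though the vault is not public: serve the vault UI/API on an internal hostname with an appropriate certificate strategy. The Vaultwarden web vault depends on the browser Web Crypto API, which browsers only expose in secure contexts (HTTPS or `localhost`).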

4. **Assistant vault access: not in MVP**
   - The assistant should not receive direct vault access during initial deployment.
   - Assistant direct vault access may be designed later after a separate security review.
   - The future design must include an authorization/user-approval option so the user can explicitly approve sensitive secret release or vault operations.

5. **Initial assistant access model: user-mediated secrets plus SSH lifecycle**
   - For now, the user remains the authority for releasing secrets from the vault.
   - The assistant may operate using existing approved SSH keys and service-specific tokens, but secret values must not be committed to git.
   - Repository documentation should record credential purpose, owner, scope, storage location, and revocation path, but never the secret value.

6. **First pilot: disposable Debian/Ubuntu LXC or VM**
   - The first access-management pilot should use a disposable Debian/Ubuntu LXC or VM.
   - The pilot proves grant, verify, revoke, and rotate flows for assistant SSH access without risking Nextcloud, AMP, Proxmox, or other production services.
   - The disposable host may be snapshotted or destroyed after testing.

7. **Backup approach: basic backup first, restore test soon after**
   - Use backup option B: establish a basic working backup first, then perform a restore test soon after deployment.
   - Backups must cover Vaultwarden data, configuration, attachments/sends/icons if enabled, admin token material, TLS/reverse-proxy config, and any backup encryption material.
   - Restore testing is part of making the vault trustworthy, not an optional later enhancement.

8. **Revocation model**
   - SSH access revocation must remove the assistant public key, remove or disable sudo where applicable, optionally lock/remove the `piagent` account, update repo SSH aliases if needed, verify login failure, and log the change.
   - API/app-password revocation must revoke the service token, update dependent services with replacement credentials, verify that the revoked token is rejected and that dependent services still work, and log the token purpose and revocation action without exposing the token.
   - Vault user revocation must disable/remove vault account access, rotate any shared credentials the user or assistant could have seen, and verify loss of access.

9. **Future enhancements are staged, not part of the bootstrap**
   - SSH certificates are a strong future option for temporary/expiring SSH access.
   - Tailscale ACLs should eventually restrict which users/devices can reach vault and management services.
   - OpenBao/Vault may be considered later for dynamic secrets, leases, audit logs, PKI, or SSH certificate authority workflows.
   - SOPS/age may be considered later for encrypted repo-managed deployment secrets.
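The revocation model requires verifying login failure after removing an assistant key. The following is a minimal sketch of that check in Python, assuming the target alias (hypothetical here, e.g. `pilot-lxc`) resolves through the operator's SSH config and that the probe is run with the key that was just revoked. It relies on OpenSSH behavior: `ssh` exits with the remote command's status, or 255 if an error occurred, and reports "Permission denied" when authentication is rejected. The helper names are illustrative and not part of any existing tooling in this repository.

```python
from __future__ import annotations

import subprocess

CONNECT_TIMEOUT_S = 5  # assumption: short timeout suitable for LAN/Tailscale hosts


def probe_command(alias: str, identity_file: str | None = None) -> list[str]:
    """Build a non-interactive SSH login probe that runs `true` on the target."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={CONNECT_TIMEOUT_S}"]
    if identity_file is not None:
        # Offer only the revoked key, not other keys from the agent or config.
        cmd += ["-i", identity_file, "-o", "IdentitiesOnly=yes"]
    return cmd + [alias, "true"]


def classify_probe(returncode: int, stderr: str) -> str:
    """Interpret an SSH probe result.

    ssh returns the remote command's exit status, or 255 on its own errors.
    """
    if returncode != 255:
        # Authentication succeeded (the remote command ran, even if it failed,
        # e.g. under a nologin shell), so the key is still accepted.
        return "still-granted"
    if "Permission denied" in stderr:
        return "revoked"
    # 255 without an auth rejection: unreachable host, DNS failure, host key issue.
    return "inconclusive"


def verify_revoked(alias: str, identity_file: str | None = None) -> str:
    """Run the probe and classify it; timeouts or a missing ssh binary are inconclusive."""
    try:
        result = subprocess.run(
            probe_command(alias, identity_file),
            capture_output=True,
            text=True,
            timeout=CONNECT_TIMEOUT_S + 10,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "inconclusive"
    return classify_probe(result.returncode, result.stderr)
```

Only `revoked` should be recorded as a verified revocation; an `inconclusive` result (for example, the host being down) should be retried rather than logged as success.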

## Not Doing

- Do not deploy Vaultwarden as part of this design artifact.
- Do not migrate every credential immediately.
- Do not store plaintext secrets, private keys, passwords, app passwords, recovery codes, or API tokens in git.
- Do not publicly expose the vault or management interfaces without separate security review.
- Do not grant the assistant direct vault access in the initial implementation.
- Do not replace the current SSH model with SSH certificates or OpenBao/Vault during the first pilot.
- Do not perform production-host access experiments before proving the flow on a disposable Debian/Ubuntu LXC or VM.
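The no-plaintext-secrets rule can be backed by a local guard before commits. The following is a minimal sketch in Python, assuming it runs as a pre-commit check over the files being committed. It only flags PEM/OpenSSH private-key armor headers, so it is a narrow backstop for one secret type rather than a general secret scanner: passwords, app passwords, and API tokens would pass it unnoticed. The script is hypothetical.

```python
import re
import sys
from pathlib import Path

# Matches private-key armor headers: BEGIN OPENSSH / RSA / EC / (plain) PRIVATE KEY.
# Public-key headers do not match because "PRIVATE KEY" is required.
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def private_key_lines(text: str) -> list:
    """Return 1-based line numbers that contain a private-key header."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if PRIVATE_KEY_HEADER.search(line)
    ]


def main(paths: list) -> int:
    """Report suspicious files; a non-zero exit blocks the commit when used as a hook."""
    found = False
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # deleted or unreadable paths are skipped
        for lineno in private_key_lines(text):
            print(f"{path}:{lineno}: looks like a private key; do not commit it")
            found = True
    return 1 if found else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

One possible wiring, assuming the script is saved as `check_private_keys.py` (a hypothetical name): `git diff --cached --name-only --diff-filter=ACM -z | xargs -0 python check_private_keys.py`. This reads the working-tree copies of the staged paths, which can differ from the staged content if files were edited after `git add`.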

## Risks and Mitigations

- **Vault lockout**: Store recovery material outside the vault, document recovery steps, preserve 2FA recovery codes, and perform a restore test soon after basic backup is configured.
- **Vault compromise**: Keep vault LAN/Tailscale-only, require TLS, use strong user authentication, keep the VM updated, restrict admin access, and avoid direct assistant vault access initially.
- **Backup failure**: Start with a basic backup immediately, then run a restore test soon after. Track backup status and restore-test date.
- **Stale assistant keys**: Document key owner, scope, creation date, target hosts, and revocation steps. Verify revocation by confirming login failure.
- **Overbroad assistant privileges**: Use dedicated `piagent` users, grant the minimum sudo/service permissions required, and require explicit approval for destructive or hypervisor-level actions.
- **Secret leakage into repo**: Continue storing only metadata and paths in repository docs. Never read or commit secret values.
- **Public exposure mistake**: Default to LAN/Tailscale-only access and require a separate review before any public exposure.
- **Unclear future automation boundary**: Defer direct assistant vault access to a separate design with user-approval and authorization controls.
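The stale-key mitigation implies keeping one metadata record per assistant key and reviewing it periodically. The following is a minimal sketch in Python, assuming a hypothetical 90-day rotation window (this design does not set one) and a record that holds only the public-key fingerprint plus the owner, scope, creation date, target hosts, and revocation steps listed above, never the private key.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

MAX_KEY_AGE_DAYS = 90  # hypothetical rotation window, not defined by this design


@dataclass(frozen=True)
class AssistantKeyRecord:
    """Repository-safe metadata for one assistant SSH key (no secret material)."""

    fingerprint: str                # public fingerprint, e.g. as reported by `ssh-keygen -lf`
    owner: str
    scope: str                      # e.g. "pilot LXC, no sudo"
    created: date
    target_hosts: tuple[str, ...]   # SSH aliases the key is installed on
    revocation_steps: str
    revoked_on: date | None = None


def stale_keys(
    records: list[AssistantKeyRecord], today: date, max_age_days: int = MAX_KEY_AGE_DAYS
) -> list[AssistantKeyRecord]:
    """Active (unrevoked) keys older than the rotation window."""
    return [
        r for r in records
        if r.revoked_on is None and (today - r.created).days > max_age_days
    ]


def incomplete_records(records: list[AssistantKeyRecord]) -> list[AssistantKeyRecord]:
    """Records missing target hosts or revocation steps, which block clean revocation."""
    return [r for r in records if not r.target_hosts or not r.revocation_steps.strip()]
```

Because the record holds only fingerprints and lifecycle facts, it fits the existing rule that repository docs store metadata, never secret values.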

## Acceptance Criteria

- `03-design.md` documents the selected architecture and major tradeoffs.
- Vaultwarden/Bitwarden in a dedicated Proxmox VM is selected as the primary vault approach.
- The vault exposure model is LAN and Tailscale only by default.
- Assistant direct vault access is explicitly deferred to a later design/security review with user-approval controls.
- The backup approach is documented as a basic backup first, followed by a restore test soon after.
- The first pilot target is a disposable Debian/Ubuntu LXC or VM for assistant SSH grant/revoke/rotate lifecycle.
- The design includes revocation expectations for assistant SSH, service/API tokens, and vault users.
- The design preserves the rule that secret values are never committed to git.
