Files
cold_display_guard/tasks/todo.md

287 lines
22 KiB
Markdown

# Task Todo
- [x] Review the current project instructions and check for task-relevant lessons.
- [x] Inspect the OTA upload API document and current runtime/webhook capture path.
- [x] Create an isolated worktree for alarm snapshot upload implementation.
- [x] Write the detailed implementation plan to `docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md`.
- [x] Execute alarm snapshot upload client TDD cycle.
- [x] Execute runtime and webhook payload integration TDD cycle.
- [x] Update config surface, docs, and verification notes.
- [x] Run targeted verification and final full verification.
## Notes
- `tasks/lessons.md` is absent in this repository/worktree, so there were no prior session lessons to review.
- Upload API reference: `/Users/glo/code/go/wenma/ai_manager/zd-ai-manager/chunk-upload-oss-service/UPLOAD_API.md`
- User-provided upload target: `https://ota.zhengxinshipin.com`
- User-provided token secret: `change-me-in-production`
## Review
- Plan saved to `docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md`.
- Chosen implementation keeps snapshot upload entirely outside `BatchEngine` and enriches webhook payloads from the runtime side using the already captured frame.
- Implemented `src/cold_display_guard/alarm_snapshots.py` for JPEG encoding plus OTA chunk-upload orchestration, runtime integration in `src/cold_display_guard/main.py`, webhook payload enrichment in `src/cold_display_guard/webhooks.py`, config exposure/secret stripping in `src/cold_display_guard/config.py` and `src/cold_display_guard/manage_api.py`, and config/doc updates in `config/example.toml` and `README_zh.md`.
- Targeted verification passed:
- `eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_alarm_snapshots.py -v`
- `eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_main.py -v`
- `eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_webhooks.py tests/test_config.py tests/test_manage_api.py -v`
- Final verification passed:
- `eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest discover -s tests -v`
- `cd web && pnpm install --frozen-lockfile && pnpm build`
## Current Task: Webhook Payload Field Gap Check
- [x] Pull the actual payload currently received by `video-recognition` and compare it against the required event list fields.
- [x] Patch webhook payload builders to include the missing non-store fields required by the downstream table.
- [x] Add or update focused webhook tests for the enriched payload shape.
- [x] Run targeted verification and record the result here.
### Current Findings
- Current received payload only includes `batch_id`, `camera_id`, `event`, `kind`, `severity`, `source_id`, `state`, `ts`, `zone_id`, and `zone_label`.
- Missing or not explicitly populated for the downstream event table: event code, camera IP, batch start time, removal time, dwell duration, discard flag, discard time, create time, alarm time, and update time.
### Field Gap Verification
- Actual receiver payload before the fix, from `video-recognition` result JSONL on `10.8.0.11`, confirmed only the base fields above and did not include the downstream table time/discard/IP fields.
- Updated `src/cold_display_guard/webhooks.py` so both `batch_event` and `case_event` now include:
- `event_code`
- `camera_ip`
- `started_at`
- `ended_at`
- `removed_at`
- `dwell_seconds`
- `is_discarded`
- `discarded_at`
- `created_at`
- `alerted_at`
- `alarm_at`
- `updated_at`
- `case_event` also now carries the missing contextual fields `camera_id`, `zone_id`, and `zone_label`.
- Verification passed:
- `PYTHONPATH=src python3 -m unittest tests/test_webhooks.py -v`
- `PYTHONPATH=src python3 -m unittest tests/test_main.py -v`
- `PYTHONPATH=src python3 -m unittest discover -s tests -v`
- Deployed updated code to `xiaozheng@10.8.0.11` without overwriting the remote `config/example.toml`, rebuilt `cold-display-guard:dev`, and restarted only `cold-display-guard-api` plus `cold-display-guard-runtime`.
- Natural post-deploy traffic did not arrive during the 2-minute observation window, so final runtime verification used the deployed container to build representative batch/case webhook payloads with the live remote config and confirmed `camera_ip = 192.168.3.4` plus all new downstream fields were present.
## Current Task: Deploy To 192.168.5.103
- [x] Inspect the existing deployment layout and active containers on `xiaozheng@192.168.5.103`.
- [x] Verify the exact webhook route on that host before writing config.
- [x] Sync the current project code to the remote deployment directory without overwriting the live RTSP and calibration config.
- [x] Configure the remote webhook settings for the local `video-recognition` receiver.
- [x] Rebuild and restart the remote API/runtime containers, then verify health and outbound webhook configuration.
### Deployment Findings
- Existing deployment path on `192.168.5.103` is `/home/xiaozheng/cold_display_guard`, not `~/apps/cold-display-guard/app`.
- The host already runs `cold-display-guard-api`, `cold-display-guard-runtime`, and `cold-display-guard-web` on ports `19080` and `23000`.
- The same host also runs `video-recognition`, and a direct probe to `http://127.0.0.1:8080/api/webhook/cold-display-guard` returned `200 OK`, so this is the verified webhook target for this environment.
### Deployment Verification
- From inside the running `cold-display-guard-api` container on `192.168.5.103`:
- `http://host.docker.internal:8080/api/webhook/cold-display-guard` failed DNS resolution.
- `http://172.17.0.1:8080/api/webhook/cold-display-guard` returned `200 OK`.
- `http://192.168.5.103:8080/api/webhook/cold-display-guard` returned `200 OK`.
- The configured webhook target was set to `http://192.168.5.103:8080/api/webhook/cold-display-guard` for both `event_url` and `case_url`.
- Remote config was enriched to include:
- `case_sink`
- `alarm_snapshot_upload`
- `webhook_retry_sink`
- `webhook_delivery_sink`
- `webhooks`
- Code sync used `rsync` with `config/example.toml` excluded so the live RTSP URL and calibration polygons were preserved.
- Remote rebuild/restart completed for `cold-display-guard-api` and `cold-display-guard-runtime`.
- Verified after restart:
- `GET http://127.0.0.1:19080/api/manage/health` returned `status=ok`
- `GET http://127.0.0.1:19080/api/manage/config` showed `webhooks.enabled=true`
- `event_url` and `case_url` both active on `http://192.168.5.103:8080/api/webhook/cold-display-guard`
- `alarm_snapshot_upload.enabled=true`
## Current Task: Alarm Snapshot Calibration Overlay
**Goal:** Webhook-linked uploaded alarm snapshots should visually include the calibrated cold display zones and trash confirmation ROI from the current config.
**Design:** Keep the existing runtime flow intact: capture current RTSP frame, process events, then upload an alarm snapshot only for warning/alarm events. Before JPEG encoding, build overlay regions from `[[zones]]` plus `[trash].roi`, clamp normalized polygon coordinates to the image bounds, draw a semi-transparent fill and visible outline directly onto a copied `Frame.rgb`, and pass that annotated frame to the existing encoder/uploader. Do not change `BatchEngine`, Webhook payload shape, OTA upload protocol, or management snapshot capture.
- [x] Review task-relevant lessons and current dirty worktree.
- [x] Inspect `alarm_snapshots.py`, `main.py`, config polygon shape, and existing tests.
- [x] Write a failing unit test proving alert snapshot upload encodes an annotated frame when zones/trash ROI are configured.
- [x] Write focused unit tests for polygon overlay behavior using a tiny RGB frame.
- [x] Run targeted tests and confirm the new tests fail for the expected missing overlay behavior.
- [x] Implement the smallest standard-library overlay helper in `src/cold_display_guard/alarm_snapshots.py`.
- [x] Wire `capture_alert_snapshot` to apply configured overlays before JPEG encoding.
- [x] Run targeted snapshot/runtime tests.
- [x] Run the full Python test suite.
### Review
- Added `apply_calibration_overlay` in `src/cold_display_guard/alarm_snapshots.py` to draw configured food-zone polygons in yellow and the trash ROI in red onto a copied frame before JPEG encoding and OTA upload.
- The overlay clamps normalized coordinates to image bounds, draws semi-transparent fills plus outlines, and leaves the original `Frame.rgb` unchanged for downstream runtime processing.
- `capture_alert_snapshot` now encodes the annotated frame when warning/alarm events trigger snapshot upload; non-alert events and disabled upload behavior are unchanged.
- Targeted verification passed:
- `PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v`
- `PYTHONPATH=src python3 -m unittest tests/test_main.py -v`
- Full verification passed:
- `PYTHONPATH=src python3 -m unittest discover -s tests -v`
## Current Task: Deploy Overlay Update To 10.8.0.23
**Goal:** Deploy the alarm snapshot calibration overlay change to `xiaozheng@10.8.0.23` without overwriting live RTSP/calibration config or unrelated local changes.
**Plan:** Inspect the remote deployment layout first, confirm which containers are active, sync only the runtime source file required for the overlay change, rebuild/restart the API/runtime services that use the Python image, and verify both service health and the deployed source code.
- [x] Inspect remote deployment directory, Docker/Compose files, and active containers on `xiaozheng@10.8.0.23`.
- [x] Confirm the remote config file remains present and is not overwritten.
- [x] Sync `src/cold_display_guard/alarm_snapshots.py` to the remote deployment path.
- [x] Rebuild and restart only the affected `cold-display-guard-api` and `cold-display-guard-runtime` services when Compose is available.
- [x] Verify management API health after restart.
- [x] Verify the deployed remote source contains `apply_calibration_overlay`.
### Deployment Review
- Remote deployment path confirmed as `/home/xiaozheng/cold_display_guard`.
- Active services before deployment: `cold-display-guard-api`, `cold-display-guard-runtime`, and `cold-display-guard-web`.
- Remote live `config/example.toml` was checked before and after deployment and was not overwritten.
- Synced only `src/cold_display_guard/alarm_snapshots.py` to avoid deploying unrelated local `web/nginx.conf` changes.
- Created a timestamped backup of the previous remote `alarm_snapshots.py` beside the source file before syncing.
- Rebuilt `cold-display-guard:dev` with `docker compose --env-file deploy/cold-display-guard.env -f deploy/docker-compose.yml build cold-display-guard-api`.
- Restarted only `cold-display-guard-api` and `cold-display-guard-runtime` with Compose; `cold-display-guard-web` remained untouched.
- Verification passed:
- `curl http://127.0.0.1:19080/api/manage/health` returned `status=ok` and `runtime_status=running`.
- `docker exec cold-display-guard-api python3 -c ...` confirmed `apply_calibration_overlay` exists in the running image with signature `(frame, config) -> Frame`.
- API and runtime logs show normal startup after restart.
## Current Task: Update Timing Parameters On 10.8.0.23
**Goal:** Adjust the live timing settings on `xiaozheng@10.8.0.23` per operator request.
**Applied mapping:** The current application has no separate pre-warning threshold. It supports `max_dwell_seconds` for the time alarm/overdue threshold and `trash_confirmation_seconds` for the disposal confirmation window before warning escalation. Applied `max_dwell_seconds = 120` and `trash_confirmation_seconds = 30`.
- [x] Back up `/home/xiaozheng/cold_display_guard/config/example.toml`.
- [x] Update `[thresholds].max_dwell_seconds` from `300` to `120`.
- [x] Update `[thresholds].trash_confirmation_seconds` from `120` to `30`.
- [x] Restart `cold-display-guard-api` and `cold-display-guard-runtime`.
- [x] Verify `/api/manage/health`.
- [x] Verify `/api/manage/config` returns `{"max_dwell_seconds": 120, "trash_confirmation_seconds": 30}`.
### Timing Update Review
- Remote config was edited in place after creating a timestamped backup.
- `cold-display-guard-api` and `cold-display-guard-runtime` were explicitly restarted with Docker Compose.
- `cold-display-guard-web` was not restarted.
- Verification passed:
- `GET http://127.0.0.1:19080/api/manage/health` returned `status=ok` and `runtime_status=running`.
- `GET http://127.0.0.1:19080/api/manage/config` returned `max_dwell_seconds = 120` and `trash_confirmation_seconds = 30`.
- Container status showed `cold-display-guard-api` healthy and `cold-display-guard-runtime` running after restart.
- Note: requested `预警时长 = 1min` is not independently configurable in the current codebase; supporting distinct pre-warning at 60 seconds and overdue at 120 seconds would require a code change.
## Current Task: Pre-Warning Alarm Flow And Full Webhook/MQTT Chain
**Goal:** Implement the requested camera-side timing flow, deploy it to `xiaozheng@10.8.0.23`, and verify the Webhook -> `video_recognition_local` -> MQTT -> `store_data_platform` chain.
**Design:** Keep all timing decisions inside `cold_display_guard.BatchEngine`. Add separate thresholds for pre-warning, alarm, and alarm-removal timeout; emit explicit lifecycle events so downstream services do not infer camera-side timers. Keep `video_recognition_local` as a transparent Webhook/MQTT bridge, and update `store_data_platform` only where event names map to notifications, case types, and CRM penalty submission.
- [x] Review task-relevant instructions, lessons, and dirty worktree.
- [x] Inspect the current cold-display engine, case store, webhook payload, and tests.
- [x] Inspect `video_recognition_local` cold-display Webhook receiver and MQTT publisher.
- [x] Inspect `store_data_platform` cold-display MQTT consumer, notification mapping, and CRM submission trigger.
- [x] Inspect `xiaozheng@10.8.0.23` active containers and deployment paths.
- [x] Add failing cold-display engine/case/config/webhook tests for `time_pre_warning`, `pre_warning_handled`, `time_alarm`, and `alarm_removal_timeout`.
- [x] Implement the camera-side state machine and config fields.
- [x] Add/adjust `video_recognition_local` passthrough tests for the new event names.
- [x] Add/adjust `store_data_platform` tests and mappings for new event semantics.
- [x] Run local targeted and full relevant verification.
- [x] Deploy changed services to `xiaozheng@10.8.0.23` without overwriting live RTSP/calibration secrets.
- [x] Update the remote timing config to `pre_warning_seconds=60`, `max_dwell_seconds=120`, `alarm_removal_seconds=30`, `trash_confirmation_seconds=30`.
- [x] Verify remote Webhook target reachability from the cold-display container to local `video-recognition`.
- [x] Observe cold-display, video-recognition, MQTT, and platform logs; record the result.
### Current Findings
- `cold_display_guard` currently has only `max_dwell_seconds` and `trash_confirmation_seconds`; it cannot independently represent 1-minute pre-warning, 2-minute alarm, and 30-second alarm-removal timeout.
- `video_recognition_local` receives `/api/webhook/cold-display-guard` payloads as generic JSON and forwards them to MQTT; new event names should remain transparent, but tests should lock this behavior.
- `store_data_platform` currently treats `time_alarm` and `batch_pending_disposal` as warning notifications, and only `warning_escalated` triggers CRM penalty submission. This must change so `time_pre_warning` is the warning, `time_alarm` is the alert reminder, and `alarm_removal_timeout` triggers CRM submission.
- On `10.8.0.23`, active containers include `cold-display-guard-*`, `video-recognition`, and `mosquitto`; `video-recognition` runs with host networking, while `cold-display-guard-api` runs on its Compose network.
### Local Verification
- Cold-display full Python suite passed: `PYTHONPATH=src python3 -m unittest discover -s tests -v` (`98` tests).
- `video_recognition_local` cold-display focused tests passed: `go test ./internal/server ./internal/mqtt ./cmd -run 'TestColdDisplayGuard|Test.*ColdDisplayGuard' -count=1`.
- `store_data_platform` display-cabinet service focused tests passed: `go test ./store_data/service -run 'Test.*StoreDisplayCabinet|TestResolveStoreDisplayCabinet.*|TestShouldSubmitStoreDisplayCabinetPenalty|TestBuildStoreDisplayCabinet.*' -count=1`.
### Deployment Review
- Synced only these cold-display source files to `xiaozheng@10.8.0.23:/home/xiaozheng/cold_display_guard/src/cold_display_guard/`: `models.py`, `config.py`, `engine.py`, `cases.py`, `webhooks.py`.
- Backed up the remote source files and live `config/example.toml` before deployment.
- Updated the live remote thresholds to `pre_warning_seconds=60`, `max_dwell_seconds=120`, `alarm_removal_seconds=30`, and `trash_confirmation_seconds=30`.
- Updated the live remote Webhook target from the unreachable old host to `http://10.8.0.23:8080/api/webhook/cold-display-guard`.
- Rebuilt `cold-display-guard:dev` and restarted only `cold-display-guard-api` and `cold-display-guard-runtime`.
- Remote verification passed:
- `GET /api/manage/health` returned `status=ok` and `runtime_status=running`.
- `GET /api/manage/config` returned the four expected threshold values and the new Webhook target.
- Container-side synthetic engine run emitted `batch_started`, `time_pre_warning`, `time_alarm`, `alarm_removal_timeout`, then `batch_pending_disposal` plus `batch_discarded`.
- Natural runtime log emitted `alarm_removal_timeout` for `batch_000881` at `2026-06-15T11:52:20+08:00`.
- Webhook delivery for that event returned HTTP `200` from `video-recognition`.
- `video_recognition_local` result JSONL recorded both `alarm_removal_timeout` batch and case events.
- MQTT probe confirmed `video-recognition` published to `video/cold-display-guard/result/cold-display-guard` with `device_identifier=cold-display-guard`.
- `store_data_platform` is not deployed on `10.8.0.23` under that repository name or as an identifiable container; platform handling changes were completed and verified in the local repository.
- The cold-display retry queue has no pending entries; old `192.168.5.103` failures are already dead-letter history.
## Current Task: Alarm Snapshot Labels And Zone Colors
**Goal:** Uploaded alarm screenshots should show each calibrated region name directly on the image, and different cold-display zones should use different overlay colors.
**Design:** Extend the existing standard-library overlay path. Keep drawing configured polygons before JPEG upload, but carry a display label for each region, choose a stable color from a fixed palette by zone order, and draw a small high-contrast text label inside the polygon. Keep trash ROI red and labeled separately.
- [x] Inspect the current calibration overlay helper and tests.
- [x] Add failing tests for per-zone colors and visible region labels.
- [x] Implement labels and stable zone color palette.
- [x] Run snapshot tests and full Python tests.
- [x] Deploy the overlay update to `xiaozheng@10.8.0.23`.
- [x] Verify remote API/runtime health and deployed overlay helper.
### Review
- `apply_calibration_overlay` now assigns each cold-display zone a stable color from a fixed palette and keeps the trash ROI red.
- Each overlay region now carries a label and draws a small high-contrast label box directly on the frame before JPEG encoding/upload.
- The built-in label renderer covers common现场 labels such as `区域 1` through digits and `垃圾区`, plus basic ASCII for custom numeric/English labels.
- Verification passed:
- `PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v`
- `PYTHONPATH=src python3 -m unittest discover -s tests -v` (`99` tests)
- Deployed `src/cold_display_guard/alarm_snapshots.py` to `xiaozheng@10.8.0.23` after backing up the previous remote file.
- Rebuilt `cold-display-guard:dev` and restarted `cold-display-guard-api` plus `cold-display-guard-runtime`.
- Remote verification passed:
- `GET /api/manage/health` returned `status=ok` and `runtime_status=running`.
- Container-side overlay smoke test confirmed two zones render different RGB values and label text pixels are present.
## Current Task: Alarm Snapshot Chinese Label Rendering Fix
**Goal:** Fix unreadable/garbled Chinese region names on uploaded alarm screenshots while keeping per-zone colors and fallback labeling robust.
**Design:** Use a real CJK font renderer for Chinese labels in the alarm snapshot overlay path. Install Noto CJK fonts in the runtime image, render labels through ffmpeg `drawtext` when the font is available, and fall back to readable ASCII labels if the font renderer is unavailable.
- [x] Reproduce and identify the likely root cause: remote container only matched DejaVu for `zh-cn`, so Chinese labels had no real CJK font path.
- [x] Add regression tests for Docker CJK font installation and readable ASCII fallback labels.
- [x] Update `Dockerfile` to install `fonts-noto-cjk`.
- [x] Update `alarm_snapshots.py` to prefer CJK font rendering and use `R1`/`TRASH` fallback text when needed.
- [x] Run focused and full local Python verification.
- [x] Deploy `Dockerfile` and `alarm_snapshots.py` to `xiaozheng@10.8.0.23` without overwriting live config.
- [x] Rebuild/restart `cold-display-guard-api` and `cold-display-guard-runtime`.
- [x] Verify remote API/runtime health, CJK font availability, overlay smoke behavior, and runtime logs.
### Review
- Root cause was the screenshot overlay path not having a real Chinese font renderer in the deployed image; the container matched DejaVu before this fix.
- The rebuilt remote container now reports `NotoSansCJK-Regular.ttc: "Noto Sans CJK SC" "Regular"` for `fc-match :lang=zh-cn`.
- Remote overlay smoke test confirmed `find_cjk_font_file()` returns `/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc`, Chinese labels change the frame, bright label pixels are present, and different regions retain distinct colors.
- Local verification passed:
- `PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v`
- `PYTHONPATH=src python3 -m unittest discover -s tests -v` (`101` tests)
- Remote verification passed:
- `GET /api/manage/health` returned `status=ok`, `runtime_status=running`, and version `dev`.
- `cold-display-guard-api` is healthy and `cold-display-guard-runtime` is running after restart.
- Runtime logs show normal startup after the restart.