cold_display_guard/tasks/todo.md
skye.yue 46889c0621 feat: draw calibration overlay on alarm snapshots
Before JPEG encoding and OTA upload, paint the configured [[zones]]
polygons (yellow) and the [trash].roi (red) directly onto a copied
Frame.rgb so uploaded alarm snapshots visually carry the calibrated
regions. Normalized coordinates are clamped to image bounds, the source
frame stays untouched for downstream runtime processing, and
non-alert/disabled paths are unchanged. Adds stdlib-only polygon
fill/outline helpers plus focused unit tests.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-15 12:34:46 +08:00

Task Todo

  • Review the current project instructions and check for task-relevant lessons.
  • Inspect the OTA upload API document and current runtime/webhook capture path.
  • Create an isolated worktree for alarm snapshot upload implementation.
  • Write the detailed implementation plan to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • Execute alarm snapshot upload client TDD cycle.
  • Execute runtime and webhook payload integration TDD cycle.
  • Update config surface, docs, and verification notes.
  • Run targeted verification and final full verification.

Notes

  • tasks/lessons.md is absent in this repository/worktree, so there were no prior session lessons to review.
  • Upload API reference: /Users/glo/code/go/wenma/ai_manager/zd-ai-manager/chunk-upload-oss-service/UPLOAD_API.md
  • User-provided upload target: https://ota.zhengxinshipin.com
  • User-provided token secret: change-me-in-production

Review

  • Plan saved to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • The chosen implementation keeps snapshot upload entirely outside BatchEngine and enriches webhook payloads on the runtime side using the already captured frame.
  • Implemented:
    • src/cold_display_guard/alarm_snapshots.py: JPEG encoding plus OTA chunk-upload orchestration.
    • src/cold_display_guard/main.py: runtime integration.
    • src/cold_display_guard/webhooks.py: webhook payload enrichment.
    • src/cold_display_guard/config.py and src/cold_display_guard/manage_api.py: config exposure and secret stripping.
    • config/example.toml and README_zh.md: config and documentation updates.
  • Targeted verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_alarm_snapshots.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_main.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_webhooks.py tests/test_config.py tests/test_manage_api.py -v
  • Final verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest discover -s tests -v
    • cd web && pnpm install --frozen-lockfile && pnpm build
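The secret stripping for the management API can be sketched roughly as follows. This is a minimal sketch, assuming the config is a plain nested dict and that secrets sit under hypothetical key names such as `token_secret`; the real keys and redaction marker used in config.py and manage_api.py are not recorded in these notes.

```python
from typing import Any

# Hypothetical key names treated as secrets; the real config may use others.
SECRET_KEYS = frozenset({"token_secret", "password", "secret"})
REDACTED = "***"


def strip_secrets(config: dict[str, Any]) -> dict[str, Any]:
    """Return a new nested structure with non-empty secret values redacted.

    New dicts and lists are built on the way down and the input is never
    mutated, so the runtime keeps the real secret while the management API
    only sees the redacted copy.
    """
    def _walk(node: Any) -> Any:
        if isinstance(node, dict):
            return {
                # Empty secrets are left as-is so "unset" stays visible.
                key: (REDACTED if key in SECRET_KEYS and value else _walk(value))
                for key, value in node.items()
            }
        if isinstance(node, list):
            return [_walk(item) for item in node]
        return node

    return _walk(config)
```

Redacting on a copy rather than deleting keys keeps the exposed config shape stable for the web UI.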

Current Task: Webhook Payload Field Gap Check

  • Pull the actual payload currently received by video-recognition and compare it against the required event list fields.
  • Patch webhook payload builders to include the missing non-store fields required by the downstream table.
  • Add or update focused webhook tests for the enriched payload shape.
  • Run targeted verification and record the result here.

Current Findings

  • The currently received payload includes only batch_id, camera_id, event, kind, severity, source_id, state, ts, zone_id, and zone_label.
  • Missing or not explicitly populated for the downstream event table: event code, camera IP, batch start time, removal time, dwell duration, discard flag, discard time, create time, alarm time, and update time.
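The comparison against the downstream table can be expressed as a small set check. This sketch assumes the table columns map one-to-one onto the snake_case names later added to webhooks.py (event code to `event_code`, batch start time to `started_at`, and so on); that mapping is an interpretation, not a documented schema.

```python
# Downstream event-table fields, assuming they map onto the webhook names
# event_code, camera_ip, started_at, ... in table order.
REQUIRED_FIELDS = (
    "event_code", "camera_ip", "started_at", "removed_at", "dwell_seconds",
    "is_discarded", "discarded_at", "created_at", "alarm_at", "updated_at",
)


def missing_fields(payload: dict) -> list[str]:
    """Return required fields that are absent or None, preserving table order."""
    # `is None` rather than falsiness, so is_discarded=False counts as present.
    return [name for name in REQUIRED_FIELDS if payload.get(name) is None]
```

Running it over a record from the receiver's result JSONL gives the gap list directly.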

Field Gap Verification

  • The actual receiver payload before the fix, taken from the video-recognition result JSONL on 10.8.0.11, contained only the base fields above and none of the downstream table's time, discard, or IP fields.
  • Updated src/cold_display_guard/webhooks.py so both batch_event and case_event now include:
    • event_code
    • camera_ip
    • started_at
    • ended_at
    • removed_at
    • dwell_seconds
    • is_discarded
    • discarded_at
    • created_at
    • alerted_at
    • alarm_at
    • updated_at
  • case_event now also carries the previously missing contextual fields camera_id, zone_id, and zone_label.
  • Verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_webhooks.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v
  • Deployed updated code to xiaozheng@10.8.0.11 without overwriting the remote config/example.toml, rebuilt cold-display-guard:dev, and restarted only cold-display-guard-api plus cold-display-guard-runtime.
  • No natural traffic arrived during the 2-minute post-deploy observation window, so final runtime verification used the deployed container to build representative batch/case webhook payloads from the live remote config; they contained camera_ip = 192.168.3.4 and all of the new downstream fields.
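A rough sketch of how such an enriched payload could be assembled is shown below. It is illustrative only: `enrich_payload` is a hypothetical helper, and the value semantics (dwell measured to removal or to now, `ended_at` equal to removal time, `created_at` equal to batch start, `alerted_at` equal to alarm time) are assumptions, since these notes list the field names but not how webhooks.py fills them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Timestamps in these notes use +08:00; assumed here for illustration.
CST = timezone(timedelta(hours=8))


def _iso(value: Optional[datetime]) -> Optional[str]:
    """Serialize an aware datetime as ISO 8601, keeping None as JSON null."""
    return value.isoformat() if value is not None else None


def enrich_payload(
    base: dict,
    *,
    event_code: str,
    camera_ip: str,
    started_at: datetime,
    now: datetime,
    removed_at: Optional[datetime] = None,
    alarm_at: Optional[datetime] = None,
    discarded_at: Optional[datetime] = None,
) -> dict:
    """Return a copy of base with the downstream event-table fields added."""
    # Assumption: dwell runs until removal, or until now if still present.
    dwell_end = removed_at if removed_at is not None else now
    payload = dict(base)  # leave the caller's base payload untouched
    payload.update(
        event_code=event_code,
        camera_ip=camera_ip,
        started_at=_iso(started_at),
        ended_at=_iso(removed_at),    # assumed to equal removal time
        removed_at=_iso(removed_at),
        dwell_seconds=int((dwell_end - started_at).total_seconds()),
        is_discarded=discarded_at is not None,
        discarded_at=_iso(discarded_at),
        created_at=_iso(started_at),  # assumed to equal batch start
        alerted_at=_iso(alarm_at),    # assumed to equal alarm time
        alarm_at=_iso(alarm_at),
        updated_at=_iso(now),
    )
    return payload
```

Keeping unset times as `None` means the receiver sees explicit JSON nulls instead of missing keys, which makes the field gap check above report nothing once the builder is in place.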

Current Task: Deploy To 192.168.5.103

  • Inspect the existing deployment layout and active containers on xiaozheng@192.168.5.103.
  • Verify the exact webhook route on that host before writing config.
  • Sync the current project code to the remote deployment directory without overwriting the live RTSP and calibration config.
  • Configure the remote webhook settings for the local video-recognition receiver.
  • Rebuild and restart the remote API/runtime containers, then verify health and outbound webhook configuration.

Deployment Findings

  • Existing deployment path on 192.168.5.103 is /home/xiaozheng/cold_display_guard, not ~/apps/cold-display-guard/app.
  • The host already runs cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web on ports 19080 and 23000.
  • The same host also runs video-recognition, and a direct probe to http://127.0.0.1:8080/api/webhook/cold-display-guard returned 200 OK, confirming that the webhook route for this environment is served on port 8080 at /api/webhook/cold-display-guard.

Deployment Verification

  • From inside the running cold-display-guard-api container on 192.168.5.103:
    • http://host.docker.internal:8080/api/webhook/cold-display-guard failed DNS resolution.
    • http://172.17.0.1:8080/api/webhook/cold-display-guard returned 200 OK.
    • http://192.168.5.103:8080/api/webhook/cold-display-guard returned 200 OK.
  • The webhook target was set to http://192.168.5.103:8080/api/webhook/cold-display-guard for both event_url and case_url.
  • Remote config was enriched to include:
    • case_sink
    • alarm_snapshot_upload
    • webhook_retry_sink
    • webhook_delivery_sink
    • webhooks
  • Code sync used rsync with config/example.toml excluded so the live RTSP URL and calibration polygons were preserved.
  • Remote rebuild/restart completed for cold-display-guard-api and cold-display-guard-runtime.
  • Verified after restart:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok
    • GET http://127.0.0.1:19080/api/manage/config showed webhooks.enabled=true
    • event_url and case_url both active on http://192.168.5.103:8080/api/webhook/cold-display-guard
    • alarm_snapshot_upload.enabled=true
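The in-container reachability check can be scripted with the standard library. This is a sketch, assuming a plain GET is an adequate probe (as the 200 OK results above suggest); a receiver that only accepts POST would need a POST probe instead. `http_status` and `first_reachable` are hypothetical helpers, not part of the project.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """GET url; return the HTTP status code, or None on DNS/connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # server answered with 4xx/5xx
        return exc.code
    except (urllib.error.URLError, OSError):  # DNS failure, refused, timeout
        return None


def first_reachable(
    candidates: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> Optional[str]:
    """Return the first candidate URL whose probe answers 200, else None."""
    for url in candidates:
        if probe(url) == 200:
            return url
    return None
```

Run inside the cold-display-guard-api container (for example through `docker exec`), listing candidates in order of preference. Both 172.17.0.1 and 192.168.5.103 answered here, so putting the host LAN address first reproduces the choice recorded above.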

Current Task: Alarm Snapshot Calibration Overlay

Goal: Webhook-linked uploaded alarm snapshots should visually include the calibrated cold display zones and trash confirmation ROI from the current config.

Design: Keep the existing runtime flow intact: capture current RTSP frame, process events, then upload an alarm snapshot only for warning/alarm events. Before JPEG encoding, build overlay regions from [[zones]] plus [trash].roi, clamp normalized polygon coordinates to the image bounds, draw a semi-transparent fill and visible outline directly onto a copied Frame.rgb, and pass that annotated frame to the existing encoder/uploader. Do not change BatchEngine, Webhook payload shape, OTA upload protocol, or management snapshot capture.
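A standard-library sketch of that overlay step follows. It assumes a hypothetical `Frame` with `width`, `height`, and row-major 3-byte RGB data, and polygons given as lists of normalized `(x, y)` pairs; `apply_overlay` and `calibration_regions` are illustrative names, not the project's `apply_calibration_overlay(frame, config)`. Fill uses even-odd ray casting with alpha blending, and the outline is drawn by sampling points along each edge.

```python
import math
from dataclasses import dataclass, replace

YELLOW = (255, 255, 0)  # food zones
RED = (255, 0, 0)       # trash ROI


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for the runtime frame: row-major RGB, 3 bytes/pixel."""
    width: int
    height: int
    rgb: bytes


def _to_pixels(polygon, width, height):
    """Clamp normalized (0..1) vertices to the image, then scale to pixel space."""
    return [
        (min(max(x, 0.0), 1.0) * (width - 1), min(max(y, 0.0), 1.0) * (height - 1))
        for x, y in polygon
    ]


def _inside(px, py, pts):
    """Even-odd ray-casting point-in-polygon test."""
    inside = False
    j = len(pts) - 1
    for i in range(len(pts)):
        (xi, yi), (xj, yj) = pts[i], pts[j]
        if (yi > py) != (yj > py):  # edge straddles the horizontal ray
            x_cross = xi + (py - yi) * (xj - xi) / (yj - yi)
            if px < x_cross:
                inside = not inside
        j = i
    return inside


def _draw_segment(buf, width, p0, p1, color):
    """Paint an opaque 1-pixel line by sampling points along the segment."""
    steps = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    for s in range(steps + 1):
        t = s / steps
        x = round(p0[0] + (p1[0] - p0[0]) * t)
        y = round(p0[1] + (p1[1] - p0[1]) * t)
        idx = (y * width + x) * 3
        buf[idx:idx + 3] = bytes(color)


def apply_overlay(frame, regions, alpha=0.35):
    """Return a new Frame with each (polygon, color) region filled and outlined."""
    buf = bytearray(frame.rgb)  # private copy: the source frame is never modified
    for polygon, color in regions:
        pts = _to_pixels(polygon, frame.width, frame.height)
        if len(pts) < 3:
            continue  # not a polygon
        # Scan only the bounding box to keep the pure-Python loop cheap.
        x0, x1 = int(min(p[0] for p in pts)), math.ceil(max(p[0] for p in pts))
        y0, y1 = int(min(p[1] for p in pts)), math.ceil(max(p[1] for p in pts))
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                if _inside(x, y, pts):
                    idx = (y * frame.width + x) * 3
                    for c in range(3):  # alpha-blend toward the region color
                        buf[idx + c] = round(buf[idx + c] * (1 - alpha) + color[c] * alpha)
        for i in range(len(pts)):
            _draw_segment(buf, frame.width, pts[i], pts[(i + 1) % len(pts)], color)
    return replace(frame, rgb=bytes(buf))


def calibration_regions(zone_polygons, trash_roi=None):
    """Yellow for each [[zones]] polygon, red for [trash].roi (config shape assumed)."""
    regions = [(poly, YELLOW) for poly in zone_polygons]
    if trash_roi:
        regions.append((trash_roi, RED))
    return regions
```

Clamping before scaling guarantees every painted index stays inside the buffer, so out-of-range calibration points degrade to the image edge instead of raising.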

  • Review task-relevant lessons and current dirty worktree.
  • Inspect alarm_snapshots.py, main.py, config polygon shape, and existing tests.
  • Write a failing unit test proving alert snapshot upload encodes an annotated frame when zones/trash ROI are configured.
  • Write focused unit tests for polygon overlay behavior using a tiny RGB frame.
  • Run targeted tests and confirm the new tests fail for the expected missing overlay behavior.
  • Implement the smallest standard-library overlay helper in src/cold_display_guard/alarm_snapshots.py.
  • Wire capture_alert_snapshot to apply configured overlays before JPEG encoding.
  • Run targeted snapshot/runtime tests.
  • Run the full Python test suite.

Review

  • Added apply_calibration_overlay in src/cold_display_guard/alarm_snapshots.py to draw configured food-zone polygons in yellow and the trash ROI in red onto a copied frame before JPEG encoding and OTA upload.
  • The overlay clamps normalized coordinates to image bounds, draws semi-transparent fills plus outlines, and leaves the original Frame.rgb unchanged for downstream runtime processing.
  • capture_alert_snapshot now encodes the annotated frame when warning/alarm events trigger snapshot upload; non-alert events and disabled upload behavior are unchanged.
  • Targeted verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
  • Full verification passed:
    • PYTHONPATH=src python3 -m unittest discover -s tests -v

Current Task: Deploy Overlay Update To 10.8.0.23

Goal: Deploy the alarm snapshot calibration overlay change to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration config or unrelated local changes.

Plan: Inspect the remote deployment layout first, confirm which containers are active, sync only the runtime source file required for the overlay change, rebuild/restart the API/runtime services that use the Python image, and verify both service health and the deployed source code.

  • Inspect remote deployment directory, Docker/Compose files, and active containers on xiaozheng@10.8.0.23.
  • Confirm the remote config file remains present and is not overwritten.
  • Sync src/cold_display_guard/alarm_snapshots.py to the remote deployment path.
  • Rebuild and restart only the affected cold-display-guard-api and cold-display-guard-runtime services when Compose is available.
  • Verify management API health after restart.
  • Verify the deployed remote source contains apply_calibration_overlay.

Deployment Review

  • Remote deployment path confirmed as /home/xiaozheng/cold_display_guard.
  • Active services before deployment: cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web.
  • Remote live config/example.toml was checked before and after deployment and was not overwritten.
  • Synced only src/cold_display_guard/alarm_snapshots.py to avoid deploying unrelated local web/nginx.conf changes.
  • Created a timestamped backup of the previous remote alarm_snapshots.py beside the source file before syncing.
  • Rebuilt cold-display-guard:dev with docker compose --env-file deploy/cold-display-guard.env -f deploy/docker-compose.yml build cold-display-guard-api.
  • Restarted only cold-display-guard-api and cold-display-guard-runtime with Compose; cold-display-guard-web remained untouched.
  • Verification passed:
    • curl http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • docker exec cold-display-guard-api python3 -c ... confirmed apply_calibration_overlay exists in the running image with signature (frame, config) -> Frame.
    • API and runtime logs show normal startup after restart.
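The exact `python3 -c` command is elided above. A comparable check could look like the sketch below, where `has_function` is a hypothetical helper that compares parameter names only (not annotations or the `-> Frame` return type).

```python
import importlib
import inspect
from types import ModuleType
from typing import Union


def has_function(module: Union[str, ModuleType], name: str, params: tuple) -> bool:
    """True if module exposes a callable `name` whose parameter names equal params."""
    mod = importlib.import_module(module) if isinstance(module, str) else module
    func = getattr(mod, name, None)
    if not callable(func):
        return False
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError):  # some builtins expose no signature
        return False
    return tuple(signature.parameters) == tuple(params)
```

Inside the running image this would be called as `has_function("cold_display_guard.alarm_snapshots", "apply_calibration_overlay", ("frame", "config"))`.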

Current Task: Update Timing Parameters On 10.8.0.23

Goal: Adjust the live timing settings on xiaozheng@10.8.0.23 per operator request.

Applied mapping: The current application has no separate pre-warning threshold. It supports max_dwell_seconds for the time alarm/overdue threshold and trash_confirmation_seconds for the disposal confirmation window before warning escalation. Applied max_dwell_seconds = 120 and trash_confirmation_seconds = 30.

  • Back up /home/xiaozheng/cold_display_guard/config/example.toml.
  • Update [thresholds].max_dwell_seconds from 300 to 120.
  • Update [thresholds].trash_confirmation_seconds from 120 to 30.
  • Restart cold-display-guard-api and cold-display-guard-runtime.
  • Verify /api/manage/health.
  • Verify /api/manage/config returns {"max_dwell_seconds": 120, "trash_confirmation_seconds": 30}.
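The in-place threshold edit can be done as a line-based rewrite scoped to the `[thresholds]` table, so identically named keys in other tables are left alone. This is a sketch under the assumption that the section header is written exactly as `[thresholds]` on its own line; `set_threshold` is a hypothetical helper, and any inline comment after a replaced value is dropped.

```python
import re


def set_threshold(toml_text: str, key: str, value: int) -> str:
    """Replace `key = ...` inside [thresholds] only; raise KeyError if absent."""
    out, in_section, replaced = [], False, False
    for line in toml_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("["):
            # Any table or array-of-tables header ends the previous section.
            in_section = stripped == "[thresholds]"
        elif in_section and re.match(rf"{re.escape(key)}\s*=", stripped):
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            line = f"{indent}{key} = {value}{newline}"
            replaced = True
        out.append(line)
    if not replaced:
        raise KeyError(f"[thresholds].{key} not found")
    return "".join(out)
```

Raising on a missing key makes a typo in the key name fail loudly instead of silently leaving the old value in the live config.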

Timing Update Review

  • Remote config was edited in place after creating a timestamped backup.
  • cold-display-guard-api and cold-display-guard-runtime were explicitly restarted with Docker Compose.
  • cold-display-guard-web was not restarted.
  • Verification passed:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • GET http://127.0.0.1:19080/api/manage/config returned max_dwell_seconds = 120 and trash_confirmation_seconds = 30.
    • Container status showed cold-display-guard-api healthy and cold-display-guard-runtime running after restart.
  • Note: the requested pre-warning duration (预警时长) of 1 minute is not independently configurable in the current codebase; supporting a distinct pre-warning at 60 seconds and overdue at 120 seconds would require a code change.

Current Task: Pre-Warning Alarm Flow And Full Webhook/MQTT Chain

Goal: Implement the requested camera-side timing flow, deploy it to xiaozheng@10.8.0.23, and verify the Webhook -> video_recognition_local -> MQTT -> store_data_platform chain.

Design: Keep all timing decisions inside cold_display_guard.BatchEngine. Add separate thresholds for pre-warning, alarm, and alarm-removal timeout; emit explicit lifecycle events so downstream services do not infer camera-side timers. Keep video_recognition_local as a transparent Webhook/MQTT bridge, and update store_data_platform only where event names map to notifications, case types, and CRM penalty submission.
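The three camera-side timers can be sketched as a per-batch object that emits each lifecycle event once. This is not BatchEngine: `Thresholds` and `BatchTimer` are hypothetical, the alarm-removal window is assumed to start at the alarm (so `alarm_removal_timeout` fires at `max_dwell_seconds + alarm_removal_seconds`), and removal handling plus `pre_warning_handled` are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Thresholds:
    pre_warning_seconds: float = 60
    max_dwell_seconds: float = 120
    alarm_removal_seconds: float = 30


@dataclass
class BatchTimer:
    started_at: float
    emitted: set[str] = field(default_factory=set)

    def tick(self, now: float, t: Thresholds) -> list[str]:
        """Return lifecycle events that became due at `now`, each emitted once."""
        age = now - self.started_at
        schedule = [
            ("time_pre_warning", t.pre_warning_seconds),
            ("time_alarm", t.max_dwell_seconds),
            # Assumption: removal window is counted from the alarm.
            ("alarm_removal_timeout", t.max_dwell_seconds + t.alarm_removal_seconds),
        ]
        events = []
        for name, deadline in schedule:
            if age >= deadline and name not in self.emitted:
                self.emitted.add(name)
                events.append(name)
        return events
```

Emitting explicit, deduplicated events is what lets video_recognition_local stay a transparent bridge: downstream services react to names instead of recomputing timers.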

  • Review task-relevant instructions, lessons, and dirty worktree.
  • Inspect the current cold-display engine, case store, webhook payload, and tests.
  • Inspect video_recognition_local cold-display Webhook receiver and MQTT publisher.
  • Inspect store_data_platform cold-display MQTT consumer, notification mapping, and CRM submission trigger.
  • Inspect xiaozheng@10.8.0.23 active containers and deployment paths.
  • Add failing cold-display engine/case/config/webhook tests for time_pre_warning, pre_warning_handled, time_alarm, and alarm_removal_timeout.
  • Implement the camera-side state machine and config fields.
  • Add/adjust video_recognition_local passthrough tests for the new event names.
  • Add/adjust store_data_platform tests and mappings for new event semantics.
  • Run local targeted and full relevant verification.
  • Deploy changed services to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration secrets.
  • Update the remote timing config to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, trash_confirmation_seconds=30.
  • Verify remote Webhook target reachability from the cold-display container to local video-recognition.
  • Observe cold-display, video-recognition, MQTT, and platform logs; record the result.

Current Findings

  • cold_display_guard currently has only max_dwell_seconds and trash_confirmation_seconds; it cannot independently represent 1-minute pre-warning, 2-minute alarm, and 30-second alarm-removal timeout.
  • video_recognition_local receives /api/webhook/cold-display-guard payloads as generic JSON and forwards them to MQTT; new event names should remain transparent, but tests should lock this behavior.
  • store_data_platform currently treats time_alarm and batch_pending_disposal as warning notifications, and only warning_escalated triggers CRM penalty submission. This must change so time_pre_warning is the warning, time_alarm is the alert reminder, and alarm_removal_timeout triggers CRM submission.
  • On 10.8.0.23, active containers include cold-display-guard-*, video-recognition, and mosquitto; video-recognition runs with host networking, while cold-display-guard-api runs on its Compose network.
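The intended platform-side semantics can be summarized as a lookup table. store_data_platform itself is Go; this Python sketch only illustrates the mapping. The category strings are hypothetical, and it assumes `alarm_removal_timeout` replaces `warning_escalated` as the sole CRM trigger rather than being added alongside it.

```python
from typing import Optional

# Hypothetical platform-side categories; actual store_data_platform names may differ.
NOTIFICATION_BY_EVENT = {
    "time_pre_warning": "warning",
    "time_alarm": "alert_reminder",
}

# Assumption: alarm_removal_timeout replaces warning_escalated as the CRM trigger.
CRM_PENALTY_EVENTS = frozenset({"alarm_removal_timeout"})


def notification_for(event: str) -> Optional[str]:
    """Notification category for a camera-side event, or None if not notified."""
    return NOTIFICATION_BY_EVENT.get(event)


def should_submit_penalty(event: str) -> bool:
    """True only for events that should trigger CRM penalty submission."""
    return event in CRM_PENALTY_EVENTS
```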

Local Verification

  • Cold-display full Python suite passed: PYTHONPATH=src python3 -m unittest discover -s tests -v (98 tests).
  • video_recognition_local cold-display focused tests passed: go test ./internal/server ./internal/mqtt ./cmd -run 'TestColdDisplayGuard|Test.*ColdDisplayGuard' -count=1.
  • store_data_platform display-cabinet service focused tests passed: go test ./store_data/service -run 'Test.*StoreDisplayCabinet|TestResolveStoreDisplayCabinet.*|TestShouldSubmitStoreDisplayCabinetPenalty|TestBuildStoreDisplayCabinet.*' -count=1.

Deployment Review

  • Synced only these cold-display source files to xiaozheng@10.8.0.23:/home/xiaozheng/cold_display_guard/src/cold_display_guard/: models.py, config.py, engine.py, cases.py, webhooks.py.
  • Backed up the remote source files and live config/example.toml before deployment.
  • Updated the live remote thresholds to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, and trash_confirmation_seconds=30.
  • Updated the live remote Webhook target from the unreachable old host to http://10.8.0.23:8080/api/webhook/cold-display-guard.
  • Rebuilt cold-display-guard:dev and restarted only cold-display-guard-api and cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • GET /api/manage/config returned the four expected threshold values and the new Webhook target.
    • Container-side synthetic engine run emitted batch_started, time_pre_warning, time_alarm, alarm_removal_timeout, then batch_pending_disposal plus batch_discarded.
    • Natural runtime log emitted alarm_removal_timeout for batch_000881 at 2026-06-15T11:52:20+08:00.
    • Webhook delivery for that event returned HTTP 200 from video-recognition.
    • video_recognition_local result JSONL recorded both alarm_removal_timeout batch and case events.
    • MQTT probe confirmed video-recognition published to video/cold-display-guard/result/cold-display-guard with device_identifier=cold-display-guard.
  • store_data_platform is not deployed on 10.8.0.23 under that repository name or as an identifiable container; platform handling changes were completed and verified in the local repository.
  • The cold-display retry queue has no pending entries; old 192.168.5.103 failures are already dead-letter history.
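The JSONL check for a given batch can be scripted as below. This is a sketch, assuming each line of the video_recognition_local result file is the webhook payload itself at the top level, with `event`, `batch_id`, and `kind` keys as in the base payload; if the receiver wraps payloads, the lookup needs adjusting. `kinds_for_event` is a hypothetical helper.

```python
import json
from typing import Iterable


def kinds_for_event(lines: Iterable[str], event: str, batch_id: str) -> set:
    """Collect `kind` values of JSONL records matching event and batch_id."""
    kinds = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if not isinstance(record, dict):
            continue
        if record.get("event") == event and record.get("batch_id") == batch_id:
            kinds.add(record.get("kind"))
    return kinds
```

For the natural event above, the expectation would be that both the batch and the case record are found, for example by passing the opened result file with `event="alarm_removal_timeout"` and `batch_id="batch_000881"`.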