cold_display_guard/tasks/todo.md

Task Todo

  • Review the current project instructions and check for task-relevant lessons.
  • Inspect the OTA upload API document and current runtime/webhook capture path.
  • Create an isolated worktree for alarm snapshot upload implementation.
  • Write the detailed implementation plan to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • Execute alarm snapshot upload client TDD cycle.
  • Execute runtime and webhook payload integration TDD cycle.
  • Update config surface, docs, and verification notes.
  • Run targeted verification and final full verification.

Notes

  • tasks/lessons.md is absent in this repository/worktree, so there were no prior session lessons to review.
  • Upload API reference: /Users/glo/code/go/wenma/ai_manager/zd-ai-manager/chunk-upload-oss-service/UPLOAD_API.md
  • User-provided upload target: https://ota.zhengxinshipin.com
  • User-provided token secret: change-me-in-production

Review

  • Plan saved to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • Chosen implementation keeps snapshot upload entirely outside BatchEngine and enriches webhook payloads from the runtime side using the already captured frame.
  • Implemented src/cold_display_guard/alarm_snapshots.py for JPEG encoding plus OTA chunk-upload orchestration, runtime integration in src/cold_display_guard/main.py, webhook payload enrichment in src/cold_display_guard/webhooks.py, config exposure/secret stripping in src/cold_display_guard/config.py and src/cold_display_guard/manage_api.py, and config/doc updates in config/example.toml and README_zh.md.
  • Targeted verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_alarm_snapshots.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_main.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_webhooks.py tests/test_config.py tests/test_manage_api.py -v
  • Final verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest discover -s tests -v
    • cd web && pnpm install --frozen-lockfile && pnpm build

Current Task: Webhook Payload Field Gap Check

  • Pull the actual payload currently received by video-recognition and compare it against the required event list fields.
  • Patch webhook payload builders to include the missing non-store fields required by the downstream table.
  • Add or update focused webhook tests for the enriched payload shape.
  • Run targeted verification and record the result here.

Current Findings

  • Current received payload only includes batch_id, camera_id, event, kind, severity, source_id, state, ts, zone_id, and zone_label.
  • Missing or not explicitly populated for the downstream event table: event code, camera IP, batch start time, removal time, dwell duration, discard flag, discard time, create time, alarm time, and update time.

Field Gap Verification

  • Actual receiver payload before the fix, from video-recognition result JSONL on 10.8.0.11, confirmed only the base fields above and did not include the downstream table time/discard/IP fields.
  • Updated src/cold_display_guard/webhooks.py so both batch_event and case_event now include:
    • event_code
    • camera_ip
    • started_at
    • ended_at
    • removed_at
    • dwell_seconds
    • is_discarded
    • discarded_at
    • created_at
    • alerted_at
    • alarm_at
    • updated_at
  • case_event also now carries the missing contextual fields camera_id, zone_id, and zone_label.
  • Verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_webhooks.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v
  • Deployed updated code to xiaozheng@10.8.0.11 without overwriting the remote config/example.toml, rebuilt cold-display-guard:dev, and restarted only cold-display-guard-api plus cold-display-guard-runtime.
  • Natural post-deploy traffic did not arrive during the 2-minute observation window, so final runtime verification used the deployed container to build representative batch/case webhook payloads with the live remote config and confirmed camera_ip = 192.168.3.4 plus all new downstream fields were present.
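
The field gap check above can be sketched as a small helper. This is only an illustration, not the code in src/cold_display_guard/webhooks.py: the field names come from the list above, while enrich_payload and missing_fields are hypothetical names, and filling unknown values with None is an assumption.

```python
# Downstream table fields taken from the list above.
DOWNSTREAM_FIELDS = (
    "event_code", "camera_ip", "started_at", "ended_at", "removed_at",
    "dwell_seconds", "is_discarded", "discarded_at", "created_at",
    "alerted_at", "alarm_at", "updated_at",
)


def enrich_payload(base: dict, extra: dict) -> dict:
    """Return a copy of base with every downstream field present.

    Hypothetical helper: fields without a known value are set to None
    so the receiver always sees the same payload shape.
    """
    payload = dict(base)
    for name in DOWNSTREAM_FIELDS:
        payload[name] = extra.get(name)
    return payload


def missing_fields(payload: dict) -> list:
    """List the downstream fields that are absent from a payload."""
    return [name for name in DOWNSTREAM_FIELDS if name not in payload]


base = {"batch_id": "b1", "event": "time_alarm", "camera_id": "cam-1"}
enriched = enrich_payload(base, {"camera_ip": "192.168.3.4", "dwell_seconds": 130})
print(missing_fields(base))      # every downstream field is missing
print(missing_fields(enriched))  # nothing is missing
```

A check like missing_fields could run against a line of the receiver's result JSONL to confirm the deployed payload shape.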

Current Task: Deploy To 192.168.5.103

  • Inspect the existing deployment layout and active containers on xiaozheng@192.168.5.103.
  • Verify the exact webhook route on that host before writing config.
  • Sync the current project code to the remote deployment directory without overwriting the live RTSP and calibration config.
  • Configure the remote webhook settings for the local video-recognition receiver.
  • Rebuild and restart the remote API/runtime containers, then verify health and outbound webhook configuration.

Deployment Findings

  • Existing deployment path on 192.168.5.103 is /home/xiaozheng/cold_display_guard, not ~/apps/cold-display-guard/app.
  • The host already runs cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web on ports 19080 and 23000.
  • The same host also runs video-recognition, and a direct probe to http://127.0.0.1:8080/api/webhook/cold-display-guard returned 200 OK, so this is the verified webhook target for this environment.

Deployment Verification

  • From inside the running cold-display-guard-api container on 192.168.5.103:
    • http://host.docker.internal:8080/api/webhook/cold-display-guard failed DNS resolution.
    • http://172.17.0.1:8080/api/webhook/cold-display-guard returned 200 OK.
    • http://192.168.5.103:8080/api/webhook/cold-display-guard returned 200 OK.
  • The configured webhook target was set to http://192.168.5.103:8080/api/webhook/cold-display-guard for both event_url and case_url.
  • Remote config was enriched to include:
    • case_sink
    • alarm_snapshot_upload
    • webhook_retry_sink
    • webhook_delivery_sink
    • webhooks
  • Code sync used rsync with config/example.toml excluded so the live RTSP URL and calibration polygons were preserved.
  • Remote rebuild/restart completed for cold-display-guard-api and cold-display-guard-runtime.
  • Verified after restart:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok
    • GET http://127.0.0.1:19080/api/manage/config showed webhooks.enabled=true
    • event_url and case_url both active on http://192.168.5.103:8080/api/webhook/cold-display-guard
    • alarm_snapshot_upload.enabled=true
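
The reachability probes above could be repeated with a short script run inside the container. This is a sketch under assumptions: the notes do not say which HTTP method the original probe used, so a plain GET is assumed, and probe and first_reachable are hypothetical names.

```python
import urllib.error
import urllib.request


def probe(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return the HTTP status code, or a short error string on failure."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status.
        return exc.code
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return f"error: {exc}"


def first_reachable(urls, **kwargs):
    """Return the first URL that answers 200, or None if none do."""
    for url in urls:
        if probe(url, **kwargs) == 200:
            return url
    return None


CANDIDATES = [
    "http://host.docker.internal:8080/api/webhook/cold-display-guard",
    "http://172.17.0.1:8080/api/webhook/cold-display-guard",
    "http://192.168.5.103:8080/api/webhook/cold-display-guard",
]
```

Calling first_reachable(CANDIDATES) only reports the first candidate that responds; choosing the host IP over the bridge gateway, as was done here, remains an operator decision.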

Current Task: Alarm Snapshot Calibration Overlay

Goal: Webhook-linked uploaded alarm snapshots should visually include the calibrated cold display zones and trash confirmation ROI from the current config.

Design: Keep the existing runtime flow intact: capture current RTSP frame, process events, then upload an alarm snapshot only for warning/alarm events. Before JPEG encoding, build overlay regions from [[zones]] plus [trash].roi, clamp normalized polygon coordinates to the image bounds, draw a semi-transparent fill and visible outline directly onto a copied Frame.rgb, and pass that annotated frame to the existing encoder/uploader. Do not change BatchEngine, Webhook payload shape, OTA upload protocol, or management snapshot capture.
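
The clamp-and-blend step in this design can be sketched with the standard library alone. The sketch assumes the frame is a flat RGB byte buffer with known width and height, which may not match the real Frame type; to_pixels, inside, and overlay are hypothetical names, and only the semi-transparent fill is shown (the outline is omitted).

```python
def to_pixels(points, width, height):
    """Clamp normalized (x, y) pairs to [0, 1] and scale to pixel space."""
    out = []
    for x, y in points:
        cx = min(max(x, 0.0), 1.0)
        cy = min(max(y, 0.0), 1.0)
        out.append((cx * width, cy * height))
    return out


def inside(px, py, poly):
    """Even-odd ray casting test for a point against a polygon."""
    hit = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                hit = not hit
    return hit


def overlay(rgb, width, height, points, color, alpha=0.35):
    """Return a copy of rgb with the polygon blended in; rgb is not modified."""
    out = bytearray(rgb)
    poly = to_pixels(points, width, height)
    for y in range(height):
        for x in range(width):
            # Test the pixel center against the polygon.
            if inside(x + 0.5, y + 0.5, poly):
                i = (y * width + x) * 3
                for c in range(3):
                    out[i + c] = round(out[i + c] * (1 - alpha) + color[c] * alpha)
    return bytes(out)


frame = bytes(4 * 4 * 3)  # 4x4 black frame
left_half = overlay(frame, 4, 4, [(0, 0), (0.5, 0), (0.5, 1), (0, 1)], (255, 0, 0), alpha=0.4)
```

Working on a copy keeps the original buffer available for downstream processing, which is the property the design calls for.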

  • Review task-relevant lessons and current dirty worktree.
  • Inspect alarm_snapshots.py, main.py, config polygon shape, and existing tests.
  • Write a failing unit test proving alert snapshot upload encodes an annotated frame when zones/trash ROI are configured.
  • Write focused unit tests for polygon overlay behavior using a tiny RGB frame.
  • Run targeted tests and confirm the new tests fail for the expected missing overlay behavior.
  • Implement the smallest standard-library overlay helper in src/cold_display_guard/alarm_snapshots.py.
  • Wire capture_alert_snapshot to apply configured overlays before JPEG encoding.
  • Run targeted snapshot/runtime tests.
  • Run the full Python test suite.

Review

  • Added apply_calibration_overlay in src/cold_display_guard/alarm_snapshots.py to draw configured food-zone polygons in yellow and the trash ROI in red onto a copied frame before JPEG encoding and OTA upload.
  • The overlay clamps normalized coordinates to image bounds, draws semi-transparent fills plus outlines, and leaves the original Frame.rgb unchanged for downstream runtime processing.
  • capture_alert_snapshot now encodes the annotated frame when warning/alarm events trigger snapshot upload; non-alert events and disabled upload behavior are unchanged.
  • Targeted verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
  • Full verification passed:
    • PYTHONPATH=src python3 -m unittest discover -s tests -v

Current Task: Deploy Overlay Update To 10.8.0.23

Goal: Deploy the alarm snapshot calibration overlay change to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration config or unrelated local changes.

Plan: Inspect the remote deployment layout first, confirm which containers are active, sync only the runtime source file required for the overlay change, rebuild/restart the API/runtime services that use the Python image, and verify both service health and the deployed source code.

  • Inspect remote deployment directory, Docker/Compose files, and active containers on xiaozheng@10.8.0.23.
  • Confirm the remote config file remains present and is not overwritten.
  • Sync src/cold_display_guard/alarm_snapshots.py to the remote deployment path.
  • Rebuild and restart only the affected cold-display-guard-api and cold-display-guard-runtime services when Compose is available.
  • Verify management API health after restart.
  • Verify the deployed remote source contains apply_calibration_overlay.

Deployment Review

  • Remote deployment path confirmed as /home/xiaozheng/cold_display_guard.
  • Active services before deployment: cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web.
  • Remote live config/example.toml was checked before and after deployment and was not overwritten.
  • Synced only src/cold_display_guard/alarm_snapshots.py to avoid deploying unrelated local web/nginx.conf changes.
  • Created a timestamped backup of the previous remote alarm_snapshots.py beside the source file before syncing.
  • Rebuilt cold-display-guard:dev with docker compose --env-file deploy/cold-display-guard.env -f deploy/docker-compose.yml build cold-display-guard-api.
  • Restarted only cold-display-guard-api and cold-display-guard-runtime with Compose; cold-display-guard-web remained untouched.
  • Verification passed:
    • curl http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • docker exec cold-display-guard-api python3 -c ... confirmed apply_calibration_overlay exists in the running image with signature (frame, config) -> Frame.
    • API and runtime logs show normal startup after restart.

Current Task: Update Timing Parameters On 10.8.0.23

Goal: Adjust the live timing settings on xiaozheng@10.8.0.23 per operator request.

Applied mapping: The current application has no separate pre-warning threshold. It supports max_dwell_seconds for the time alarm/overdue threshold and trash_confirmation_seconds for the disposal confirmation window before warning escalation. Applied max_dwell_seconds = 120 and trash_confirmation_seconds = 30.

  • Back up /home/xiaozheng/cold_display_guard/config/example.toml.
  • Update [thresholds].max_dwell_seconds from 300 to 120.
  • Update [thresholds].trash_confirmation_seconds from 120 to 30.
  • Restart cold-display-guard-api and cold-display-guard-runtime.
  • Verify /api/manage/health.
  • Verify /api/manage/config returns {"max_dwell_seconds": 120, "trash_confirmation_seconds": 30}.

Timing Update Review

  • Remote config was edited in place after creating a timestamped backup.
  • cold-display-guard-api and cold-display-guard-runtime were explicitly restarted with Docker Compose.
  • cold-display-guard-web was not restarted.
  • Verification passed:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • GET http://127.0.0.1:19080/api/manage/config returned max_dwell_seconds = 120 and trash_confirmation_seconds = 30.
    • Container status showed cold-display-guard-api healthy and cold-display-guard-runtime running after restart.
  • Note: the requested pre-warning duration (预警时长) of 1 min is not independently configurable in the current codebase; supporting a distinct pre-warning at 60 seconds and overdue alarm at 120 seconds would require a code change.

Current Task: Pre-Warning Alarm Flow And Full Webhook/MQTT Chain

Goal: Implement the requested camera-side timing flow, deploy it to xiaozheng@10.8.0.23, and verify the Webhook -> video_recognition_local -> MQTT -> store_data_platform chain.

Design: Keep all timing decisions inside cold_display_guard.BatchEngine. Add separate thresholds for pre-warning, alarm, and alarm-removal timeout; emit explicit lifecycle events so downstream services do not infer camera-side timers. Keep video_recognition_local as a transparent Webhook/MQTT bridge, and update store_data_platform only where event names map to notifications, case types, and CRM penalty submission.
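
The three-threshold timing in this design can be sketched as a pure function over dwell time. This is not BatchEngine itself: Thresholds and due_events are hypothetical names, and the sketch assumes the alarm-removal window starts when the alarm fires, which the notes do not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    pre_warning_seconds: float = 60
    max_dwell_seconds: float = 120
    alarm_removal_seconds: float = 30


def due_events(dwell_seconds, already_emitted, thresholds):
    """Return lifecycle events that are due and not yet emitted, in order.

    Assumption: alarm_removal_timeout fires alarm_removal_seconds after
    the alarm threshold, measured on the same dwell clock.
    """
    schedule = [
        ("time_pre_warning", thresholds.pre_warning_seconds),
        ("time_alarm", thresholds.max_dwell_seconds),
        ("alarm_removal_timeout",
         thresholds.max_dwell_seconds + thresholds.alarm_removal_seconds),
    ]
    return [
        name for name, at in schedule
        if dwell_seconds >= at and name not in already_emitted
    ]


T = Thresholds()
emitted = set()
for dwell in (30, 60, 90, 120, 150):
    for event in due_events(dwell, emitted, T):
        emitted.add(event)
        print(dwell, event)
```

Emitting explicit event names at each threshold is what lets the downstream bridge and platform stay free of their own timers.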

  • Review task-relevant instructions, lessons, and dirty worktree.
  • Inspect the current cold-display engine, case store, webhook payload, and tests.
  • Inspect video_recognition_local cold-display Webhook receiver and MQTT publisher.
  • Inspect store_data_platform cold-display MQTT consumer, notification mapping, and CRM submission trigger.
  • Inspect xiaozheng@10.8.0.23 active containers and deployment paths.
  • Add failing cold-display engine/case/config/webhook tests for time_pre_warning, pre_warning_handled, time_alarm, and alarm_removal_timeout.
  • Implement the camera-side state machine and config fields.
  • Add/adjust video_recognition_local passthrough tests for the new event names.
  • Add/adjust store_data_platform tests and mappings for new event semantics.
  • Run local targeted and full relevant verification.
  • Deploy changed services to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration secrets.
  • Update the remote timing config to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, trash_confirmation_seconds=30.
  • Verify remote Webhook target reachability from the cold-display container to local video-recognition.
  • Observe cold-display, video-recognition, MQTT, and platform logs; record the result.

Current Findings

  • cold_display_guard currently has only max_dwell_seconds and trash_confirmation_seconds; it cannot independently represent 1-minute pre-warning, 2-minute alarm, and 30-second alarm-removal timeout.
  • video_recognition_local receives /api/webhook/cold-display-guard payloads as generic JSON and forwards them to MQTT; new event names should remain transparent, but tests should lock this behavior.
  • store_data_platform currently treats time_alarm and batch_pending_disposal as warning notifications, and only warning_escalated triggers CRM penalty submission. This must change so time_pre_warning is the warning, time_alarm is the alert reminder, and alarm_removal_timeout triggers CRM submission.
  • On 10.8.0.23, active containers include cold-display-guard-*, video-recognition, and mosquitto; video-recognition runs with host networking, while cold-display-guard-api runs on its Compose network.
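
The intended platform-side mapping from the findings above can be summarized in a few lines. The real store_data_platform code is Go; this Python sketch only illustrates the target semantics, and the action names and helper functions are hypothetical.

```python
# Target semantics from the findings above; action names are illustrative.
EVENT_ACTIONS = {
    "time_pre_warning": "warning_notification",
    "time_alarm": "alert_reminder",
    "alarm_removal_timeout": "crm_penalty_submission",
}


def action_for(event_name):
    """Return the platform action for an event, or "ignore" if unmapped."""
    return EVENT_ACTIONS.get(event_name, "ignore")


def should_submit_penalty(event_name):
    """Only the alarm-removal timeout should trigger CRM submission."""
    return action_for(event_name) == "crm_penalty_submission"
```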

Local Verification

  • Cold-display full Python suite passed: PYTHONPATH=src python3 -m unittest discover -s tests -v (98 tests).
  • video_recognition_local cold-display focused tests passed: go test ./internal/server ./internal/mqtt ./cmd -run 'TestColdDisplayGuard|Test.*ColdDisplayGuard' -count=1.
  • store_data_platform display-cabinet service focused tests passed: go test ./store_data/service -run 'Test.*StoreDisplayCabinet|TestResolveStoreDisplayCabinet.*|TestShouldSubmitStoreDisplayCabinetPenalty|TestBuildStoreDisplayCabinet.*' -count=1.

Deployment Review

  • Synced only these cold-display source files to xiaozheng@10.8.0.23:/home/xiaozheng/cold_display_guard/src/cold_display_guard/: models.py, config.py, engine.py, cases.py, webhooks.py.
  • Backed up the remote source files and live config/example.toml before deployment.
  • Updated the live remote thresholds to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, and trash_confirmation_seconds=30.
  • Updated the live remote Webhook target from the unreachable old host to http://10.8.0.23:8080/api/webhook/cold-display-guard.
  • Rebuilt cold-display-guard:dev and restarted only cold-display-guard-api and cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • GET /api/manage/config returned the four expected threshold values and the new Webhook target.
    • Container-side synthetic engine run emitted batch_started, time_pre_warning, time_alarm, alarm_removal_timeout, then batch_pending_disposal plus batch_discarded.
    • Natural runtime log emitted alarm_removal_timeout for batch_000881 at 2026-06-15T11:52:20+08:00.
    • Webhook delivery for that event returned HTTP 200 from video-recognition.
    • video_recognition_local result JSONL recorded both alarm_removal_timeout batch and case events.
    • MQTT probe confirmed video-recognition published to video/cold-display-guard/result/cold-display-guard with device_identifier=cold-display-guard.
  • store_data_platform is not deployed on 10.8.0.23 under that repository name or as an identifiable container; platform handling changes were completed and verified in the local repository.
  • The cold-display retry queue has no pending entries; the old 192.168.5.103 delivery failures have already moved to dead-letter history.

Current Task: Alarm Snapshot Labels And Zone Colors

Goal: Uploaded alarm screenshots should show each calibrated region name directly on the image, and different cold-display zones should use different overlay colors.

Design: Extend the existing standard-library overlay path. Keep drawing configured polygons before JPEG upload, but carry a display label for each region, choose a stable color from a fixed palette by zone order, and draw a small high-contrast text label inside the polygon. Keep trash ROI red and labeled separately.
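
Choosing a stable color by zone order could look like the sketch below. The palette values and the zone_colors helper are hypothetical; only the ideas of a fixed palette indexed by zone order and a red trash ROI come from the design.

```python
# Hypothetical palette; the actual colors in alarm_snapshots.py may differ.
PALETTE = [
    (255, 214, 0),    # yellow
    (0, 170, 255),    # blue
    (0, 200, 120),    # green
    (200, 120, 255),  # purple
]
TRASH_COLOR = (255, 0, 0)  # the trash ROI stays red


def zone_colors(zone_ids):
    """Map each zone id to a palette color by its position in the config.

    The same config order always yields the same colors; the palette
    wraps around when there are more zones than colors.
    """
    return {zid: PALETTE[i % len(PALETTE)] for i, zid in enumerate(zone_ids)}


colors = zone_colors(["zone_a", "zone_b", "zone_c", "zone_d", "zone_e"])
```

Indexing by config order rather than hashing the zone id keeps colors predictable for operators, at the cost of colors shifting if zones are reordered.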

  • Inspect the current calibration overlay helper and tests.
  • Add failing tests for per-zone colors and visible region labels.
  • Implement labels and stable zone color palette.
  • Run snapshot tests and full Python tests.
  • Deploy the overlay update to xiaozheng@10.8.0.23.
  • Verify remote API/runtime health and deployed overlay helper.

Review

  • apply_calibration_overlay now assigns each cold-display zone a stable color from a fixed palette and keeps the trash ROI red.
  • Each overlay region now carries a label and draws a small high-contrast label box directly on the frame before JPEG encoding/upload.
  • The built-in label renderer covers common on-site labels such as 区域 ("zone") followed by digits, for example 区域 1, and 垃圾区 ("trash area"), plus basic ASCII for custom numeric/English labels.
  • Verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v (99 tests)
  • Deployed src/cold_display_guard/alarm_snapshots.py to xiaozheng@10.8.0.23 after backing up the previous remote file.
  • Rebuilt cold-display-guard:dev and restarted cold-display-guard-api plus cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • Container-side overlay smoke test confirmed two zones render different RGB values and label text pixels are present.

Current Task: Alarm Snapshot Chinese Label Rendering Fix

Goal: Fix unreadable/garbled Chinese region names on uploaded alarm screenshots while keeping per-zone colors and fallback labeling robust.

Design: Use a real CJK font renderer for Chinese labels in the alarm snapshot overlay path. Install Noto CJK fonts in the runtime image, render labels through ffmpeg drawtext when the font is available, and fall back to readable ASCII labels if the font renderer is unavailable.
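
The fallback rule in this design can be sketched as follows. The font path is the one reported later in these notes, and the R1/TRASH fallback text follows the task list; pick_font_file and label_text are hypothetical names, and the ffmpeg drawtext call itself is not shown.

```python
import os

# Path reported for the rebuilt image; other images may install it elsewhere.
CANDIDATE_FONTS = (
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
)


def pick_font_file(candidates=CANDIDATE_FONTS, exists=os.path.exists):
    """Return the first candidate font file that exists, or None."""
    for path in candidates:
        if exists(path):
            return path
    return None


def label_text(label, index, is_trash, font_path):
    """Choose the text to draw for one region.

    With a CJK font available, or for pure-ASCII labels, the configured
    label is used as-is. Otherwise fall back to readable ASCII text.
    """
    if font_path is not None or label.isascii():
        return label
    return "TRASH" if is_trash else f"R{index}"


font = pick_font_file()
print(label_text("区域 1", 1, False, font))
```

Passing the font path in as an argument keeps the decision testable without touching the filesystem.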

  • Reproduce and identify the likely root cause: remote container only matched DejaVu for zh-cn, so Chinese labels had no real CJK font path.
  • Add regression tests for Docker CJK font installation and readable ASCII fallback labels.
  • Update Dockerfile to install fonts-noto-cjk.
  • Update alarm_snapshots.py to prefer CJK font rendering and use R1/TRASH fallback text when needed.
  • Run focused and full local Python verification.
  • Deploy Dockerfile and alarm_snapshots.py to xiaozheng@10.8.0.23 without overwriting live config.
  • Rebuild/restart cold-display-guard-api and cold-display-guard-runtime.
  • Verify remote API/runtime health, CJK font availability, overlay smoke behavior, and runtime logs.

Review

  • Root cause was the screenshot overlay path not having a real Chinese font renderer in the deployed image; the container matched DejaVu before this fix.
  • The rebuilt remote container now reports NotoSansCJK-Regular.ttc: "Noto Sans CJK SC" "Regular" for fc-match :lang=zh-cn.
  • Remote overlay smoke test confirmed find_cjk_font_file() returns /usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc, Chinese labels change the frame, bright label pixels are present, and different regions retain distinct colors.
  • Local verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v (101 tests)
  • Remote verification passed:
    • GET /api/manage/health returned status=ok, runtime_status=running, and version=dev.
    • cold-display-guard-api is healthy and cold-display-guard-runtime is running after restart.
    • Runtime logs show normal startup after the restart.