Task Todo

Current Task: Runtime/API Case State Reopen Fix

Goal: When the management API marks a display-cabinet case as handled, the runtime process must not later append a newer open snapshot for the same case from stale in-memory state.

  • Add a failing regression test for API-written handled state being preserved when runtime persists later events.
  • Fix runtime case persistence to reconcile with the latest JSONL snapshots before applying new events.
  • Run targeted case/runtime tests.
  • Record remote chain verification and deployment status.

Findings

  • On xiaozheng@10.8.0.23, case_batch_000911 was marked handled at 2026-06-15T07:27:12Z; about 11 minutes later, at 2026-06-15T15:38:03+08:00 (07:38:03Z), runtime appended a newer open snapshot for the same case.
  • The API and runtime are separate processes sharing logs/cases.jsonl; runtime keeps a long-lived CaseStore loaded at startup and did not see the API-written handled snapshot.
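
The reconcile-before-append idea can be sketched as follows. This is a minimal illustration, not the code in main.py: the helper names, the case_id key, and the plain-dict snapshot shape are assumptions; only the status and handled_source fields echo the remote check output recorded below.

```python
import json
from pathlib import Path


def latest_snapshots(path: Path) -> dict:
    """Return the last snapshot per case_id from an append-only JSONL file."""
    latest = {}
    if not path.exists():
        return latest
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            snapshot = json.loads(line)
            latest[snapshot["case_id"]] = snapshot  # later lines win
    return latest


def persist_case_update(path: Path, in_memory: dict, case_id: str) -> bool:
    """Append the in-memory snapshot unless the file already closed the case.

    Returns True when a new snapshot line was appended.
    """
    on_disk = latest_snapshots(path).get(case_id)
    if on_disk is not None and on_disk.get("status") == "handled":
        # Another process (for example the API) closed the case, so adopt
        # its state instead of reopening the case from stale memory.
        in_memory[case_id] = on_disk
        return False
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(in_memory[case_id]) + "\n")
    return True
```

Re-reading the whole file on every persist is the simplest option for a sketch; a real store would more likely track a file offset or modification time.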

Verification

  • RED:

    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests.test_main.RuntimeRestoreTests.test_persist_case_updates_preserves_api_handled_snapshot -v
    • Result before fix: failed because runtime appended a later open snapshot.
  • Local targeted verification:

    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests.test_main.RuntimeRestoreTests.test_persist_case_updates_preserves_api_handled_snapshot -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_cases.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_main.py -v
    • Result: all passed.
  • Remote deployment:

    • Synced only src/cold_display_guard/main.py to xiaozheng@10.8.0.23:/home/xiaozheng/cold_display_guard/src/cold_display_guard/main.py.
    • Ran docker compose --env-file deploy/cold-display-guard.env -f deploy/docker-compose.yml up -d --build cold-display-guard-runtime.
    • Compose recreated cold-display-guard-api and cold-display-guard-runtime; health check returned status=ok.
  • Remote behavior check:

    • Ran the same API-handled/runtime-later-event scenario inside cold-display-guard-runtime using a temp JSONL file.
    • Result: {"handled_source": "manual", "latest_status": "handled", "new_snapshots": 0}.
  • Review the current project instructions and check for task-relevant lessons.
  • Inspect the OTA upload API document and current runtime/webhook capture path.
  • Create an isolated worktree for alarm snapshot upload implementation.
  • Write the detailed implementation plan to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • Execute alarm snapshot upload client TDD cycle.
  • Execute runtime and webhook payload integration TDD cycle.
  • Update config surface, docs, and verification notes.
  • Run targeted verification and final full verification.

Notes

  • tasks/lessons.md is absent in this repository/worktree, so there were no prior session lessons to review.
  • Upload API reference: /Users/glo/code/go/wenma/ai_manager/zd-ai-manager/chunk-upload-oss-service/UPLOAD_API.md
  • User-provided upload target: https://ota.zhengxinshipin.com
  • User-provided token secret: change-me-in-production

Review

  • Plan saved to docs/superpowers/plans/2026-06-09-alarm-snapshot-upload.md.
  • Chosen implementation keeps snapshot upload entirely outside BatchEngine and enriches webhook payloads from the runtime side using the already captured frame.
  • Implemented src/cold_display_guard/alarm_snapshots.py for JPEG encoding plus OTA chunk-upload orchestration, runtime integration in src/cold_display_guard/main.py, webhook payload enrichment in src/cold_display_guard/webhooks.py, config exposure/secret stripping in src/cold_display_guard/config.py and src/cold_display_guard/manage_api.py, and config/doc updates in config/example.toml and README_zh.md.
  • Targeted verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_alarm_snapshots.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_main.py -v
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest tests/test_webhooks.py tests/test_config.py tests/test_manage_api.py -v
  • Final verification passed:
    • eval "$(/opt/homebrew/bin/pyenv init -)" && PYTHONPATH=src python -m unittest discover -s tests -v
    • cd web && pnpm install --frozen-lockfile && pnpm build

Current Task: Webhook Payload Field Gap Check

  • Pull the actual payload currently received by video-recognition and compare it against the required event list fields.
  • Patch webhook payload builders to include the missing non-store fields required by the downstream table.
  • Add or update focused webhook tests for the enriched payload shape.
  • Run targeted verification and record the result here.

Current Findings

  • Current received payload only includes batch_id, camera_id, event, kind, severity, source_id, state, ts, zone_id, and zone_label.
  • Missing or not explicitly populated for the downstream event table: event code, camera IP, batch start time, removal time, dwell duration, discard flag, discard time, create time, alarm time, and update time.

Field Gap Verification

  • Actual receiver payload before the fix, from video-recognition result JSONL on 10.8.0.11, confirmed only the base fields above and did not include the downstream table time/discard/IP fields.
  • Updated src/cold_display_guard/webhooks.py so both batch_event and case_event now include:
    • event_code
    • camera_ip
    • started_at
    • ended_at
    • removed_at
    • dwell_seconds
    • is_discarded
    • discarded_at
    • created_at
    • alerted_at
    • alarm_at
    • updated_at
  • case_event also now carries the missing contextual fields camera_id, zone_id, and zone_label.
  • Verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_webhooks.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v
  • Deployed updated code to xiaozheng@10.8.0.11 without overwriting the remote config/example.toml, rebuilt cold-display-guard:dev, and restarted only cold-display-guard-api plus cold-display-guard-runtime.
  • Natural post-deploy traffic did not arrive during the 2-minute observation window, so final runtime verification used the deployed container to build representative batch/case webhook payloads with the live remote config and confirmed camera_ip = 192.168.3.4 plus all new downstream fields were present.
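
A field gap check of this kind can be sketched as a small helper that reports which downstream fields a received payload lacks. The field names come from the list above; the function name and the sample payload values are hypothetical.

```python
# Downstream table fields that batch_event and case_event should now carry.
DOWNSTREAM_FIELDS = (
    "event_code", "camera_ip", "started_at", "ended_at", "removed_at",
    "dwell_seconds", "is_discarded", "discarded_at", "created_at",
    "alerted_at", "alarm_at", "updated_at",
)


def missing_downstream_fields(payload: dict) -> list:
    """Return the downstream field names absent from a webhook payload.

    A key that is present with a None value counts as populated here,
    because some timestamps (for example discarded_at) may be legitimately empty.
    """
    return [name for name in DOWNSTREAM_FIELDS if name not in payload]


# Hypothetical pre-fix payload containing only base fields.
before_fix = {"batch_id": "batch_000001", "event": "time_alarm", "zone_id": "1"}
print(missing_downstream_fields(before_fix))
```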

Current Task: Deploy To 192.168.5.103

  • Inspect the existing deployment layout and active containers on xiaozheng@192.168.5.103.
  • Verify the exact webhook route on that host before writing config.
  • Sync the current project code to the remote deployment directory without overwriting the live RTSP and calibration config.
  • Configure the remote webhook settings for the local video-recognition receiver.
  • Rebuild and restart the remote API/runtime containers, then verify health and outbound webhook configuration.

Deployment Findings

  • Existing deployment path on 192.168.5.103 is /home/xiaozheng/cold_display_guard, not ~/apps/cold-display-guard/app.
  • The host already runs cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web on ports 19080 and 23000.
  • The same host also runs video-recognition, and a direct probe to http://127.0.0.1:8080/api/webhook/cold-display-guard returned 200 OK, so this is the verified webhook target for this environment.

Deployment Verification

  • From inside the running cold-display-guard-api container on 192.168.5.103:
    • http://host.docker.internal:8080/api/webhook/cold-display-guard failed DNS resolution.
    • http://172.17.0.1:8080/api/webhook/cold-display-guard returned 200 OK.
    • http://192.168.5.103:8080/api/webhook/cold-display-guard returned 200 OK.
  • The configured webhook target was set to http://192.168.5.103:8080/api/webhook/cold-display-guard for both event_url and case_url.
  • Remote config was enriched to include:
    • case_sink
    • alarm_snapshot_upload
    • webhook_retry_sink
    • webhook_delivery_sink
    • webhooks
  • Code sync used rsync with config/example.toml excluded so the live RTSP URL and calibration polygons were preserved.
  • Remote rebuild/restart completed for cold-display-guard-api and cold-display-guard-runtime.
  • Verified after restart:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok
    • GET http://127.0.0.1:19080/api/manage/config showed webhooks.enabled=true
    • event_url and case_url both active on http://192.168.5.103:8080/api/webhook/cold-display-guard
    • alarm_snapshot_upload.enabled=true

Current Task: Alarm Snapshot Calibration Overlay

Goal: Webhook-linked uploaded alarm snapshots should visually include the calibrated cold display zones and trash confirmation ROI from the current config.

Design: Keep the existing runtime flow intact: capture current RTSP frame, process events, then upload an alarm snapshot only for warning/alarm events. Before JPEG encoding, build overlay regions from [[zones]] plus [trash].roi, clamp normalized polygon coordinates to the image bounds, draw a semi-transparent fill and visible outline directly onto a copied Frame.rgb, and pass that annotated frame to the existing encoder/uploader. Do not change BatchEngine, Webhook payload shape, OTA upload protocol, or management snapshot capture.
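
The clamp-then-blend step in this design can be sketched with the standard library alone. This is an illustration under assumptions, not the helper in alarm_snapshots.py: the frame is modeled as rows of (r, g, b) tuples rather than the real Frame.rgb layout, all function names are hypothetical, and the outline pass is omitted for brevity.

```python
def clamp01(value: float) -> float:
    """Clamp a normalized coordinate into [0, 1]."""
    return min(1.0, max(0.0, value))


def point_in_polygon(px: float, py: float, points: list) -> bool:
    """Even-odd ray casting test for a point against a polygon."""
    inside = False
    j = len(points) - 1
    for i in range(len(points)):
        xi, yi = points[i]
        xj, yj = points[j]
        if (yi > py) != (yj > py):
            x_cross = xi + (py - yi) * (xj - xi) / (yj - yi)
            if px < x_cross:
                inside = not inside
        j = i
    return inside


def fill_polygon(rgb_rows: list, polygon: list, color: tuple, alpha: float = 0.2) -> list:
    """Return a copy of the frame with a semi-transparent polygon fill."""
    height, width = len(rgb_rows), len(rgb_rows[0])
    # Clamp normalized coordinates, then scale to pixel space.
    points = [(clamp01(x) * width, clamp01(y) * height) for x, y in polygon]
    out = [list(row) for row in rgb_rows]  # the input frame stays untouched
    for y in range(height):
        for x in range(width):
            if point_in_polygon(x + 0.5, y + 0.5, points):  # test pixel centers
                out[y][x] = tuple(
                    round(b * (1 - alpha) + c * alpha) for b, c in zip(out[y][x], color)
                )
    return out
```

Testing pixel centers keeps the fill inside the polygon edges; a per-pixel scan is slow on full frames but adequate for a sketch and for tiny test frames.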

  • Review task-relevant lessons and current dirty worktree.
  • Inspect alarm_snapshots.py, main.py, config polygon shape, and existing tests.
  • Write a failing unit test proving alert snapshot upload encodes an annotated frame when zones/trash ROI are configured.
  • Write focused unit tests for polygon overlay behavior using a tiny RGB frame.
  • Run targeted tests and confirm the new tests fail for the expected missing overlay behavior.
  • Implement the smallest standard-library overlay helper in src/cold_display_guard/alarm_snapshots.py.
  • Wire capture_alert_snapshot to apply configured overlays before JPEG encoding.
  • Run targeted snapshot/runtime tests.
  • Run the full Python test suite.

Review

  • Added apply_calibration_overlay in src/cold_display_guard/alarm_snapshots.py to draw configured food-zone polygons in yellow and the trash ROI in red onto a copied frame before JPEG encoding and OTA upload.
  • The overlay clamps normalized coordinates to image bounds, draws semi-transparent fills plus outlines, and leaves the original Frame.rgb unchanged for downstream runtime processing.
  • capture_alert_snapshot now encodes the annotated frame when warning/alarm events trigger snapshot upload; non-alert events and disabled upload behavior are unchanged.
  • Targeted verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest tests/test_main.py -v
  • Full verification passed:
    • PYTHONPATH=src python3 -m unittest discover -s tests -v

Current Task: Deploy Overlay Update To 10.8.0.23

Goal: Deploy the alarm snapshot calibration overlay change to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration config or unrelated local changes.

Plan: Inspect the remote deployment layout first, confirm which containers are active, sync only the runtime source file required for the overlay change, rebuild/restart the API/runtime services that use the Python image, and verify both service health and the deployed source code.

  • Inspect remote deployment directory, Docker/Compose files, and active containers on xiaozheng@10.8.0.23.
  • Confirm the remote config file remains present and is not overwritten.
  • Sync src/cold_display_guard/alarm_snapshots.py to the remote deployment path.
  • Rebuild and restart only the affected cold-display-guard-api and cold-display-guard-runtime services when Compose is available.
  • Verify management API health after restart.
  • Verify the deployed remote source contains apply_calibration_overlay.

Deployment Review

  • Remote deployment path confirmed as /home/xiaozheng/cold_display_guard.
  • Active services before deployment: cold-display-guard-api, cold-display-guard-runtime, and cold-display-guard-web.
  • Remote live config/example.toml was checked before and after deployment and was not overwritten.
  • Synced only src/cold_display_guard/alarm_snapshots.py to avoid deploying unrelated local web/nginx.conf changes.
  • Created a timestamped backup of the previous remote alarm_snapshots.py beside the source file before syncing.
  • Rebuilt cold-display-guard:dev with docker compose --env-file deploy/cold-display-guard.env -f deploy/docker-compose.yml build cold-display-guard-api.
  • Restarted only cold-display-guard-api and cold-display-guard-runtime with Compose; cold-display-guard-web remained untouched.
  • Verification passed:
    • curl http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • docker exec cold-display-guard-api python3 -c ... confirmed apply_calibration_overlay exists in the running image with signature (frame, config) -> Frame.
    • API and runtime logs show normal startup after restart.

Current Task: Update Timing Parameters On 10.8.0.23

Goal: Adjust the live timing settings on xiaozheng@10.8.0.23 per operator request.

Applied mapping: The current application has no separate pre-warning threshold. It supports max_dwell_seconds for the time alarm/overdue threshold and trash_confirmation_seconds for the disposal confirmation window before warning escalation. Applied max_dwell_seconds = 120 and trash_confirmation_seconds = 30.

  • Back up /home/xiaozheng/cold_display_guard/config/example.toml.
  • Update [thresholds].max_dwell_seconds from 300 to 120.
  • Update [thresholds].trash_confirmation_seconds from 120 to 30.
  • Restart cold-display-guard-api and cold-display-guard-runtime.
  • Verify /api/manage/health.
  • Verify /api/manage/config returns {"max_dwell_seconds": 120, "trash_confirmation_seconds": 30}.

Timing Update Review

  • Remote config was edited in place after creating a timestamped backup.
  • cold-display-guard-api and cold-display-guard-runtime were explicitly restarted with Docker Compose.
  • cold-display-guard-web was not restarted.
  • Verification passed:
    • GET http://127.0.0.1:19080/api/manage/health returned status=ok and runtime_status=running.
    • GET http://127.0.0.1:19080/api/manage/config returned max_dwell_seconds = 120 and trash_confirmation_seconds = 30.
    • Container status showed cold-display-guard-api healthy and cold-display-guard-runtime running after restart.
  • Note: the requested pre-warning duration (预警时长) of 1 minute is not independently configurable in the current codebase; supporting a distinct pre-warning at 60 seconds and overdue alarm at 120 seconds would require a code change.

Current Task: Pre-Warning Alarm Flow And Full Webhook/MQTT Chain

Goal: Implement the requested camera-side timing flow, deploy it to xiaozheng@10.8.0.23, and verify the Webhook -> video_recognition_local -> MQTT -> store_data_platform chain.

Design: Keep all timing decisions inside cold_display_guard.BatchEngine. Add separate thresholds for pre-warning, alarm, and alarm-removal timeout; emit explicit lifecycle events so downstream services do not infer camera-side timers. Keep video_recognition_local as a transparent Webhook/MQTT bridge, and update store_data_platform only where event names map to notifications, case types, and CRM penalty submission.
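
The three-threshold timing can be sketched as a schedule lookup. This is not BatchEngine itself: the class and function names are hypothetical, and measuring alarm_removal_seconds from the moment the alarm fires is an assumption about the intended semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    pre_warning_seconds: float = 60
    max_dwell_seconds: float = 120
    alarm_removal_seconds: float = 30


def due_events(dwell_seconds: float, thresholds: Thresholds, already_emitted: set) -> list:
    """Return lifecycle events that are due at this dwell time and not yet emitted."""
    schedule = [
        ("time_pre_warning", thresholds.pre_warning_seconds),
        ("time_alarm", thresholds.max_dwell_seconds),
        # Assumption: the removal timeout counts from the alarm threshold.
        ("alarm_removal_timeout",
         thresholds.max_dwell_seconds + thresholds.alarm_removal_seconds),
    ]
    return [
        name for name, fires_at in schedule
        if dwell_seconds >= fires_at and name not in already_emitted
    ]
```

Emitting each event by name, instead of exposing raw dwell seconds, is what lets downstream services avoid re-deriving the camera-side timers.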

  • Review task-relevant instructions, lessons, and dirty worktree.
  • Inspect the current cold-display engine, case store, webhook payload, and tests.
  • Inspect video_recognition_local cold-display Webhook receiver and MQTT publisher.
  • Inspect store_data_platform cold-display MQTT consumer, notification mapping, and CRM submission trigger.
  • Inspect xiaozheng@10.8.0.23 active containers and deployment paths.
  • Add failing cold-display engine/case/config/webhook tests for time_pre_warning, pre_warning_handled, time_alarm, and alarm_removal_timeout.
  • Implement the camera-side state machine and config fields.
  • Add/adjust video_recognition_local passthrough tests for the new event names.
  • Add/adjust store_data_platform tests and mappings for new event semantics.
  • Run local targeted and full relevant verification.
  • Deploy changed services to xiaozheng@10.8.0.23 without overwriting live RTSP/calibration secrets.
  • Update the remote timing config to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, trash_confirmation_seconds=30.
  • Verify remote Webhook target reachability from the cold-display container to local video-recognition.
  • Observe cold-display, video-recognition, MQTT, and platform logs; record the result.

Current Findings

  • cold_display_guard currently has only max_dwell_seconds and trash_confirmation_seconds; it cannot independently represent 1-minute pre-warning, 2-minute alarm, and 30-second alarm-removal timeout.
  • video_recognition_local receives /api/webhook/cold-display-guard payloads as generic JSON and forwards them to MQTT; new event names should remain transparent, but tests should lock this behavior.
  • store_data_platform currently treats time_alarm and batch_pending_disposal as warning notifications, and only warning_escalated triggers CRM penalty submission. This must change so time_pre_warning is the warning, time_alarm is the alert reminder, and alarm_removal_timeout triggers CRM submission.
  • On 10.8.0.23, active containers include cold-display-guard-*, video-recognition, and mosquitto; video-recognition runs with host networking, while cold-display-guard-api runs on its Compose network.

Local Verification

  • Cold-display full Python suite passed: PYTHONPATH=src python3 -m unittest discover -s tests -v (98 tests).
  • video_recognition_local cold-display focused tests passed: go test ./internal/server ./internal/mqtt ./cmd -run 'TestColdDisplayGuard|Test.*ColdDisplayGuard' -count=1.
  • store_data_platform display-cabinet service focused tests passed: go test ./store_data/service -run 'Test.*StoreDisplayCabinet|TestResolveStoreDisplayCabinet.*|TestShouldSubmitStoreDisplayCabinetPenalty|TestBuildStoreDisplayCabinet.*' -count=1.

Deployment Review

  • Synced only these cold-display source files to xiaozheng@10.8.0.23:/home/xiaozheng/cold_display_guard/src/cold_display_guard/: models.py, config.py, engine.py, cases.py, webhooks.py.
  • Backed up the remote source files and live config/example.toml before deployment.
  • Updated the live remote thresholds to pre_warning_seconds=60, max_dwell_seconds=120, alarm_removal_seconds=30, and trash_confirmation_seconds=30.
  • Updated the live remote Webhook target from the unreachable old host to http://10.8.0.23:8080/api/webhook/cold-display-guard.
  • Rebuilt cold-display-guard:dev and restarted only cold-display-guard-api and cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • GET /api/manage/config returned the four expected threshold values and the new Webhook target.
    • Container-side synthetic engine run emitted batch_started, time_pre_warning, time_alarm, alarm_removal_timeout, then batch_pending_disposal plus batch_discarded.
    • Natural runtime log emitted alarm_removal_timeout for batch_000881 at 2026-06-15T11:52:20+08:00.
    • Webhook delivery for that event returned HTTP 200 from video-recognition.
    • video_recognition_local result JSONL recorded both alarm_removal_timeout batch and case events.
    • MQTT probe confirmed video-recognition published to video/cold-display-guard/result/cold-display-guard with device_identifier=cold-display-guard.
  • store_data_platform is not deployed on 10.8.0.23 under that repository name or as an identifiable container; platform handling changes were completed and verified in the local repository.
  • The cold-display retry queue has no pending entries; old 192.168.5.103 failures are already dead-letter history.

Current Task: Alarm Snapshot Labels And Zone Colors

Goal: Uploaded alarm screenshots should show each calibrated region name directly on the image, and different cold-display zones should use different overlay colors.

Design: Extend the existing standard-library overlay path. Keep drawing configured polygons before JPEG upload, but carry a display label for each region, choose a stable color from a fixed palette by zone order, and draw a small high-contrast text label inside the polygon. Keep trash ROI red and labeled separately.
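
Choosing a stable color by zone order can be sketched as an index into a fixed palette. The palette values and the function name are illustrative assumptions; only the red trash ROI comes from the notes above.

```python
# Hypothetical fixed palette; red is deliberately excluded so zone colors
# never collide with the trash ROI color.
ZONE_PALETTE = [
    (255, 214, 0),
    (0, 200, 255),
    (0, 220, 120),
    (255, 128, 0),
    (190, 90, 255),
]
TRASH_COLOR = (255, 0, 0)


def assign_zone_colors(zone_ids: list) -> dict:
    """Map each zone to a palette color by its position in the config order."""
    return {
        zone_id: ZONE_PALETTE[index % len(ZONE_PALETTE)]
        for index, zone_id in enumerate(zone_ids)
    }
```

Because the color depends only on config order, the same zone keeps its color across snapshots as long as the zone list is unchanged; colors repeat once there are more zones than palette entries.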

  • Inspect the current calibration overlay helper and tests.
  • Add failing tests for per-zone colors and visible region labels.
  • Implement labels and stable zone color palette.
  • Run snapshot tests and full Python tests.
  • Deploy the overlay update to xiaozheng@10.8.0.23.
  • Verify remote API/runtime health and deployed overlay helper.

Review

  • apply_calibration_overlay now assigns each cold-display zone a stable color from a fixed palette and keeps the trash ROI red.
  • Each overlay region now carries a label and draws a small high-contrast label box directly on the frame before JPEG encoding/upload.
  • The built-in label renderer covers common on-site labels such as 区域 (zone) followed by digits, for example 区域 1, and 垃圾区 (trash zone), plus basic ASCII for custom numeric/English labels.
  • Verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v (99 tests)
  • Deployed src/cold_display_guard/alarm_snapshots.py to xiaozheng@10.8.0.23 after backing up the previous remote file.
  • Rebuilt cold-display-guard:dev and restarted cold-display-guard-api plus cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • Container-side overlay smoke test confirmed two zones render different RGB values and label text pixels are present.

Current Task: Alarm Snapshot Chinese Label Rendering Fix

Goal: Fix unreadable/garbled Chinese region names on uploaded alarm screenshots while keeping per-zone colors and fallback labeling robust.

Design: Use a real CJK font renderer for Chinese labels in the alarm snapshot overlay path. Install Noto CJK fonts in the runtime image, render labels through ffmpeg drawtext when the font is available, and fall back to readable ASCII labels if the font renderer is unavailable.
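
The font-or-fallback decision can be sketched as below. The function names and the kind argument are hypothetical; the font path is the one reported by the remote check in the Review notes, and the R1/TRASH forms follow the fallback described in the task list.

```python
import os
import re

# Path reported on the rebuilt remote container; other images may differ.
CJK_FONT_CANDIDATES = ("/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",)


def find_font(candidates=CJK_FONT_CANDIDATES):
    """Return the first existing font file, or None when none is installed."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def ascii_fallback_label(label: str, kind: str) -> str:
    """Build a readable ASCII label when no CJK font is available."""
    if kind == "trash":
        return "TRASH"
    digits = re.findall(r"[0-9]+", label)
    if digits:
        return "R" + digits[0]  # 区域 1 becomes R1
    ascii_only = "".join(ch for ch in label if ch.isascii() and ch.isprintable()).strip()
    return ascii_only or "R"


def label_text(label: str, kind: str, font_path) -> str:
    """Use the configured label with a CJK font, otherwise the ASCII fallback."""
    return label if font_path else ascii_fallback_label(label, kind)
```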

  • Reproduce and identify the likely root cause: remote container only matched DejaVu for zh-cn, so Chinese labels had no real CJK font path.
  • Add regression tests for Docker CJK font installation and readable ASCII fallback labels.
  • Update Dockerfile to install fonts-noto-cjk.
  • Update alarm_snapshots.py to prefer CJK font rendering and use R1/TRASH fallback text when needed.
  • Run focused and full local Python verification.
  • Deploy Dockerfile and alarm_snapshots.py to xiaozheng@10.8.0.23 without overwriting live config.
  • Rebuild/restart cold-display-guard-api and cold-display-guard-runtime.
  • Verify remote API/runtime health, CJK font availability, overlay smoke behavior, and runtime logs.

Review

  • Root cause was the screenshot overlay path not having a real Chinese font renderer in the deployed image; the container matched DejaVu before this fix.
  • The rebuilt remote container now reports NotoSansCJK-Regular.ttc: "Noto Sans CJK SC" "Regular" for fc-match :lang=zh-cn.
  • Remote overlay smoke test confirmed find_cjk_font_file() returns /usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc, Chinese labels change the frame, bright label pixels are present, and different regions retain distinct colors.
  • Local verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v (101 tests)
  • Remote verification passed:
    • GET /api/manage/health returned status=ok, runtime_status=running, and version dev.
    • cold-display-guard-api is healthy and cold-display-guard-runtime is running after restart.
    • Runtime logs show normal startup after the restart.

Current Task: Investigate False Normal Consumption Events On 10.8.0.23

Goal: Determine why the live system records a normal consumption event about every two minutes with a dwell time near 13 seconds even when no one touched the cold display cabinet.

Debug plan: Inspect remote runtime/event/case/diagnostic logs first, correlate batch_started and batch_consumed pairs by zone and dwell time, then trace the vision metrics for those timestamps to identify whether the source is occupancy flicker, runtime restart state restoration, config thresholds, or downstream display interpretation.
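
The start/consume correlation step can be sketched as a pass over event records. The batch_id, event, ts, and zone_id keys match the webhook base fields noted earlier; treating ts as an ISO-8601 string with an explicit offset, the function name, and the 40-second cutoff are assumptions.

```python
from datetime import datetime


def short_batches(events: list, max_dwell_seconds: float = 40.0) -> list:
    """Pair batch_started with batch_consumed and keep suspiciously short batches.

    Returns (batch_id, zone_id, dwell_seconds) tuples in consume order.
    """
    started = {}
    results = []
    for record in events:
        ts = datetime.fromisoformat(record["ts"])
        if record["event"] == "batch_started":
            started[record["batch_id"]] = ts
        elif record["event"] == "batch_consumed" and record["batch_id"] in started:
            dwell = (ts - started.pop(record["batch_id"])).total_seconds()
            if dwell <= max_dwell_seconds:
                results.append((record["batch_id"], record["zone_id"], dwell))
    return results
```

Grouping the returned tuples by zone_id would show whether the short batches cluster on one zone, which helps separate occupancy flicker from a system-wide cause such as restarts.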

  • Inspect recent remote events and confirm the exact event names, zones, dwell seconds, and cadence.
  • Inspect runtime diagnostics around those timestamps for occupancy and vision metric flicker.
  • Inspect live config and runtime logs for sampling/stabilization settings and restarts.
  • Form and test a root-cause hypothesis before changing code or live thresholds.
  • Record findings, fix if needed, and verify with logs/tests.

Findings And Fix

  • The repeated records were real batch_started -> batch_consumed events from the camera-side engine, not a downstream display issue.
  • Before the fix, recent events showed repeated zone 1 batches ending after 13-33 seconds, matching the two-frame confirmation cadence at the current sampling rate.
  • Root cause had two parts:
    • Zone 1 was genuinely occupied, but its vision signal hovered around the old relative dark threshold, so short raw-occupancy dips were interpreted as item removal.
    • Zone 2 was occupied before or during baseline learning, so its relative difference from baseline stayed near zero and it was not detected as occupied.
  • Added occupancy_absolute_dark_fraction in src/cold_display_guard/vision.py, defaulting to 0.0 so existing configs are unchanged unless they opt in.
  • Updated the live config on xiaozheng@10.8.0.23:
    • occupancy_dark_fraction = 0.12
    • occupancy_absolute_dark_fraction = 0.085
    • empty_confirm_frames = 6
  • Rebuilt and restarted cold-display-guard-api and cold-display-guard-runtime.
  • Verification:
    • Local full Python suite passed: PYTHONPATH=src python3 -m unittest discover -s tests -v (102 tests).
    • Remote health returned status=ok and runtime_status=running.
    • Remote container config shows the new thresholds.
    • After deployment, latest diagnostics stabilized at zone_counts = {"1": 1, "2": 1, "6": 1}.
  • During a two-minute observation window after 13:25, no new batch_consumed events were emitted; only expected pre-warning/alarm lifecycle events appeared for the occupied zones.
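
How the two thresholds and the longer empty confirmation interact can be sketched as follows. This is an interpretation, not the logic in vision.py: it assumes occupancy_dark_fraction is compared against the difference from the learned baseline and occupancy_absolute_dark_fraction against the raw dark fraction, and all names here are hypothetical.

```python
def is_occupied(dark_fraction: float, baseline_dark_fraction: float,
                relative_threshold: float = 0.12,
                absolute_threshold: float = 0.085) -> bool:
    """Raw per-frame occupancy from a relative and an optional absolute test."""
    relative_hit = (dark_fraction - baseline_dark_fraction) >= relative_threshold
    # An absolute threshold of 0.0 disables the absolute test (the default in the notes).
    absolute_hit = absolute_threshold > 0 and dark_fraction >= absolute_threshold
    return relative_hit or absolute_hit


class EmptyDebouncer:
    """Report a zone as empty only after N consecutive raw-empty frames."""

    def __init__(self, empty_confirm_frames: int = 6):
        self.required = empty_confirm_frames
        self.empty_streak = 0
        self.occupied = False

    def update(self, raw_occupied: bool) -> bool:
        if raw_occupied:
            self.empty_streak = 0
            self.occupied = True
        elif self.occupied:
            self.empty_streak += 1
            if self.empty_streak >= self.required:
                self.occupied = False
                self.empty_streak = 0
        return self.occupied
```

Under these assumptions the absolute test covers the zone 2 case, where the baseline was learned while occupied, and the debouncer absorbs the short dips seen on zone 1.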

Current Task: Reduce Alarm Snapshot Label Visual Obstruction

Goal: Region labels on uploaded alarm screenshots should be smaller and more transparent so operators can inspect the food/display image underneath.

Design: Keep the existing label content, placement, CJK font rendering, and per-zone colors. Only reduce the visual weight of the label layer by lowering font size, black label-box opacity, border width, and fallback label-box opacity.

  • Inspect current alarm snapshot label rendering style.
  • Add a regression test for smaller ffmpeg drawtext label style.
  • Reduce drawtext font size and label-box opacity.
  • Keep fallback label renderer visually consistent with the ffmpeg path.
  • Run full local verification.
  • Deploy the updated snapshot overlay style to xiaozheng@10.8.0.23.
  • Verify remote runtime health and deployed label style.

Notes

  • Targeted snapshot test passed: PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v.
  • Full local verification passed: PYTHONPATH=src python3 -m unittest discover -s tests -v (103 tests).
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • Running container uses fontsize=13, boxcolor=black@0.34, and boxborderw=2 for region labels.
    • cold-display-guard-runtime logs show normal startup after restart.
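
For reference, a drawtext filter string with the verified style values could be assembled roughly like this. The function is a hypothetical sketch, not the project's renderer: it uses standard ffmpeg drawtext options and, rather than attempting ffmpeg's multi-level escaping, rejects values that would need it.

```python
# Characters that would need ffmpeg filtergraph or option escaping.
UNSAFE_CHARS = set("\\':%,;[]")


def _checked(value: str) -> str:
    """Reject values this sketch cannot embed safely in a filter string."""
    bad = UNSAFE_CHARS.intersection(value)
    if bad:
        raise ValueError(f"unsupported characters in drawtext value: {sorted(bad)}")
    return value


def drawtext_filter(text: str, x: int, y: int, fontfile: str,
                    fontsize: int = 13, box_opacity: float = 0.34,
                    box_border: int = 2) -> str:
    """Build a drawtext filter with a small, mostly transparent label box."""
    options = [
        f"fontfile='{_checked(fontfile)}'",
        f"text='{_checked(text)}'",
        f"x={int(x)}",
        f"y={int(y)}",
        f"fontsize={int(fontsize)}",
        "fontcolor=white",
        "box=1",
        f"boxcolor=black@{box_opacity}",
        f"boxborderw={int(box_border)}",
    ]
    return "drawtext=" + ":".join(options)
```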

Current Task: Limit Alert Snapshot Overlay To Event Zones

Goal: Uploaded warning/alarm screenshots should only draw the cold-display region polygons and names for the zones that actually triggered the warning/alarm event. Other configured zones and the trash ROI should not be drawn on those uploaded screenshots.

Plan: Keep the full calibration overlay helper available for tests and general use, but pass alert event zone IDs from capture_alert_snapshot into the overlay loader and disable trash ROI drawing for alert uploads.
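
The filtering described in this plan can be sketched as a small selector. The region dict shape (kind, zone_id) and the function name are assumptions for illustration; with no filter arguments, every region is kept, matching the preserved full-overlay behavior.

```python
def select_overlay_regions(regions: list, zone_ids=None, include_trash: bool = True) -> list:
    """Keep only the regions that should be drawn on a snapshot.

    zone_ids=None means no zone filter; alert uploads would pass the event
    zone IDs and include_trash=False.
    """
    selected = []
    for region in regions:
        if region["kind"] == "trash":
            if include_trash:
                selected.append(region)
        elif zone_ids is None or region["zone_id"] in zone_ids:
            selected.append(region)
    return selected
```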

  • Add a regression test proving alert snapshot upload only annotates the triggering event zone.
  • Filter snapshot overlay regions by event zone_id during alert upload.
  • Preserve full overlay behavior when apply_calibration_overlay is called without filters.
  • Run full local Python verification.
  • Deploy alarm_snapshots.py to xiaozheng@10.8.0.23.
  • Verify remote API/runtime health and deployed filtered-overlay behavior.

Review

  • Local verification passed:
    • PYTHONPATH=src python3 -m unittest tests/test_alarm_snapshots.py -v
    • PYTHONPATH=src python3 -m unittest discover -s tests -v (104 tests)
  • Deployed only src/cold_display_guard/alarm_snapshots.py to xiaozheng@10.8.0.23 after backing up the previous remote file; live config was not overwritten.
  • Rebuilt cold-display-guard:dev and restarted cold-display-guard-api plus cold-display-guard-runtime.
  • Remote verification passed:
    • GET /api/manage/health returned status=ok and runtime_status=running.
    • Container-side smoke test for a zone-1 alert returned zone1_changed=True, zone2_unchanged=True, and trash_unchanged=True.
    • API/runtime logs show normal startup after restart.

Current Task: Check Webhook Duplicate Delivery

Goal: Verify whether cold_display_guard is sending duplicate Webhook requests to video-recognition on xiaozheng@10.8.0.23.

Investigation: Compare the sending code path, remote webhook delivery audit, retry queue state, cold-display event/case logs, video-recognition HTTP logs, and the receiver-side JSONL payloads.

  • Inspect sender code path for direct event/case delivery and retry drain behavior.
  • Confirm remote Webhook config uses the same URL for event_url and case_url.
  • Check sender delivery audit for duplicate receiver task_id values.
  • Check the retry queue for pending items that could redeliver an event that was already delivered successfully.
  • Check receiver-side cold-display JSONL for duplicate payloads and duplicate business keys.
  • Trace the only coarse duplicate-looking case around batch_000898.

Review

  • Current remote config sends both batch_event and case_event to http://10.8.0.23:8080/api/webhook/cold-display-guard, so one business transition can produce two HTTP POSTs to the same endpoint with different kind values.
  • Sender audit logs/webhook_delivery.jsonl contains 3056 records total; recent valid delivery has 321 direct ok records and 0 retry ok records.
  • Receiver-returned task_id values are unique: 321 unique task IDs and 0 duplicate task IDs.
  • Retry queue has 547 latest retry items, all dead_letter; there are no pending retries.
  • Receiver-side video-recognition cold-display files for 2026-06-15 contain 181 business payloads; exact payload duplicates are 0, and fine-grained business key duplicates are 0.
  • Sender events.jsonl contains 3325 events; duplicate (batch_id, event, ts, zone_id) keys are 0.
  • The only coarse duplicate-looking receiver entry was batch_000898 at 13:20:26: the same frame emitted time_pre_warning and pre_warning_handled, which produced separate case_event actions created and handled. This is not the same Webhook request repeated.
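
The business-key duplicate check used in this review can be sketched as a counter over JSONL lines. The key fields match the (batch_id, event, ts, zone_id) tuple above; the function name and the sample records are hypothetical.

```python
import json
from collections import Counter


def duplicate_keys(jsonl_lines, fields=("batch_id", "event", "ts", "zone_id")) -> dict:
    """Return business keys that occur more than once, with their counts."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[tuple(record.get(name) for name in fields)] += 1
    return {key: count for key, count in counts.items() if count > 1}
```

Keying on the event name as well as the timestamp is what keeps a same-frame pair such as time_pre_warning and pre_warning_handled from being flagged as a repeated request.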