fix: stabilize cold display occupancy detection

2026-06-15 13:40:20 +08:00
parent 1059850378
commit fa2c90e250
5 changed files with 68 additions and 1 deletions
--- a/tasks/lessons.md
+++ b/tasks/lessons.md
@@ -11,3 +11,9 @@
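  1. When overlaying Chinese text on screenshots, the image must install a verifiable CJK font package, and `fc-match :lang=zh-cn` must be run inside the container to confirm that it resolves to a CJK font.
  2. After deployment, run a Chinese-label overlay smoke test inside the target container to confirm that the real rendering path works, rather than only checking for pixel changes.
  3. When font rendering is unavailable, fallback text must be converted to readable ASCII identifiers, e.g. `区域 1` -> `R1`, `垃圾区` -> `TRASH`, instead of continuing to draw garbled Chinese.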
+
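+- 2026-06-15: When investigating on-site detection flicker, do not start by assuming that any zone is empty; once the user pointed out that zones 1, 2, and 6 all actually held items, it became clear that the original approach of simply raising the relative dark-region threshold would suppress real occupancy.
+  Prevention:
+  1. Before adjusting vision thresholds, align with the actual on-site state and establish whether each analyzed zone should currently be occupied or empty.
+  2. If items were already present in the startup baseline, do not rely on change relative to the baseline alone; use absolute features or recapture an empty baseline to detect them.
+  3. For false "normal consumption" reports, first check whether an occupied state is briefly dropping to empty, and handle the flicker with an empty-confirmation frame count or hysteresis rather than only raising the occupancy threshold.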
--- a/tasks/todo.md
+++ b/tasks/todo.md
@@ -284,3 +284,35 @@
  - `GET /api/manage/health` returned `status=ok`, `runtime_status=running`, and version `dev`.
  - `cold-display-guard-api` is healthy and `cold-display-guard-runtime` is running after restart.
  - Runtime logs show normal startup after the restart.
+
+## Current Task: Investigate False Normal Consumption Events On 10.8.0.23
+
+**Goal:** Determine why the live system records a normal consumption event about every two minutes with a dwell time near 13 seconds even when no one touched the cold display cabinet.
+
+**Debug plan:** Inspect remote runtime/event/case/diagnostic logs first, correlate `batch_started` and `batch_consumed` pairs by zone and dwell time, then trace the vision metrics for those timestamps to identify whether the source is occupancy flicker, runtime restart state restoration, config thresholds, or downstream display interpretation.
+
+- [ ] Inspect recent remote events and confirm the exact event names, zones, dwell seconds, and cadence.
+- [ ] Inspect runtime diagnostics around those timestamps for occupancy and vision metric flicker.
+- [ ] Inspect live config and runtime logs for sampling/stabilization settings and restarts.
+- [x] Form and test a root-cause hypothesis before changing code or live thresholds.
+- [x] Record findings, fix if needed, and verify with logs/tests.
+
+### Findings And Fix
+
+- The repeated records were real `batch_started` -> `batch_consumed` events from the camera-side engine, not a downstream display issue.
+- Before the fix, recent events showed repeated zone 1 batches ending after 13-33 seconds, matching the two-frame confirmation cadence at the current sampling rate.
+- Root cause had two parts:
+  - Zone 1 was genuinely occupied, but its vision signal hovered around the old relative dark threshold, so short raw-occupancy dips were interpreted as item removal.
+  - Zone 2 was occupied before or during baseline learning, so its relative difference from baseline stayed near zero and it was not detected as occupied.
+- Added `occupancy_absolute_dark_fraction` in `src/cold_display_guard/vision.py`, defaulting to `0.0` so existing configs are unchanged unless they opt in.
+- Updated the live config on `xiaozheng@10.8.0.23`:
+  - `occupancy_dark_fraction = 0.12`
+  - `occupancy_absolute_dark_fraction = 0.085`
+  - `empty_confirm_frames = 6`
+- Rebuilt and restarted `cold-display-guard-api` and `cold-display-guard-runtime`.
+- Verification:
+  - Local full Python suite passed: `PYTHONPATH=src python3 -m unittest discover -s tests -v` (`102` tests).
+  - Remote health returned `status=ok` and `runtime_status=running`.
+  - Remote container config shows the new thresholds.
+  - After deployment, latest diagnostics stabilized at `zone_counts = {"1": 1, "2": 1, "6": 1}`.
+  - During a two-minute observation window after `13:25`, no new `batch_consumed` events were emitted; only expected pre-warning/alarm lifecycle events appeared for the occupied zones.
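
The two-part fix described in the findings (an opt-in absolute dark threshold plus a longer empty confirmation) can be illustrated with a minimal sketch. This is not the code in `src/cold_display_guard/vision.py`: the function `raw_occupied`, the class `ZoneOccupancy`, and the way the dark fraction is compared against the baseline are assumptions made for illustration. Only the three config key names and their live values come from the notes above. The sketch also marks a zone occupied on the first raw hit, whereas the real engine may confirm occupancy over several frames.

```python
from dataclasses import dataclass


@dataclass
class OccupancyConfig:
    # Key names and values mirror the live config in the notes; semantics are assumed.
    occupancy_dark_fraction: float = 0.12            # relative threshold (vs. baseline)
    occupancy_absolute_dark_fraction: float = 0.085  # absolute threshold; 0.0 disables it
    empty_confirm_frames: int = 6                    # consecutive empty frames before clearing


def raw_occupied(dark_fraction: float, baseline_dark_fraction: float,
                 cfg: OccupancyConfig) -> bool:
    """Hypothetical per-frame decision: a relative OR an absolute hit counts."""
    relative_hit = (dark_fraction - baseline_dark_fraction) >= cfg.occupancy_dark_fraction
    # The absolute check is opt-in: the 0.0 default leaves behavior unchanged.
    absolute_hit = (cfg.occupancy_absolute_dark_fraction > 0.0
                    and dark_fraction >= cfg.occupancy_absolute_dark_fraction)
    return relative_hit or absolute_hit


class ZoneOccupancy:
    """Hypothetical debouncer: clear a zone only after N consecutive empty frames."""

    def __init__(self, cfg: OccupancyConfig) -> None:
        self.cfg = cfg
        self.occupied = False
        self._empty_streak = 0

    def update(self, dark_fraction: float, baseline_dark_fraction: float) -> bool:
        if raw_occupied(dark_fraction, baseline_dark_fraction, self.cfg):
            # Any occupied frame resets the empty streak, so short dips are absorbed.
            self._empty_streak = 0
            self.occupied = True
        elif self.occupied:
            self._empty_streak += 1
            if self._empty_streak >= self.cfg.empty_confirm_frames:
                self.occupied = False
                self._empty_streak = 0
        return self.occupied


if __name__ == "__main__":
    cfg = OccupancyConfig()
    # Zone 2 case: the item was already in the baseline, so the relative difference is 0.
    print(raw_occupied(0.10, 0.10, cfg))
    # Zone 1 case: a short dip of three empty frames does not clear the zone.
    zone = ZoneOccupancy(cfg)
    zone.update(0.20, 0.0)
    print([zone.update(0.01, 0.0) for _ in range(3)])
```

Under these assumptions, an item that was present during baseline learning is still detected through the absolute check, and a dip shorter than `empty_confirm_frames` frames no longer ends a batch.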