Add QSC prompt and phase timings

yangyl
2026-06-17 22:52:54 +08:00
parent ef0047af6d
commit 0150c1ab5c
6 changed files with 304 additions and 118 deletions

.gitignore (vendored, 1 line changed)

@@ -1,5 +1,6 @@
# Secrets and local credentials
access_token.md
config.yaml
.env
.env.*
*.pem


@@ -54,7 +54,7 @@ vlm:
chat_completions_path: /v1/chat/completions
model: memai-zhengxin-v3-20260413
timeout_seconds: 120
max_tokens: 512
max_tokens: 1024
temperature: 0
batch_size: 1
image_transport: data_uri
@@ -62,96 +62,140 @@ vlm:
prompt:
system: >-
You are an AI quality inspector and store monitoring assistant for a fried chicken cutlet (鸡排) production line and storefront.
Your task is to analyze a short video clip and output a structured JSON describing actions, quality statuses, errors, safety hazards, personnel (employees/guests), and the frame timestamp.
You are an AI quality inspector and store monitoring assistant for a fried chicken cutlet production line and storefront.
Your task is to analyze a short multi-frame video clip and output one strict JSON object. Preserve the existing action, quality, safety, people, guest, and timestamp fields, and additionally detect QSC violation events.
All 9 top-level keys below are REQUIRED in every response. Use the specified empty-value convention when a field does not apply — never omit a key.
Use only visual evidence from the provided frames. Do not guess hidden facts. If something is not clearly visible, output an empty value, unknown, or [] according to the schema.
All top-level keys below are REQUIRED in every response. Do not omit any key.
### 1. Action (REQUIRED)
Identify the primary action. Use the "Action_" prefix on every label except End_Frying. If no action is detected, output "Action_Idle".
### 1. Action
Identify the primary food-operation action.
Valid values: Action_Defrost / Action_Breading / Action_Resting / Action_Start_Frying / End_Frying / Action_Triming / Action_Cutting / Action_Seasoning / Action_Serving / Action_Idle.
If no clear food-operation action is detected, output Action_Idle.
### 2. quality_status (REQUIRED — "" if not applicable)
### 2. quality_status
Choose based on the action:
- Action_Breading: fully_covered | uneven
- Action_Resting: stacked | qualified
- Action_Start_Frying / End_Frying: standard_time | early_retrieval | overcooked | double_fried
- Action_Cutting: complete_cut | linked | dusted_before_cut
- Action_Seasoning: coverage_high | missed | single_side_dusted
- Other actions: qualified
If no ingredient is visible or the action has no applicable status, output an empty string.
- Action_Breading → fully_covered | uneven
### 3. error_type
Short description of legacy SOP operation anomaly only. Examples: dusted_before_cut, single_side_dusted, double_fried.
If the operation is normal or no legacy SOP error is visible, output an empty string.
QSC violations such as no mask, no hat, no gloves, tobacco, or foot-picking must be reported in qsc_events, not in error_type, unless they are also directly related to the legacy SOP operation.
- Action_Resting → stacked | qualified
### 4. 安全隐患
Chinese description of visible safety hazards in the scene. Example: 油锅附近有易燃物. If none, output an empty string.
- Action_Start_Frying / End_Frying → standard_time | early_retrieval | overcooked | double_fried
### 5. 人物位置
Chinese sentence describing where people are and how they are moving. Example: 员工在油锅边操作,顾客在收银台前等待. If no people are visible, output an empty string.
- Action_Cutting → complete_cut | linked | dusted_before_cut
### 6. 总结
Chinese sentence summarizing the scene and visible person count. Example: 画面中有2人1名员工在操作台处理食物1名顾客在收银台前等待. If no people are visible, output 无.
- Action_Seasoning → coverage_high | missed | single_side_dusted
### 7. 时间
The timestamp overlaid on the original video frame, in format YYYY-MM-DD HH:MM:SS. If the timestamp is not visible or cannot be read, output an empty string.
- Other actions → qualified
### 8. employees
Array of employee objects. If no employees are visible, output [].
Each employee object must contain:
- status: 1 if working at equipment, food, packing, counter, or operation table; 2 if standing idle, waiting, or passing by
- warning: 0 if no visible hazard; 1 if hazard present
- position: one of YZL_1 / LCCZT_1 / SYJ / DPL / BSZSG / DCGZT / KLJ / UNKNOWN
Position codes: YZL_1 = oil fryer area; LCCZT_1 = cooling or operation table; SYJ = cashier/register; DPL = electric fryer area; BSZSG = display cabinet; DCGZT = sink/washing area; KLJ = cola/drink machine; UNKNOWN = employee visible but position cannot be classified.
If no ingredient is visible or the action has no applicable status, output "".
### 9. guests
Array with the existing mixed-key schema. If no guests are visible, output [].
- First element is queue-level object only: {"warning": "0" or "1"}. 1 means queue has >= 3 visible guests; 0 means queue has < 3 visible guests.
- Subsequent elements are per-guest objects only: {"status": "0"} at door, {"status": "1"} at register, or {"status": "2"} seated.
### 10. qsc_events
Array of suspected QSC violation events. If no suspected violation is visible, output [].
Detect only the following current-period QSC violations:
### 3. error_type (REQUIRED — "" if no error)
QSC pre-scan rule: Before deciding the main food-operation Action, first scan the entire full-frame image sequence for QSC violations, including people in corners, background, seated/squatting/bending postures, and floor-level foot/shoe areas. QSC events must not be suppressed by a normal food-operation action.
Short description of any anomaly. Examples: "smoking", "dusted_before_cut", "single_side_dusted", "double_fried". If the operation is normal, output "".
- WGSJ0001: 工作状态未戴口罩
Definition: An employee is in working state and the mouth/nose mask is clearly absent, not worn, or not covering mouth/nose.
Working state includes frying food, making food, packing food, handling semi-finished products, touching food, operating food equipment, or working at a food operation table.
Non-working state includes passing by, resting, waiting, short stay, or standing without obvious operation. In non-working state, no-mask alone is NOT a violation.
- WGSJ0002: 工作状态未戴帽子
Definition: An employee is in working state and the required work hat/cap/hair covering is clearly absent. Apply the same working-state rule as WGSJ0001.
### 4. 安全隐患 (REQUIRED — "" if no hazard)
- WGSJ0003: 未戴手套操作食物
Definition: An employee directly touches, handles, makes, packs, cuts, seasons, or transfers food without visible gloves. If hands are not visible, do not report this violation.
Chinese description of any safety hazard visible in the scene (e.g., "油锅附近有易燃物"). If none, output "".
- WGSJ0004: 工作区烟草制品违规
Definition: Cigarette, e-cigarette, smoking behavior, lighter used for smoking, ashtray, or other tobacco product is visible in the food work area.
- WGSJ0005: foot/shoe touching violation
Chinese name for output: 抠脚或接触鞋脚.
Definition: Report WGSJ0005 ONLY when there is clear visual evidence that a hand, fingers, tissue, cloth, tool, or another object is directly touching a foot, toes, sole, sock, shoe, or footwear area, and the motion is picking, scratching, rubbing, wiping, cleaning, adjusting, or handling that foot/shoe area.
### 5. 人物位置 (REQUIRED — "" if no people)
Very strict rule:
- WGSJ0005 is NOT a posture detector. Do not report it from bending, squatting, standing, walking, leaning, or a hand being near the leg/foot.
- WGSJ0005 is NOT a "suspected" event. Do not output WGSJ0005 for manual_review unless the hand/object-to-foot/shoe contact is actually visible.
- If the evidence is only suspicious or ambiguous, output no WGSJ0005 event. Keep qsc_events as [] unless another violation is clearly visible.
Descriptive Chinese sentence of where people are and how they are moving. Example: "员工在油锅边". If no one is in the frame, output "".
Required positive criteria:
Output WGSJ0005 only when ALL of the following are true:
- The foot, shoe, sock, toes, sole, or footwear area is visible.
- The hand, fingers, tissue, cloth, tool, or object is visibly touching that foot/shoe area, not merely close to it.
- The contact is visible in at least two frames, or one frame is extremely clear.
- The action looks like picking, scratching, rubbing, wiping, cleaning, adjusting, or handling the foot/shoe area.
- It is not normal walking, standing, food handling, floor cleaning, picking up an item, moving equipment, or touching a table/container/apron/clothing.
Hard negative examples:
Do NOT report WGSJ0005 when any of these is true:
- A person is only standing near food, standing by a counter, or walking.
- Feet or shoes are visible but no hand/object is visibly touching them.
- A hand is at the table, food tray, oil pan, apron, waist, knee, pants, skirt, floor, trash bag, or equipment.
- A person bends or squats but the hand-foot/shoe contact cannot be clearly seen.
- The person is operating food, packing food, breading, seasoning, serving, cleaning the floor, picking up an item, or moving supplies.
- The foot/shoe area is too small, blurry, blocked, cropped, or outside the frame.
### 6. 总结 (REQUIRED — "无" if no people)
Output requirements for WGSJ0005:
- violation_type must be exactly "抠脚或接触鞋脚".
- reason must be Chinese and must explicitly say where the person is and what visible contact is seen.
- suggested_action must be "manual_review".
- confidence must be >= 0.80. If confidence would be below 0.80, do not output WGSJ0005.
- evidence_frame_count must be the number of frames where direct contact is visible.
- evidence_checklist must be exactly:
{"foot_or_shoe_area_visible": true/false, "direct_hand_or_object_contact_visible": true/false, "contact_visible_in_multiple_frames_or_extremely_clear": true/false, "foot_handling_motion_visible": true/false, "normal_activity_excluded": true/false}
Descriptive Chinese sentence summarizing the scene with the exact person count. Example: "员工在油锅边炸鸡,顾客在收银台前等待". If no one is in the frame, output "无".
Multi-frame rule:
- Do not rely on a single unclear frame.
- Judge qsc_events based on the whole clip and continuous multi-frame evidence.
- Prefer reporting a qsc_event only when the violation is visible in multiple frames, or when the visual evidence is very clear and consistent across the clip.
- If evidence is unclear, do not report the violation; keep qsc_events as [].
- For WGSJ0005, use the strictest threshold: only report it when direct hand/object-to-foot-or-shoe contact is clearly visible. If uncertain, do not report WGSJ0005.
Each qsc_events item must contain:
- violation_code: one of WGSJ0001 / WGSJ0002 / WGSJ0003 / WGSJ0004 / WGSJ0005
- violation_type: Chinese violation name
- is_violation: true
- working_state: working / non_working / unknown
- reason: concise Chinese explanation of the visible evidence
- confidence: number from 0 to 1
- evidence_frame_count: estimated number of frames supporting the event
- visible_target: concise Chinese description of the person/object involved
- evidence_checklist: for WGSJ0005 only, include {"foot_or_shoe_area_visible": true/false, "direct_hand_or_object_contact_visible": true/false, "contact_visible_in_multiple_frames_or_extremely_clear": true/false, "foot_handling_motion_visible": true/false, "normal_activity_excluded": true/false}; for other codes output {}
- suggested_action: record / warning / manual_review
Suggested action rules: WGSJ0001 and WGSJ0002 use warning; WGSJ0003 and WGSJ0004 use manual_review. WGSJ0005 uses manual_review only when direct hand/object-to-foot-or-shoe contact is clearly visible with confidence >= 0.80. If WGSJ0005 evidence is weak, suspicious, or ambiguous, do not output WGSJ0005.
### 7. 时间 (REQUIRED — "" if unreadable)
The timestamp overlaid on the original video frame, in format "YYYY-MM-DD HH:MM:SS". If the timestamp is not visible or cannot be read, output "".
### 8. employees (REQUIRED — [] if none)
Array of employee objects. Each object has ALL three keys:
- status: "1" (working at equipment) or "2" (standing idle)
- warning: "0" (no hazard) or "1" (hazard present)
- position: one of YZL_1 (油锅边), LCCZT_1 (平冷操作台边), SYJ (收银机边), DPL (电扒炉旁), BSZSG (展示柜边), DCGZT (水池边), KLJ (可乐机边).
If no employees are in the frame, output [].
### 9. guests (REQUIRED — [] if none, MIXED-KEY SCHEMA)
Array with a specific mixed-key convention:
- The FIRST element is a queue-level object with ONLY a "warning" key: {"warning": "0" or "1"}. "1" means the queue has ≥ 3 people; "0" means < 3.
- Subsequent elements are per-guest objects with ONLY a "status" key: {"status": "0"} (at door) or {"status": "1"} (at register) or {"status": "2"} (seated). One such object per visible guest.
If there are no guests at all, output []. If only the queue header is known, output [{"warning": "0 or 1"}].
Example: [{"warning": "0"}, {"status": "1"}, {"status": "2"}]
### Output format (strict JSON, all 9 keys REQUIRED)
{"Action": "<Action_Type>", "quality_status": "<status or empty>", "error_type": "<error or empty>", "安全隐患": "<hazard or empty>", "人物位置": "<location or empty>", "总结": "<summary or 无>", "时间": "<YYYY-MM-DD HH:MM:SS or empty>", "employees": [{"status": "<1 or 2>", "warning": "<0 or 1>", "position": "<code>"}], "guests": [{"warning": "<0 or 1>"}, {"status": "<0, 1, or 2>"}]}
Do not wrap the JSON in markdown fences. Do not add any prose before or after the JSON.
user: 'Analyze the video clip and return the required JSON with all 9 keys. Read the timestamp from the frame overlay into "时间".'
### Output format
Return strict JSON only. Do not wrap in markdown. Do not add any prose before or after the JSON.
Required JSON shape:
{"Action": "Action_Idle", "quality_status": "", "error_type": "", "安全隐患": "", "人物位置": "", "总结": "无", "时间": "", "employees": [], "guests": [], "qsc_events": []}
user: >-
Analyze this multi-frame video clip. Preserve the existing action, quality, safety, people, guest, and timestamp fields. Additionally detect current-period QSC violations in qsc_events. Return strict JSON only, with all required keys.
schema:
version: local-batch-v1
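
The per-item rules that the prompt states for qsc_events (allowed codes, required keys, the suggested_action mapping, and the 0.80 confidence floor for WGSJ0005) can be checked mechanically after parsing. The following is a minimal sketch only: `validate_qsc_event` and the constant names are hypothetical and are not part of this commit, which passes qsc_events through without field validation.

```python
from typing import Any

# Codes, keys, and action rules are copied from the prompt text above.
# validate_qsc_event is a hypothetical helper, not code from this commit.
VALID_CODES = {"WGSJ0001", "WGSJ0002", "WGSJ0003", "WGSJ0004", "WGSJ0005"}
REQUIRED_KEYS = {
    "violation_code", "violation_type", "is_violation", "working_state",
    "reason", "confidence", "evidence_frame_count", "visible_target",
    "evidence_checklist", "suggested_action",
}
EXPECTED_ACTION = {
    "WGSJ0001": "warning",
    "WGSJ0002": "warning",
    "WGSJ0003": "manual_review",
    "WGSJ0004": "manual_review",
    "WGSJ0005": "manual_review",
}


def validate_qsc_event(event: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the item looks valid."""
    problems: list[str] = []
    missing = REQUIRED_KEYS - event.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    code = event.get("violation_code")
    if code not in VALID_CODES:
        problems.append(f"unknown violation_code: {code!r}")
        return problems  # the remaining rules depend on a known code
    confidence = event.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence must be a number from 0 to 1")
    elif code == "WGSJ0005" and confidence < 0.80:
        problems.append("WGSJ0005 requires confidence >= 0.80")
    if event.get("suggested_action") != EXPECTED_ACTION[code]:
        problems.append(f"suggested_action should be {EXPECTED_ACTION[code]}")
    if code != "WGSJ0005" and event.get("evidence_checklist") != {}:
        problems.append("evidence_checklist must be {} for this code")
    return problems


example_event = {
    "violation_code": "WGSJ0001",
    "violation_type": "工作状态未戴口罩",
    "is_violation": True,
    "working_state": "working",
    "reason": "员工在操作台处理食物时未见口罩",
    "confidence": 0.92,
    "evidence_frame_count": 3,
    "visible_target": "操作台边员工",
    "evidence_checklist": {},
    "suggested_action": "warning",
}
print(validate_qsc_event(example_event))
```

A check like this would run after JSON parsing and before aggregation, so that items violating the prompt's own output rules can be logged or dropped instead of silently trusted.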


@@ -1269,6 +1269,20 @@ class CliTests(unittest.TestCase):
        self.assertEqual(folder_summary["processed_video_count"], 1)
        self.assertEqual(folder_summary["failed_video_count"], 0)
        self.assertEqual(folder_summary["event_counts"], {"queue_detected": 1})
        phase_timings = json.loads(
            (output_dir / "phase_timings.json").read_text(encoding="utf-8")
        )
        self.assertEqual(phase_timings["schema_version"], "phase-timings-v1")
        for phase in (
            "source_acquisition_seconds",
            "video_probe_seconds",
            "frame_sampling_seconds",
            "clip_generation_seconds",
            "inference_seconds",
            "aggregation_seconds",
        ):
            self.assertIn(phase, phase_timings["phases"])
            self.assertGreaterEqual(phase_timings["phases"][phase], 0)

if __name__ == "__main__":
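
The test above only asserts that phase_timings.json exists with the expected schema version and non-negative phase values. As a rough illustration of how such a report might be consumed, the sketch below ranks phases by duration. `summarize_phase_timings` is a hypothetical helper that is not part of this commit; it assumes only the file layout asserted in the test (`schema_version` plus a `phases` mapping of name to seconds).

```python
import json
from pathlib import Path


def summarize_phase_timings(path: Path) -> list[tuple[str, float, float]]:
    """Return (phase, seconds, percent_of_total) rows, slowest phase first."""
    report = json.loads(path.read_text(encoding="utf-8"))
    if report.get("schema_version") != "phase-timings-v1":
        raise ValueError("unexpected phase timings schema")
    phases = report.get("phases", {})
    total = sum(phases.values())
    rows = [
        # Guard against a zero total so an all-zero report does not divide by zero.
        (name, seconds, seconds / total * 100 if total else 0.0)
        for name, seconds in phases.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `summarize_phase_timings(output_dir / "phase_timings.json")` after a run would give a quick view of which pipeline phase dominates wall-clock time.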


@@ -104,6 +104,41 @@ class ResultParserTests(unittest.TestCase):
"2026-06-14 12:31:20",
)
    def test_build_clip_result_preserves_qsc_events(self):
        result = build_clip_result(
            (
                '{"Action":"Action_Idle","quality_status":"","error_type":"",'
                '"安全隐患":"","人物位置":"员工在操作台边","总结":"画面中有1人",'
                '"时间":"2026-06-16 05:00:03","employees":[],"guests":[],'
                '"qsc_events":[{"violation_code":"WGSJ0001",'
                '"violation_type":"工作状态未戴口罩","is_violation":true,'
                '"working_state":"working","reason":"员工在操作台处理食物时未见口罩",'
                '"confidence":0.92,"evidence_frame_count":3,'
                '"visible_target":"操作台边员工","evidence_checklist":{},'
                '"suggested_action":"warning"}]}'
            ),
            {
                "video_id": "video-abc",
                "clip_id": "video-abc_c000001",
                "clip_start_seconds": 0.0,
                "clip_end_seconds": 10.0,
                "clip_start_timecode": "00:00:00",
                "clip_end_timecode": "00:00:10",
                "frame_times": [],
            },
            {"path": "/videos/a.mp4"},
            {
                "schema": {"version": "local-batch-v1"},
                "runtime": {"timezone": "Asia/Shanghai"},
            },
            processing={},
        )

        self.assertEqual(result["status"], "ok")
        self.assertEqual(len(result["qsc_events"]), 1)
        self.assertEqual(result["qsc_events"][0]["violation_code"], "WGSJ0001")
        self.assertEqual(result["qsc_events"][0]["suggested_action"], "warning")

def test_build_clip_result_records_parse_failure_without_crashing(self):
result = build_clip_result(
"not json",
@@ -126,6 +161,7 @@ class ResultParserTests(unittest.TestCase):
self.assertEqual(result["status"], "parse_failed")
self.assertEqual(result["events"], [])
self.assertEqual(result["qsc_events"], [])
self.assertEqual(result["monitoring_timeline"]["screen_time"], "")
self.assertEqual(result["raw_response"], "not json")
self.assertIn("JSON", result["error"])


@@ -1,9 +1,12 @@
from __future__ import annotations
import argparse
from contextlib import contextmanager
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Sequence
from typing import Callable, Iterator, Sequence, TypeVar
from .aggregator import aggregate_outputs
from .clips import build_clip_records
@@ -18,6 +21,64 @@ from .result_parser import build_clip_result
from .timeline import DEFAULT_TIMEZONE, format_beijing_time, timeline_start_epoch
from .vlm_client import infer_clip
T = TypeVar("T")


def _new_phase_timings() -> dict[str, object]:
    return {
        "schema_version": "phase-timings-v1",
        "started_at": _utc_now_iso(),
        "updated_at": _utc_now_iso(),
        "phases": {},
    }


def _write_phase_timings(
    output_dir: Path,
    phase_timings: dict[str, object],
) -> None:
    phase_timings["updated_at"] = _utc_now_iso()
    (output_dir / "phase_timings.json").write_text(
        json.dumps(phase_timings, ensure_ascii=False, sort_keys=True, indent=2) + "\n",
        encoding="utf-8",
    )


def _measure_phase(
    phase_timings: dict[str, object] | None,
    phase_name: str,
    func: Callable[[], T],
) -> T:
    with _timed_phase(phase_timings, phase_name):
        return func()


@contextmanager
def _timed_phase(
    phase_timings: dict[str, object] | None,
    phase_name: str,
) -> Iterator[None]:
    started = time.perf_counter()
    try:
        yield
    finally:
        if phase_timings is not None:
            phases = phase_timings.get("phases")
            if not isinstance(phases, dict):
                phases = {}
                phase_timings["phases"] = phases
            previous = phases.get(phase_name, 0)
            if not isinstance(previous, (int, float)):
                previous = 0
            phases[phase_name] = round(
                float(previous) + time.perf_counter() - started,
                6,
            )


def _utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def main(argv: Sequence[str] | None = None) -> int:
parser = argparse.ArgumentParser(
@@ -43,6 +104,7 @@ def main(argv: Sequence[str] | None = None) -> int:
output_dir = Path(config["output"]["dir"])
output_dir.mkdir(parents=True, exist_ok=True)
phase_timings = _new_phase_timings()
video_manifest_path = output_dir / "video_manifest.jsonl"
resume_enabled = bool(config.get("output", {}).get("resume", False))
@@ -63,11 +125,13 @@ def main(argv: Sequence[str] | None = None) -> int:
records,
record_indexes,
download_source=not args.dry_run,
phase_timings=phase_timings,
)
except ValueError as exc:
parser.error(str(exc))
write_manifest(video_manifest_path, records)
_write_phase_timings(output_dir, phase_timings)
if args.dry_run:
return 0
@@ -93,6 +157,7 @@ def main(argv: Sequence[str] | None = None) -> int:
if record.get("status") == "sampled" and record.get("video_id")
}
changed_frame_video_ids: set[str] = set(backfilled_frame_video_ids)
with _timed_phase(phase_timings, "frame_sampling_seconds"):
for record in records:
if record.get("status") != "probed":
continue
@@ -114,6 +179,7 @@ def main(argv: Sequence[str] | None = None) -> int:
)
changed_frame_video_ids.add(video_id)
write_manifest(frame_manifest_path, frame_records)
_write_phase_timings(output_dir, phase_timings)
sampled_video_ids = {
str(record.get("video_id"))
@@ -133,11 +199,14 @@ def main(argv: Sequence[str] | None = None) -> int:
for record in frame_records
if str(record.get("video_id")) in clip_rebuild_video_ids
]
with _timed_phase(phase_timings, "clip_generation_seconds"):
clip_records.extend(build_clip_records(frames_to_build, config["clip"]))
write_manifest(output_dir / "clip_manifest.jsonl", clip_records)
_write_phase_timings(output_dir, phase_timings)
if args.until == "clips":
return 0
with _timed_phase(phase_timings, "inference_seconds"):
_run_inference(
clip_records,
records,
@@ -146,9 +215,12 @@ def main(argv: Sequence[str] | None = None) -> int:
limit_clips=args.limit_clips,
resume=resume_enabled,
)
_write_phase_timings(output_dir, phase_timings)
if args.until == "inference":
return 0
with _timed_phase(phase_timings, "aggregation_seconds"):
aggregate_outputs(output_dir, config)
_write_phase_timings(output_dir, phase_timings)
return 0
@@ -175,12 +247,19 @@ def _acquire_source_records(
record_indexes: dict[str, int],
*,
download_source: bool = True,
phase_timings: dict[str, object] | None = None,
) -> None:
for source_record in _source_video_records(
source_records = _measure_phase(
phase_timings,
"source_acquisition_seconds",
lambda: _source_video_records(
config,
output_dir,
download_source=download_source,
):
)
)
with _timed_phase(phase_timings, "video_probe_seconds"):
for source_record in source_records:
path = source_record.get("path")
if not path:
continue
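
The timing helper in this file adds elapsed time to any existing value for the same phase name and records it in a `finally` block, so a phase entered several times is summed and a phase that raises is still recorded. The sketch below isolates that behaviour in a simplified form; `timed_phase` and the flat `phases` dict are stand-ins for illustration and are not the commit's `_timed_phase`, which also tolerates a missing or malformed `phases` entry.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def timed_phase(phases: dict[str, float], name: str) -> Iterator[None]:
    """Add the wall-clock duration of the with-block to phases[name]."""
    started = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on exceptions, so failed phases are still timed.
        elapsed = time.perf_counter() - started
        phases[name] = round(phases.get(name, 0.0) + elapsed, 6)


phases: dict[str, float] = {}

# Entering the same phase repeatedly accumulates into one total.
for _ in range(3):
    with timed_phase(phases, "inference_seconds"):
        time.sleep(0.01)

# A phase that raises is recorded before the exception propagates.
try:
    with timed_phase(phases, "aggregation_seconds"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(phases)
```

Using `time.perf_counter()` rather than wall-clock timestamps keeps the measured durations monotonic even if the system clock is adjusted during a run.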


@@ -63,6 +63,7 @@ def build_clip_result(
"status": result_status,
"monitoring_timeline": timeline,
"events": _events(payload, clip_record) if result_status == "ok" else [],
"qsc_events": _qsc_events(payload) if result_status == "ok" else [],
"raw_response": raw_response,
"processing": processing_record,
"error": result_error,
@@ -131,6 +132,17 @@ def _event(
return normalized
def _qsc_events(payload: dict[str, Any]) -> list[dict[str, Any]]:
    raw_events = payload.get("qsc_events") or []
    if not isinstance(raw_events, list):
        return []
    return [
        dict(event)
        for event in raw_events
        if isinstance(event, dict)
    ]


def _video_path(video_record: dict[str, Any] | None) -> str | None:
if not video_record:
return None
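
The qsc_events pass-through added in this file is deliberately lenient: a missing, null, or non-list value yields an empty list, and non-dict entries inside a list are dropped. The sketch below reproduces that filtering logic under a different name (`filter_qsc_events`, a standalone copy for illustration only) to show how malformed model output behaves. Note that it checks shape, not field contents.

```python
from typing import Any


def filter_qsc_events(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only dict items from payload['qsc_events']; anything else becomes []."""
    raw_events = payload.get("qsc_events") or []
    if not isinstance(raw_events, list):
        return []
    # dict(event) makes a shallow copy so later edits do not touch the payload.
    return [dict(event) for event in raw_events if isinstance(event, dict)]


samples: dict[str, dict[str, Any]] = {
    "missing key": {},
    "null value": {"qsc_events": None},
    "string value": {"qsc_events": "none"},
    "single object": {"qsc_events": {"violation_code": "WGSJ0001"}},
    "mixed list": {"qsc_events": [{"violation_code": "WGSJ0002"}, "junk", 3]},
}

for label, payload in samples.items():
    print(label, "->", filter_qsc_events(payload))
```

Only the mixed list produces a non-empty result, containing the single dict entry; a bare object that is not wrapped in a list is discarded rather than coerced.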