Files

yangyl ef0047af6d Initial video AI analysis project

2026-06-17 11:33:54 +08:00

22 KiB

Raw Permalink Blame History

Project Documentation

Goal

本项目是在 /Users/yoilun/AI-train/video-ai-analysis-poc 中实现视频离线批处理分析 PoC。v1.0 已支持本地视频文件夹；v1.1 新增海康云存储录像下载作为视频来源，下载完成后复用现有抽帧、clip、VLM 推理和聚合流程。

必须支持：

选择一个本地视频文件夹。
直接调用海康云存储录像下载 API 获取录像下载地址并下载视频。
AccessToken 通过 config 或环境变量配置，不写入测试夹具和文档样例。
设备序列号和通道可配置，并支持多设备。
分析时间段包含年月日，支持 YYYY-MM-DD HH:MM:SS 配置。
海康 API 单次最多下载 1 小时，超过 1 小时的时间段必须拆成多个不超过 3600 秒的请求；默认示例使用 600 秒分片，真实 smoke 中比 3600 秒更稳定。
自动发现文件夹内所有常见视频文件。
对每个视频按 1 FPS 抽帧，按 10-20 秒 clip 组织输入。
使用已有 4B VLM 模型能力，兼容 memai-zhengxin-v3-20260413 的 OpenAI-compatible vLLM 接口。
prompt 通过 config 调整。
输出结构化 JSON/JSONL。
输出中必须包含监控画面的时间轴，包括视频、clip、frame 和事件的时间定位。

v1.1 Hik Cloud Storage Source

海康文档 录像下载流程_1.pdf 的“2、获取录像下载地址”定义：

POST https://api2.hik-cloud.com/v1/carrier/cstorage/open/play/download
Authorization: bearer <AccessToken>
Content-Type: application/json

请求 body：

{
  "deviceSerial": "EXAMPLE_DEVICE_SERIAL",
  "channelNo": 1,
  "timeBegin": 1764856787,
  "timeEnd": 1764856978
}

成功返回 data.url、actualBeginTime、actualEndTime。错误码 80430002 包含起止时间大于 3600 秒的参数错误，错误码 80438027 表示起始时间内没有录像。

配置示例：

source:
  mode: hik_cloud  # local | hik_cloud

hik_cloud:
  api_base_url: https://api2.hik-cloud.com
  download_path: /v1/carrier/cstorage/open/play/download
  access_token: null
  access_token_env: HIK_CLOUD_ACCESS_TOKEN
  chunk_seconds: 600
  timeout_seconds: 60
  download_timeout_seconds: 600
  devices:
    - device_serial: EXAMPLE_DEVICE_SERIAL
      channel_no: 1
      name: store-front
  time_ranges:
    - begin: "2026-02-03 09:00:00"
      end: "2026-02-03 11:30:00"

云下载输出：

hik_cloud_download_manifest.jsonl：每个设备/通道/时间分片的请求、实际时间、状态和错误。--dry-run 云模式只请求下载地址并写入 address_ok / failure 状态，不下载 mp4，不 probe。
downloads/hik_cloud/<device_serial>/ch<channel_no>/*.mp4：下载后供现有分析链路消费的视频文件。
video_manifest.jsonl：保留现有契约，并附加云来源元数据。

运行本地文件夹模式：

python3 -B -m video_ai_analysis_poc.cli \
  --config config/local_batch.yaml \
  --input-dir /path/to/local/videos \
  --output-dir ./outputs/local-batch

运行海康云存储模式时，复制配置文件并设置 source.mode: hik_cloud，AccessToken 优先通过环境变量提供：

export HIK_CLOUD_ACCESS_TOKEN='<redacted>'
python3 -B -m video_ai_analysis_poc.cli \
  --config /path/to/hik-cloud.yaml \
  --output-dir ./outputs/hik-cloud

--dry-run 会请求海康下载地址并写 hik_cloud_download_manifest.jsonl，但不会下载视频文件、probe、抽帧、推理或聚合。--until clips 会在下载、探测、抽帧和 clip manifest 后停止；--until inference 会继续运行模型推理并写入 clip_results.jsonl。

真实远端 smoke 观察到同一 1 小时时间段直接按 3600 秒下载时，云端返回的 MP4 缺少 moov atom，ffprobe 无法解析；改用 600 秒分片后 6 个分片均可探测并进入抽帧。抽帧阶段会根据云下载记录的 actual_begin/actual_end 或 requested_begin/requested_end 给 FFmpeg 加输出帧数上限，避免海康 MP4 异常时间戳导致 fps=1 复制出过量帧。

海康云存储安全规则：

不提交真实 AccessToken。
优先使用 hik_cloud.access_token_env: HIK_CLOUD_ACCESS_TOKEN。
不记录 Authorization header。
不持久化签名下载 URL query，例如 sign、sig、token、access_token。
access_token.md 是敏感验证文件，只能用于远端真实 smoke，不复制进文档、测试或输出样例。

Directory Boundaries

/Users/yoilun/AI-train/video-ai-analysis-poc
  本次 PoC 项目目录，后续代码、配置、计划、文档都放这里。

/Users/yoilun/AI-train/zhengxin-vlm-0413
  外部模型和参考实现目录，不是本次项目目录。

硬性边界：

不在 zhengxin-vlm-0413 中创建本项目文件。
不修改 zhengxin-vlm-0413/models/**。
不修改 zhengxin-vlm-0413/service/config.yaml、service/config.yaml-bk、docker/.env。
不把参考项目真实 RTSP、Webhook、token、Cookie、密码写入本项目示例配置、测试夹具、文档或输出样例。
输出目录只能是用户显式传入目录，或本项目内 outputs/。
不覆盖用户原始视频文件。

Inference Architecture Decision

本 PoC 明确选择：

OpenAI-compatible vLLM API

不在 PoC 第一版中直接加载 PyTorch + Transformers + PEFT。原因：

用户说明测试环境已有模型。
参考项目已经使用 vLLM OpenAI-compatible API。
本地视频批处理的主要目标是打通工程链路，而不是重新实现模型服务。

配置字段固定为：

vlm:
  api_base_url: http://localhost:8679
  chat_completions_path: /v1/chat/completions

代码拼接规则：

chat_url = api_base_url.rstrip("/") + chat_completions_path

不要在配置中同时传完整 endpoint 和 base URL，避免出现 /v1/chat/completions/v1/chat/completions 之类的双拼路径。

Target File Structure

video-ai-analysis-poc/
  agent.md
  task_plan.md
  findings.md
  progress.md
  memories.md
  video_ai_analysis_system_plan.md
  config/
    local_batch.yaml
  video_ai_analysis_poc/
    __init__.py
    cli.py
    config.py
    paths.py
    discovery.py
    probe.py
    ffmpeg_sampler.py
    frames.py
    clips.py
    vlm_client.py
    result_parser.py
    aggregator.py
    manifest.py
    logging_utils.py
  schemas/
    clip_result.schema.json
    video_result.schema.json
    folder_summary.schema.json
  tests/
    test_config.py
    test_discovery.py
    test_probe.py
    test_clips.py
    test_result_parser.py
    test_aggregator.py
  outputs/
    .gitkeep

Module Boundaries

`config.py`

加载 config/local_batch.yaml。
合并 CLI 参数覆盖项。
校验必填字段、数值范围、路径安全。
不访问视频、不调用 FFmpeg、不调用模型。

`paths.py`

生成稳定 video_id、clip_id。
生成输出目录结构。
防止输出目录指向参考模型目录或覆盖输入视频目录。

`discovery.py`

只负责按 input.dir、recursive、extensions 发现视频。
输出 video_manifest.jsonl。
不做 ffprobe，不做抽帧，不调用模型。

`probe.py`

包装 ffprobe。
输出 duration_seconds、codec_name、width、height、fps、format_name、start_time。
损坏或不支持视频标记 probe_failed，记录 last_error，不阻塞其他视频。

`ffmpeg_sampler.py`

使用 FFmpeg + NVDEC 做 1 FPS 抽帧。
根据 codec 选择 h264_cuvid / hevc_cuvid。
默认 allow_cpu_fallback: false。
输出 JPEG 和 frame_manifest.jsonl。
保存 FFmpeg stderr 摘要，作为实际使用 GPU 解码的证据。

`frames.py`

计算 frame 的相对秒数和 timecode。
维护 frame 文件路径、offset、timecode。
优先使用可获得的 pts_time，否则使用抽帧序号按 FPS 推导相对时间。

`clips.py`

读取 frame_manifest.jsonl。
按 clip.length_seconds 和 clip.stride_seconds 构建 clip。
从 1 FPS 帧中均匀采样 frames_per_clip。
输出 clip_manifest.jsonl，必须包含参与推理的实际帧时间。

`vlm_client.py`

调用 OpenAI-compatible /v1/chat/completions。
多帧使用 image_url，默认 data:image/jpeg;base64。
prompt 来自 config，不硬编码。
不解析业务事件，只返回 raw response、latency 和 HTTP 状态。
阶段 4 实现使用 Python 标准库 urllib，并暴露可注入 HTTP 函数以便测试 mock；默认 URL 拼接为 vlm.api_base_url.rstrip("/") + vlm.chat_completions_path。

`result_parser.py`

从 raw response 中提取严格 JSON。
校验 schema_version、events、screen_time、事件枚举等字段。
解析失败触发一次严格 prompt 重试。
仍失败写 parse_failed，保留 raw_response。
阶段 4 实现支持 raw JSON、markdown/prose 中嵌入 JSON，输出 clip 级 monitoring_timeline、events、raw_response、processing 和 error 字段。

`aggregator.py`

消费 video_manifest.jsonl、clip_manifest.jsonl 和 clip_results.jsonl。
聚合为 videos/<video_id>/video_result.json 和输出根目录下的 folder_summary.json。
按 merge_gap_seconds 合并同视频、同类型、相邻时间范围接近的事件。
保留事件相对时间轴、screen_time、clip evidence 和 frame evidence。
统计 parse_failed / inference_failed clip 数量。

`manifest.py`

负责 JSONL 读写和状态字段。
支持断点续跑。
每条记录包含 status、retry_count、last_error。

Config Schema

config/local_batch.yaml 建议字段：

input:
  dir: /path/to/videos
  recursive: true
  extensions: [".mp4", ".mov", ".mkv", ".avi", ".flv", ".ts", ".m4v"]

source:
  mode: local

output:
  dir: ./outputs/local-batch
  overwrite: false
  resume: true
  keep_frames: true

hik_cloud:
  api_base_url: https://api2.hik-cloud.com
  download_path: /v1/carrier/cstorage/open/play/download
  access_token: null
  access_token_env: HIK_CLOUD_ACCESS_TOKEN
  chunk_seconds: 600
  timeout_seconds: 60
  download_timeout_seconds: 600
  devices:
    - device_serial: EXAMPLE_DEVICE_SERIAL
      channel_no: 1
      name: example-device
  time_ranges:
    - begin: "2026-02-03 09:00:00"
      end: "2026-02-03 10:00:00"

ffprobe:
  timeout_seconds: 30

ffmpeg:
  prefer_nvdec: true
  allow_cpu_fallback: false
  hwaccel: cuda
  codec_decoders:
    h264: h264_cuvid
    hevc: hevc_cuvid
  frame_fps: 1
  frame_width: 640
  jpeg_quality: 4
  timeout_seconds_per_video: 3600

clip:
  length_seconds: 10
  stride_seconds: 10
  frames_per_clip: 8
  min_frames_per_clip: 4

vlm:
  api_base_url: http://localhost:8679
  chat_completions_path: /v1/chat/completions
  model: memai-zhengxin-v3-20260413
  timeout_seconds: 120
  max_tokens: 512
  temperature: 0
  batch_size: 1
  image_transport: data_uri
  retries: 1

prompt:
  system: "You are a store video analysis assistant. Return strict JSON only."
  user: "Analyze this clip. Return events and screen_time. If no event, return events: []."

schema:
  version: local-batch-v1
  event_types:
    - customer_enter
    - customer_leave
    - queue_detected
    - staff_absent
    - staff_present
    - area_crowded
    - abnormal_behavior
    - unknown
  require_strict_json: true
  parse_retry: 1
  merge_gap_seconds: 30

runtime:
  timezone: Asia/Shanghai
  log_level: INFO

File Contracts

`video_manifest.jsonl`

One line per discovered video:

{
  "video_id": "stable_hash_or_slug",
  "source_path": "/path/to/video.mp4",
  "status": "pending",
  "probe": null,
  "retry_count": 0,
  "last_error": null
}

`frame_manifest.jsonl`

One line per sampled frame:

{
  "video_id": "stable_hash_or_slug",
  "frame_id": "stable_hash_or_slug_f000120",
  "frame_path": "frames/stable_hash_or_slug/000120.jpg",
  "offset_seconds": 120.0,
  "timecode": "00:02:00",
  "pts_time": 120.0,
  "status": "sampled"
}

`clip_manifest.jsonl`

One line per clip:

{
  "video_id": "stable_hash_or_slug",
  "clip_id": "stable_hash_or_slug_c000012",
  "clip_start_seconds": 120.0,
  "clip_end_seconds": 130.0,
  "clip_start_timecode": "00:02:00",
  "clip_end_timecode": "00:02:10",
  "frame_times": [
    {
      "frame_path": "frames/stable_hash_or_slug/000120.jpg",
      "offset_seconds": 120.0,
      "timecode": "00:02:00"
    }
  ],
  "status": "pending",
  "retry_count": 0,
  "last_error": null
}

`clip_results.jsonl`

One line per inferred clip:

{
  "schema_version": "local-batch-v1",
  "video_id": "stable_hash_or_slug",
  "video_path": "/path/to/video.mp4",
  "clip_id": "stable_hash_or_slug_c000012",
  "status": "ok",
  "monitoring_timeline": {
    "timezone": "Asia/Shanghai",
    "video_start_time": null,
    "clip_start_seconds": 120.0,
    "clip_end_seconds": 130.0,
    "clip_start_timecode": "00:02:00",
    "clip_end_timecode": "00:02:10",
    "frame_times": [
      {
        "frame_path": "frames/stable_hash_or_slug/000120.jpg",
        "offset_seconds": 120.0,
        "timecode": "00:02:00"
      }
    ],
    "screen_time": "2026-06-14 12:31:20"
  },
  "events": [
    {
      "event_type": "queue_detected",
      "start_time": null,
      "end_time": null,
      "start_offset_seconds": 120.0,
      "end_offset_seconds": 130.0,
      "confidence": 0.86,
      "severity": "medium",
      "attributes": {},
      "evidence": {
        "clip_id": "stable_hash_or_slug_c000012",
        "frame_paths": ["frames/stable_hash_or_slug/000120.jpg"]
      }
    }
  ],
  "raw_response": null,
  "processing": {
    "started_at": "2026-06-15T10:00:00+08:00",
    "finished_at": "2026-06-15T10:00:02+08:00",
    "latency_ms": 1800
  },
  "error": null
}

`video_result.json`

Written to:

videos/<video_id>/video_result.json

Required top-level fields:

schema_version
video_id
video_path
probe
monitoring_timeline.video_start_time
monitoring_timeline.video_duration_seconds
clip_count
failed_clip_count
event_counts
events
outputs.clip_results_jsonl
processing

`folder_summary.json`

Required top-level fields:

schema_version
input_dir
video_count
processed_video_count
failed_video_count
event_counts
videos
processing

Timeline Rules

时间轴必须区分三类时间：

视频相对时间：offset_seconds、timecode。
画面 OCR 时间：screen_time 或模型输出里的 画面时间。
处理时间：processing.started_at、processing.finished_at。

本地视频没有可靠业务开始时间时：

video_start_time 必须为 null。
不允许伪造绝对时间。
事件必须保留 start_offset_seconds 和 end_offset_seconds。

参与推理的实际帧时间必须写入 frame_times。不能只写 clip 起止时间。

Reference Code Usage

可以参考：

zhengxin-vlm-0413/shared/vlm_client.py 的 OpenAI-compatible payload 结构。
zhengxin-vlm-0413/shared/frame_utils.py 的 base64 data URI 处理方式。
zhengxin-vlm-0413/service/config.yaml 的 prompt 配置风格。

不能直接复用为核心实现：

frame_utils.extract_frames_from_video，因为它是整段均匀抽 8 帧，不满足 1 FPS、clip manifest、时间轴要求。
vlm_client.extract_action，因为它只解析 Action，不能覆盖本项目完整事件和时间轴 schema。
rtsp_service.py 主循环，因为它服务实时 RTSP，不适合离线文件夹批处理。

Validation Matrix

Phase 1 Architecture Validation

阶段 1 complete 条件：

docs/project.md 固化模块边界、文件输出契约、config schema、时间轴 schema、安全边界和验证矩阵。
推理接口选择已明确为 OpenAI-compatible vLLM。
API URL 字段语义已固定为 api_base_url + chat_completions_path。
已声明参考 frame_utils.py / vlm_client.py 哪些可借鉴、哪些不能直接复用。
已列出阶段 2-6 的 smoke test 输入、命令、期望输出字段和失败判定标准。
子 agent 审查结论记录到 progress.md。

Phase 2 Validation

目标：本地视频发现、ffprobe、manifest、CLI 骨架。

命令：

python3 -m py_compile video_ai_analysis_poc/*.py
python3 -m video_ai_analysis_poc.cli --config config/local_batch.yaml --input-dir /path/to/videos --output-dir ./outputs/local-batch --dry-run

期望：

生成 video_manifest.jsonl。
损坏/不支持视频被标记失败，不阻塞其他视频。
不读取或写入参考模型目录。

Phase 3 Validation

目标：FFmpeg/NVDEC 1 FPS 抽帧和 clip 构建。

命令：

ffmpeg -hwaccels
ffmpeg -decoders | grep cuvid
python3 -m video_ai_analysis_poc.cli --config config/local_batch.yaml --input-dir /path/to/short-videos --output-dir ./outputs/local-batch --until clips

期望：

对一个样例视频实际运行带 -hwaccel cuda 和 h264_cuvid 或 hevc_cuvid 的抽帧命令。
保存 FFmpeg stderr 或日志中的解码器证据。
生成 frame_manifest.jsonl 和 clip_manifest.jsonl。
clip_manifest.jsonl 包含 frame_times。

Phase 4 Validation

目标：vLLM OpenAI-compatible API、prompt 配置、JSON 解析重试。

命令：

curl http://localhost:8679/v1/models
python3 -m video_ai_analysis_poc.cli --config config/local_batch.yaml --input-dir /path/to/short-videos --output-dir ./outputs/local-batch --until inference --limit-clips 3

期望：

prompt 从 config 读取。
请求 URL 使用 api_base_url + chat_completions_path。
生成 clip_results.jsonl。
每条结果包含 monitoring_timeline.frame_times 和 screen_time 字段。

Phase 5 Validation

目标：clip/video/folder 聚合和 schema 校验。

命令：

python3 -m video_ai_analysis_poc.cli --config config/local_batch.yaml --input-dir /path/to/short-videos --output-dir ./outputs/local-batch
python3 -m json.tool ./outputs/local-batch/folder_summary.json >/dev/null

期望：

默认 CLI 运行不传 --dry-run 或 --until 时，会执行到 inference 并继续 aggregation。
--until clips 和 --until inference 仍停在各自阶段，不写聚合输出。
生成 videos/<video_id>/video_result.json。
生成 folder_summary.json。
事件聚合保留相对时间轴。
JSON 可被标准工具解析。

Phase 6 Validation

目标：测试环境 smoke test 与文档更新。

远端环境：

ssh xiaozheng@192.168.5.100
/home/xiaozheng/video-ai-analysis-poc

模型服务：

ssh xiaozheng@192.168.5.100 'curl http://localhost:8679/v1/models'

当前服务状态：

容器：zhengxin-vllm
镜像：vllm/vllm-openai:v0.14.1
端口：8679
模型：memai-zhengxin-v3-20260413
模型目录挂载：/home/xiaozheng/zhengxin-vlm-0413/models:/models:ro

远端能力验证命令：

ssh xiaozheng@192.168.5.100 'nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader'
ssh xiaozheng@192.168.5.100 'ffmpeg -hwaccels'
ssh xiaozheng@192.168.5.100 'ffmpeg -decoders'

已验证：

GPU: NVIDIA GeForce RTX 3080, 20480 MiB, driver 595.71.05。
FFmpeg 6.1.1 支持 cuda hwaccel。
FFmpeg decoders 包含 h264_cuvid 和 hevc_cuvid。
/v1/models 返回模型 id memai-zhengxin-v3-20260413。
/v1/chat/completions 安全 quoted health check 返回 OK。

远端 smoke 输入：

/tmp/video-ai-analysis-poc-smoke.h1cZUR/input/sample_h264.mp4

远端 smoke 输出：

/tmp/video-ai-analysis-poc-smoke.h1cZUR/output

远端批处理命令：

ssh xiaozheng@192.168.5.100 'PYTHONPATH=/home/xiaozheng/video-ai-analysis-poc python3 -B -m unittest discover -s /home/xiaozheng/video-ai-analysis-poc/tests -v'
ssh xiaozheng@192.168.5.100 'python3 -B -m compileall -q /home/xiaozheng/video-ai-analysis-poc/video_ai_analysis_poc'
ssh xiaozheng@192.168.5.100 'PYTHONPATH=/home/xiaozheng/video-ai-analysis-poc python3 -B -m video_ai_analysis_poc.cli --config /home/xiaozheng/video-ai-analysis-poc/config/local_batch.yaml --input-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/input --output-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/output --until clips'
ssh xiaozheng@192.168.5.100 'PYTHONPATH=/home/xiaozheng/video-ai-analysis-poc python3 -B -m video_ai_analysis_poc.cli --config /home/xiaozheng/video-ai-analysis-poc/config/local_batch.yaml --input-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/input --output-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/output --until inference --limit-clips 1'
ssh xiaozheng@192.168.5.100 'PYTHONPATH=/home/xiaozheng/video-ai-analysis-poc python3 -B -m video_ai_analysis_poc.cli --config /home/xiaozheng/video-ai-analysis-poc/config/local_batch.yaml --input-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/input --output-dir /tmp/video-ai-analysis-poc-smoke.h1cZUR/output'

已验证输出：

video_manifest.jsonl: 1 条视频记录。
frame_manifest.jsonl: 12 条 sampled frame 记录。
clip_manifest.jsonl: 1 条 clip 记录。
frame manifest 中持久化 hwaccel: cuda、decoder: h264_cuvid、ffmpeg_command 和 FFmpeg stderr 摘要。
clip_results.jsonl: 1 条记录，status: ok，包含 monitoring_timeline.frame_times。
videos/<video_id>/video_result.json: JSON 可解析，failed_clip_count: 0。
folder_summary.json: JSON 可解析，video_count: 1、processed_video_count: 1。
本地视频没有可靠业务开始时间时，monitoring_timeline.video_start_time 输出 null；ffprobe 的 start_time: 0.0 只保留在 probe。

远端验证约束：

只写入明确输出目录。
不覆盖远端已有模型、配置和视频。
不复制真实凭据到日志或文档。

Known Risks

HEVC decoder 可用性已验证，但实际 smoke 只覆盖 H.264 样例视频。
24 小时真实门店视频吞吐量尚未压测。
海康云眸云录像/RTSP 接入仍在当前本地文件夹 PoC 范围之外。
本地视频可能没有画面内时间戳，必须同时保留相对时间。
模型事件质量尚未用真实门店素材验收；合成测试图没有业务事件，输出空事件是合理结果。
远端 vLLM 容器当前为手工启动，不是生产级 systemd/compose 托管。

22 KiB Raw Permalink Blame History Unescape Escape

Project Documentation

Goal

v1.1 Hik Cloud Storage Source

Directory Boundaries

Inference Architecture Decision

Target File Structure

Module Boundaries

config.py

paths.py

discovery.py

probe.py

ffmpeg_sampler.py

frames.py

clips.py

vlm_client.py

result_parser.py

aggregator.py

manifest.py