Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync

- Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of
  ownCloud to avoid sync conflicts and bandwidth churn). config.py
  TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab
  mounts it world-readable for JupyterHub users.
- New scripts/export_video_db_index.py joins all_video_info_merged.xlsx
  with the video inventory and the on-disk DBs, producing a TSV that has
  one row per fly/ROI plus training/testing video and DB paths. Handles
  approximate xlsx times, cross-day training/testing, the 12 AM/PM
  ambiguity, and date typos.
- scripts/load_roi_data.py rewritten as a TSV-driven loader returning a
  single DataFrame with session and metadata columns. calculate_distances
  and the two flies_analysis notebooks migrated to use it; downstream
  trained/naive splits remain available via simple equality filters.
- Metadata vocabulary canonicalized: variants such as {naïve, niave,
  untrained, test} now all resolve to one of {trained, naive}.
  Normalization happens at the TSV-export boundary and is idempotent;
  the xlsx and the 2025-07-15 legacy CSV were also edited in place to
  remove the worst variants.
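- scripts/monitor_tracking.py rate calculation fixed: with N parallel
  workers, completions arrive in bursts; the old formula divided by the
  burst width and reported nonsense rates. It now counts completions in
  the last 6 h and divides by the window length (or by the time since
  the first DB while the batch is younger than 6 h).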
- scripts/track_videos.py: BGRMovieCamera now retries failed frame
  reads (cv2 VideoCapture.read) on transient NFS hiccups, and a
  post-tracking completeness gate (≥ 90 % of expected duration via
  MAX(t) across all 6 ROIs) deletes silently truncated DBs.
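
For illustration, the idempotent label normalization described in the
list above might look like the sketch below. This is not the code in
scripts/export_video_db_index.py: the function name normalize_group and
the variant table are hypothetical, and mapping "test" to "trained" in
particular is an assumption.

```python
import unicodedata

# Hypothetical variant table; the real mapping lives in the export script.
_VARIANTS = {
    "naive": "naive",
    "niave": "naive",
    "untrained": "naive",
    "trained": "trained",
    "test": "trained",  # assumption: "test" labels the trained group
}


def normalize_group(label: str) -> str:
    """Map a free-text group label to 'trained' or 'naive'."""
    # Fold case and whitespace, then strip diacritics ("Naïve " -> "naive").
    folded = unicodedata.normalize("NFKD", label.strip().lower())
    ascii_only = "".join(ch for ch in folded if not unicodedata.combining(ch))
    try:
        return _VARIANTS[ascii_only]
    except KeyError:
        # Fail loudly on unknown labels instead of guessing.
        raise ValueError(f"unknown group label: {label!r}") from None
```

Because the canonical labels are themselves keys of the table, applying
the function to already-normalized values leaves them unchanged, which
is what makes re-running the export safe.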

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
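
The completeness gate mentioned in the commit message can be sketched
as follows. This is a minimal illustration, not the code in
scripts/track_videos.py: the names is_complete and make_demo_db are
hypothetical, and the sketch assumes per-ROI tables named ROI_1..ROI_6
with a timestamp column t in milliseconds.

```python
import sqlite3


def is_complete(con, expected_secs, n_rois=6, min_fraction=0.9):
    """True if MAX(t) over all ROI tables covers min_fraction of the video."""
    latest_ms = 0
    for i in range(1, n_rois + 1):
        try:
            row = con.execute(f"SELECT MAX(t) FROM ROI_{i}").fetchone()
        except sqlite3.OperationalError:
            continue  # table missing: this ROI contributes no data
        if row[0] is not None:
            latest_ms = max(latest_ms, row[0])
    # Assumption: t is stored in milliseconds.
    return latest_ms / 1000.0 >= min_fraction * expected_secs


def make_demo_db(end_ms, n_rois=6):
    """Demo helper: in-memory DB whose ROI tables all end at end_ms."""
    con = sqlite3.connect(":memory:")
    for i in range(1, n_rois + 1):
        con.execute(f"CREATE TABLE ROI_{i} (t INTEGER)")
        con.executemany(f"INSERT INTO ROI_{i} (t) VALUES (?)", [(0,), (end_ms,)])
    return con
```

A caller would delete the DB file when is_complete returns False, so
that the video is picked up again on the next tracking run.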
Giorgio Gilestro 2026-04-30 15:20:14 +01:00
parent e4da7691d5
commit f60a9d0530
13 changed files with 569 additions and 237 deletions
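
The effect of the rate fix can be reproduced outside the monitor with a
small standalone sketch. The function names and the scenario are
illustrative assumptions: 12 workers on 3 h videos, chosen so that the
last 10 completions all fall inside a single burst.

```python
def old_rate_per_hour(history):
    """Old formula: last 10 completions divided by their own time span."""
    window = history[-10:]
    span = window[-1] - window[0]
    return (len(window) - 1) / span * 3600 if span > 0 else None


def new_rate_per_hour(history, now_ts, window_secs=6 * 3600):
    """New formula: completions in the window divided by the window length."""
    recent = [t for t in history if t >= now_ts - window_secs]
    if len(recent) < 2:
        return None  # still warming up
    if len(recent) == len(history):
        # Batch is younger than the window: use time since the first DB.
        elapsed = max(1.0, now_ts - history[0])
    else:
        elapsed = float(window_secs)
    return len(recent) / elapsed * 3600


# Three bursts of 12 completions, 5 s apart, at 3 h, 6 h and 9 h.
history = [h * 3600 + 5 * k for h in (3, 6, 9) for k in range(12)]
now_ts = 9 * 3600 + 600  # ten minutes after the last burst started

print(old_rate_per_hour(history))          # about 720 videos/hour
print(new_rate_per_hour(history, now_ts))  # about 4 videos/hour
```

The true throughput in this scenario is 12 videos per 3 h, i.e. 4 per
hour, which the windowed formula recovers; the old formula measures
only the spacing inside one burst.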

@@ -97,13 +97,32 @@ def snapshot() -> str:
     )
     lines.append(f" errors in log: {len(errors)}")
-    # Rate from the last 10 completions, when available.
-    if len(history) >= 2:
-        window = history[-min(10, len(history)) :]
-        span = window[-1] - window[0]
-        if span > 0:
-            rate_per_hour = (len(window) - 1) / span * 3600
-            lines.append(f" rate (last {len(window) - 1}): {rate_per_hour:.1f} videos/hour")
+    # Rate from completions in the last 6 h — robust to gaps from killed /
+    # restarted runs, while wide enough to span multiple parallel-worker
+    # completion bursts. Reason: with 8 workers all started together on
+    # multi-hour videos, completions arrive in tight bursts every ~video-
+    # length apart; a 30-min window catches one burst and overestimates by
+    # ~10×. 6 h spans at least one full burst cycle for typical videos.
+    now_ts = time.time()
+    window_secs = 6 * 3600
+    recent = [t for t in history if t >= now_ts - window_secs]
+    if len(recent) >= 2:
+        # Reason: with N parallel workers, completions arrive in clumps
+        # (all workers finish near-simultaneously). Dividing N by the *burst*
+        # span gives nonsense rates. Use the full window as the denominator
+        # once the batch has been running long enough to fill it; otherwise
+        # use elapsed-since-first-DB. Detection: if every DB on disk also
+        # falls inside the window, the batch is younger than the window.
+        if len(recent) == len(history):
+            elapsed = max(1.0, now_ts - history[0])
+        else:
+            elapsed = float(window_secs)
+        if elapsed > 0:
+            rate_per_hour = len(recent) / elapsed * 3600
+            lines.append(
+                f" rate (last {len(recent)} in {int(window_secs/3600)} h):"
+                f" {rate_per_hour:.1f} videos/hour"
+            )
             remaining = max(0, pickable - tracked)
             if rate_per_hour > 0 and remaining > 0:
                 eta_sec = remaining * 3600 / rate_per_hour
@@ -112,6 +131,8 @@ def snapshot() -> str:
                     f" ETA remaining: {fmt_duration(eta_sec)} "
                     f"(done by {eta_at:%H:%M %a})"
                 )
+    else:
+        lines.append(" rate: (warming up — check again in a few min)")
     if last_mtime is not None and last_name is not None:
         ago = (datetime.now() - last_mtime).total_seconds()