Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync

- Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of
  ownCloud to avoid sync conflicts and bandwidth churn). config.py
  TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab
  mounts it world-readable for JupyterHub users.
- New scripts/export_video_db_index.py joins all_video_info_merged.xlsx
  with the video inventory and the on-disk DBs, producing a TSV that has
  one row per fly/ROI plus training/testing video and DB paths. Handles
  approximate xlsx times, cross-day training/testing, the 12 AM/PM
  ambiguity, and date typos.
- scripts/load_roi_data.py rewritten as a TSV-driven loader returning a
  single DataFrame with session and metadata columns. calculate_distances
  and the two flies_analysis notebooks migrated to use it; downstream
  trained/naive splits remain available via simple equality filters.
- Metadata vocabulary canonicalized: variants such as {naïve, niave,
  untrained, test} all resolve to one of {trained, naive}. Normalization
  happens at the TSV-export boundary (idempotent); the xlsx and the
  2025-07-15 legacy CSV were edited in place to remove the worst variants.
- scripts/monitor_tracking.py rate calculation fixed: with N parallel
  workers, completions arrive in bursts; the old formula divided by the
  burst width and reported nonsense rates. It now counts completions in
  the last 6 h and divides by the full window, or by the time since the
  first DB when the batch is younger than the window.
- scripts/track_videos.py: BGRMovieCamera retries cv2.read on transient
  NFS hiccups, and a post-tracking completeness gate (≥ 90 % of expected
  duration via MAX(t) across all 6 ROIs) deletes partial DBs that would
  otherwise pass silently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
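
The completeness gate mentioned for scripts/track_videos.py could look roughly like the sketch below. It is an illustration only, not the code in this commit: the function names, the ROI_1..ROI_6 table names, the millisecond unit of t, and the choice to take the largest MAX(t) over the ROIs are all assumptions.

```python
import sqlite3

MIN_FRACTION = 0.90  # threshold quoted in the commit message
N_ROIS = 6           # number of ROIs quoted in the commit message


def max_tracked_ms(conn: sqlite3.Connection, n_rois: int = N_ROIS) -> int:
    """Largest t over all ROI tables; missing or empty tables count as 0.

    Assumes tables named ROI_1..ROI_n with a numeric column t (hypothetical
    layout, used here for illustration).
    """
    latest = 0
    for i in range(1, n_rois + 1):
        try:
            row = conn.execute(f"SELECT MAX(t) FROM ROI_{i}").fetchone()
        except sqlite3.OperationalError:
            continue  # table does not exist in this DB
        if row and row[0] is not None:
            latest = max(latest, int(row[0]))
    return latest


def is_complete(
    conn: sqlite3.Connection,
    expected_ms: float,
    min_fraction: float = MIN_FRACTION,
) -> bool:
    """True when tracking reached at least min_fraction of the expected duration."""
    if expected_ms <= 0:
        return False
    return max_tracked_ms(conn) >= min_fraction * expected_ms
```

A caller would compute expected_ms from the video duration and delete the DB file when is_complete returns False.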
2026-04-30 15:20:14 +01:00
commit f60a9d0530
parent e4da7691d5
13 changed files with 569 additions and 237 deletions
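
The rate fix described in the commit message can be illustrated with a small standalone sketch. This is not the code from scripts/monitor_tracking.py: the function name completion_rate_per_hour and its signature are hypothetical, and history is assumed to be a sorted list of completion timestamps in seconds.

```python
import time
from typing import Optional, Sequence

WINDOW_SECS = 6 * 3600  # 6 h window, as in the commit message


def completion_rate_per_hour(
    history: Sequence[float],
    now_ts: Optional[float] = None,
    window_secs: float = WINDOW_SECS,
) -> Optional[float]:
    """Return videos/hour, or None while fewer than two completions are recent."""
    if now_ts is None:
        now_ts = time.time()
    recent = [t for t in history if t >= now_ts - window_secs]
    if len(recent) < 2:
        return None
    if len(recent) == len(history):
        # Every completion falls inside the window, so the batch is younger
        # than the window: divide by the time since the first completion.
        elapsed = max(1.0, now_ts - history[0])
    else:
        # The batch is older than the window: divide by the full window.
        elapsed = float(window_secs)
    return len(recent) / elapsed * 3600
```

With eight completions one second apart and two hours elapsed since the first, this gives 4.0 videos/hour. The old formula would divide 7 intervals by the 7 s burst span and report 3600 videos/hour.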
--- a/scripts/monitor_tracking.py
+++ b/scripts/monitor_tracking.py
@@ -97,13 +97,32 @@ def snapshot() -> str:
    )
    lines.append(f"  errors in log:     {len(errors)}")

-    # Rate from the last 10 completions, when available.
-    if len(history) >= 2:
-        window = history[-min(10, len(history)) :]
-        span = window[-1] - window[0]
-        if span > 0:
-            rate_per_hour = (len(window) - 1) / span * 3600
-            lines.append(f"  rate (last {len(window) - 1}):    {rate_per_hour:.1f} videos/hour")
+    # Rate from completions in the last 6 h — robust to gaps from killed /
+    # restarted runs, while wide enough to span multiple parallel-worker
+    # completion bursts. Reason: with 8 workers all started together on
+    # multi-hour videos, completions arrive in tight bursts every ~video-
+    # length apart; a 30-min window catches one burst and overestimates by
+    # ~10×. 6 h spans at least one full burst cycle for typical videos.
+    now_ts = time.time()
+    window_secs = 6 * 3600
+    recent = [t for t in history if t >= now_ts - window_secs]
+    if len(recent) >= 2:
+        # Reason: with N parallel workers, completions arrive in clumps
+        # (all workers finish near-simultaneously). Dividing N by the *burst*
+        # span gives nonsense rates. Use the full window as the denominator
+        # once the batch has been running long enough to fill it; otherwise
+        # use elapsed-since-first-DB. Detection: if every DB on disk also
+        # falls inside the window, the batch is younger than the window.
+        if len(recent) == len(history):
+            elapsed = max(1.0, now_ts - history[0])
+        else:
+            elapsed = float(window_secs)
+        if elapsed > 0:
+            rate_per_hour = len(recent) / elapsed * 3600
+            lines.append(
+                f"  rate (last {len(recent)} in {int(window_secs/3600)} h):"
+                f"    {rate_per_hour:.1f} videos/hour"
+            )
            remaining = max(0, pickable - tracked)
            if rate_per_hour > 0 and remaining > 0:
                eta_sec = remaining * 3600 / rate_per_hour
@@ -112,6 +131,8 @@ def snapshot() -> str:
                    f"  ETA remaining:     {fmt_duration(eta_sec)}  "
                    f"(done by {eta_at:%H:%M %a})"
                )
+    else:
+        lines.append("  rate:              (warming up — check again in a few min)")

    if last_mtime is not None and last_name is not None:
        ago = (datetime.now() - last_mtime).total_seconds()