From e4da7691d5d83c8161986efef720eb25b6af298d Mon Sep 17 00:00:00 2001 From: Giorgio Date: Mon, 27 Apr 2026 17:25:26 +0100 Subject: [PATCH 1/4] Add offline tracking pipeline for video backlog MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 2024 video set in all_video_info_merged.xlsx covers 63 (date, machine) sessions — 129 video instances — that have no auto-detectable targets, so ROI placement requires manual reference-point selection. This commit adds the three-stage pipeline that lets a user click for an hour, then walk away while the tracker grinds overnight: 1. build_video_inventory.py — scan /mnt/ethoscope_data/videos/ and join against the xlsx, producing data/metadata/video_inventory.csv 2. pick_targets.py — interactive matplotlib/Tk picker. User clicks TOP/CORNER/LEFT (the L-shape ethoscope expects); after the third click the 6 ROI rectangles are drawn on top of the frame so geometry can be verified before saving. Also supports marking a video 'unusable' (FOV wrong) so it's permanently skipped, frame stepping by ±1s/±5%/midpoint, point editing in --redo mode, and a crosshair cursor that survives matplotlib's per-motion cursor reset. 3. track_videos.py — headless batch tracker. Reads the JSON sidecars, builds 6 ROIs from the HD-mating-arena geometry, runs MultiFlyTracker against the merged.mp4 via MovieVirtualCamera, writes SQLite DBs to data/tracked/. Idempotent (skips done DBs), parallel via --jobs, subclasses MovieVirtualCamera so frames stay BGR (MultiFlyTracker calls cvtColor(BGR2GRAY) without checking channel count). Plus auto_detect_targets.py (fallback that runs ethoscope's auto-detector in case any videos do have visible target dots), monitor_tracking.py (progress + ETA from data/tracked/ ground truth, --watch for live view), and tracking_geometry.py (single source of truth for the affine math shared by picker and tracker). 
requirements-tracking.txt lists minimum versions for the extra deps (opencv-python, openpyxl, gitpython, netifaces, mysql-connector-python, pyserial); these are only needed for the tracking pipeline, not the existing analysis notebooks. Verified end-to-end on one of the user-picked videos: ~4000 rows/ROI in a 120s slice, fly bounding boxes in the expected 800-2000 px² band. Co-Authored-By: Claude Opus 4.7 --- .gitignore | 9 + README.md | 26 ++ requirements-tracking.txt | 11 + scripts/auto_detect_targets.py | 119 ++++++++ scripts/build_video_inventory.py | 150 ++++++++++ scripts/config.py | 8 + scripts/monitor_tracking.py | 155 ++++++++++ scripts/pick_targets.py | 467 +++++++++++++++++++++++++++++++ scripts/track_videos.py | 218 +++++++++++++++ scripts/tracking_geometry.py | 71 +++++ tasks/todo.md | 62 ++++ 11 files changed, 1296 insertions(+) create mode 100644 requirements-tracking.txt create mode 100644 scripts/auto_detect_targets.py create mode 100644 scripts/build_video_inventory.py create mode 100644 scripts/monitor_tracking.py create mode 100644 scripts/pick_targets.py create mode 100644 scripts/track_videos.py create mode 100644 scripts/tracking_geometry.py diff --git a/.gitignore b/.gitignore index 50e96cf..02d5434 100644 --- a/.gitignore +++ b/.gitignore @@ -2,6 +2,15 @@ data/raw/*.db data/processed/*.csv +# Offline-tracking outputs (reproducible from videos + target JSONs) +data/tracked/*.db +data/tracked/*.db-wal +data/tracked/*.db-shm +data/tracked/*.db-journal +data/targets/*.json +data/metadata/video_inventory.csv +data/logs/*.log + # Generated figures (reproducible from scripts) figures/*.png diff --git a/README.md b/README.md index bf88c6f..9d9ff17 100644 --- a/README.md +++ b/README.md @@ -46,6 +46,32 @@ The key insight: not all "trained" flies may have actually learned. The trained
The trained **Read `docs/bimodal_hypothesis.md` for the detailed analysis plan and code sketches.** +## Offline Tracking Pipeline (added Apr 2026) + +For tracking new videos that have **no auto-detectable targets**, the pipeline +is split in two stages so you can sit at the screen and click for an hour, then +let the tracker grind through overnight. + +```bash +# extra deps (ethoscope src must be at /home/gg/Code/ethoscope_project/...) +pip install -r requirements-tracking.txt + +# 1) build the inventory (xlsx ↔ /mnt/ethoscope_data/videos/) +python scripts/build_video_inventory.py + +# 2) interactive: click TOP, CORNER, LEFT on each video (one frame per video) +python scripts/pick_targets.py # process all not-yet-picked +python scripts/pick_targets.py --redo # re-pick already-picked videos +# keys: r=reset n=skip f=jump frame q/ESC=quit ENTER=save + +# 3) batch tracking (idempotent, can run in background) +python scripts/track_videos.py --jobs 4 # parallel +# output → data/tracked/*_tracking.db (SQLite, same schema as data/raw/) +``` + +See `tasks/todo.md` "Offline Tracking" section for the full plan, and +`data/metadata/video_inventory.csv` for the list of videos to process. + ## Folder Structure ``` diff --git a/requirements-tracking.txt b/requirements-tracking.txt new file mode 100644 index 0000000..b52aad2 --- /dev/null +++ b/requirements-tracking.txt @@ -0,0 +1,11 @@ +# Extra dependencies needed only for the offline-tracking pipeline +# (build_video_inventory.py, pick_targets.py, auto_detect_targets.py, +# track_videos.py). Not needed for the existing analysis notebooks. 
+# +# install with: pip install -r requirements-tracking.txt +opencv-python>=4.8 +openpyxl>=3.1 +gitpython>=3.1 +netifaces>=0.11 +mysql-connector-python>=8.0 +pyserial>=3.5 diff --git a/scripts/auto_detect_targets.py b/scripts/auto_detect_targets.py new file mode 100644 index 0000000..077ac41 --- /dev/null +++ b/scripts/auto_detect_targets.py @@ -0,0 +1,119 @@ +"""Try auto-detection of L-shape targets on each video and save JSON sidecars. + +Useful for: +- videos that DO have visible black-circle targets (saves manual clicks); +- as a smoke test of the whole pipeline before running the picker. + +Failure is non-fatal: a video that fails auto-detection is reported as `fail` +and gets no JSON sidecar, leaving it for the manual `pick_targets.py` tool. + +Output JSON has the same shape as the manual picker's so `track_videos.py` +can consume either. +""" + +from __future__ import annotations + +import argparse +import datetime as dt +import json +import logging +import sys +from pathlib import Path + +import cv2 +import numpy as np +import pandas as pd + +# ethoscope source tree +sys.path.insert(0, "/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope") + +from config import INVENTORY_CSV, TARGETS_DIR # noqa: E402 + +from ethoscope.roi_builders.target_roi_builder import TargetGridROIBuilder # noqa: E402 + + +def detect_one(video_path: Path, frame_idx: int) -> tuple[list[list[int]], int] | None: + """Run ethoscope target detection on one frame; return (points, frame_idx) or None.""" + cap = cv2.VideoCapture(str(video_path)) + if not cap.isOpened(): + return None + n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) + if n > 0 and frame_idx >= n: + frame_idx = max(0, n - 1) + cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx) + ok, frame = cap.read() + cap.release() + if not ok or frame is None: + return None + + # The detector expects a single-channel image (grey) like ethoscope cameras produce.
+ if frame.ndim == 3: + gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) + else: + gray = frame + + # We don't actually need a fully-configured grid here — _find_target_coordinates + # alone gives us the 3 reference points. + builder = TargetGridROIBuilder(n_rows=2, n_cols=3) + try: + ref = builder._find_target_coordinates(gray) + except Exception as e: + logging.debug(f"detection failed for {video_path.name}: {e}") + return None + if ref is None: + return None + return [[int(p[0]), int(p[1])] for p in ref], frame_idx + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--frame", type=int, default=125) + parser.add_argument("--limit", type=int, default=None) + parser.add_argument("--video", type=str, default=None, + help="run on a single video path (skips inventory)") + parser.add_argument("--overwrite", action="store_true", + help="overwrite existing JSON sidecars") + args = parser.parse_args() + + TARGETS_DIR.mkdir(parents=True, exist_ok=True) + + if args.video: + videos = [Path(args.video)] + else: + if not INVENTORY_CSV.exists(): + sys.exit("Inventory missing — run build_video_inventory.py first.") + inv = pd.read_csv(INVENTORY_CSV) + todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]] + videos = [Path(p) for p in todo["mp4_path"].tolist()] + if args.limit: + videos = videos[: args.limit] + + n_ok = n_fail = n_skip = 0 + for v in videos: + out = TARGETS_DIR / f"{v.stem}.json" + if out.exists() and not args.overwrite: + n_skip += 1 + continue + result = detect_one(v, args.frame) + if result is None: + n_fail += 1 + print(f" fail: {v.name}") + continue + points, used_frame = result + out.write_text(json.dumps({ + "video_path": str(v), + "frame_index": int(used_frame), + "reference_points": points, + "order": ["top", "corner", "left"], + "picked_at": dt.datetime.now().isoformat(timespec="seconds"), + "method": "auto", + }, indent=2)) + n_ok += 1 + print(f" ok: {v.name} → {points}") + + print(f"\nDone. 
ok={n_ok} fail={n_fail} skipped(existing)={n_skip}") + + +if __name__ == "__main__": + logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s") + main() diff --git a/scripts/build_video_inventory.py b/scripts/build_video_inventory.py new file mode 100644 index 0000000..3c083e7 --- /dev/null +++ b/scripts/build_video_inventory.py @@ -0,0 +1,150 @@ +"""Build an inventory of videos available on disk and join with the metadata xlsx. + +Scans /mnt/ethoscope_data/videos////*.mp4 +and produces a CSV mapping each (date, machine_name) row in +all_video_info_merged.xlsx to the corresponding merged.mp4 path on disk. + +Output: data/metadata/video_inventory.csv with columns: + machine_uuid, machine_name, session_date, session_time, mp4_path, + in_xlsx (bool), already_tracked (bool) +""" + +from __future__ import annotations + +import re +from pathlib import Path + +import pandas as pd + +from config import DATA_RAW, INVENTORY_CSV, VIDEO_INFO_XLSX, VIDEOS_ROOT + +SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$") + + +def scan_videos(videos_root: Path) -> pd.DataFrame: + """Walk videos_root and return one row per merged.mp4 found. + + Args: + videos_root: Root directory containing ///. + + Returns: + DataFrame with columns: machine_uuid, machine_name, session_date, + session_time, session_datetime, mp4_path. 
+ """ + rows = [] + for uuid_dir in sorted(videos_root.iterdir()): + if not uuid_dir.is_dir(): + continue + for machine_dir in uuid_dir.iterdir(): + if not machine_dir.is_dir() or not machine_dir.name.startswith("ETHOSCOPE_"): + continue + for session_dir in machine_dir.iterdir(): + if not session_dir.is_dir(): + continue + m = SESSION_RE.match(session_dir.name) + if not m: + continue + date_str, time_str = m.group(1), m.group(2) + # Prefer *_merged.mp4 if present + merged = sorted(session_dir.glob("*_merged.mp4")) + if not merged: + merged = sorted(session_dir.glob("*.mp4")) + if not merged: + continue + rows.append( + { + "machine_uuid": uuid_dir.name, + "machine_name": machine_dir.name, + "session_date": date_str, + "session_time": time_str, + "session_datetime": f"{date_str}_{time_str}", + "mp4_path": str(merged[0]), + } + ) + return pd.DataFrame(rows) + + +def already_tracked_set(data_raw: Path) -> set[tuple[str, str]]: + """Return the set of (date, time) sessions for which a tracking DB exists. 
+ + DBs are named like: + 2025-07-15_16-03-10___1920x1088@25fps-28q_merged_tracking.db + """ + out = set() + for db in data_raw.glob("*_tracking.db"): + m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name) + if m: + out.add((m.group(1), m.group(2))) + return out + + +def main() -> None: + print(f"Scanning {VIDEOS_ROOT} ...") + videos_df = scan_videos(VIDEOS_ROOT) + print(f" found {len(videos_df)} video sessions on disk") + + print(f"Loading metadata xlsx: {VIDEO_INFO_XLSX}") + meta = pd.read_excel(VIDEO_INFO_XLSX) + meta["session_date"] = meta["date"].dt.strftime("%Y-%m-%d") + + # The xlsx has one row per (date, machine, ROI) — collapse to unique sessions + meta_sessions = ( + meta[["session_date", "machine_name"]].drop_duplicates().reset_index(drop=True) + ) + print(f" xlsx contains {len(meta_sessions)} unique (date, machine) sessions") + + # Mark which video sessions are referenced by the xlsx + xlsx_keys = set(zip(meta_sessions["session_date"], meta_sessions["machine_name"])) + videos_df["in_xlsx"] = videos_df.apply( + lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1 + ) + + # Mark which already have tracking DBs in data/raw/ + tracked = already_tracked_set(DATA_RAW) + videos_df["already_tracked"] = videos_df.apply( + lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1 + ) + + INVENTORY_CSV.parent.mkdir(parents=True, exist_ok=True) + videos_df.sort_values(["session_date", "machine_name", "session_time"]).to_csv( + INVENTORY_CSV, index=False + ) + + # Coverage report + in_xlsx = videos_df["in_xlsx"] + needed = videos_df[in_xlsx & ~videos_df["already_tracked"]] + n_xlsx_sessions = len(meta_sessions) + n_with_video = videos_df[in_xlsx].drop_duplicates( + ["session_date", "machine_name"] + ).shape[0] + + # xlsx sessions that have no video on disk + found_keys = set( + zip( + videos_df.loc[in_xlsx, "session_date"], + videos_df.loc[in_xlsx, "machine_name"], + ) + ) + missing = sorted(xlsx_keys - found_keys) + + 
print() + print("=" * 70) + print(f"Wrote inventory: {INVENTORY_CSV}") + print(f" total video sessions on disk: {len(videos_df)}") + print(f" xlsx unique sessions: {n_xlsx_sessions}") + print(f" xlsx sessions with video: {n_with_video}") + print(f" xlsx sessions missing video: {len(missing)}") + print(f" already tracked (DB exists): {videos_df['already_tracked'].sum()}") + print(f" TO TRACK (in_xlsx & ~tracked, video instances): {len(needed)}") + + if missing: + print() + print("xlsx sessions with NO matching video on disk:") + for d, m in missing[:20]: + print(f" {d} {m}") + if len(missing) > 20: + print(f" ... and {len(missing) - 20} more") + + +if __name__ == "__main__": + main() diff --git a/scripts/config.py b/scripts/config.py index 0593c7e..a3462b2 100644 --- a/scripts/config.py +++ b/scripts/config.py @@ -7,3 +7,11 @@ DATA_RAW = PROJECT_ROOT / "data" / "raw" DATA_METADATA = PROJECT_ROOT / "data" / "metadata" DATA_PROCESSED = PROJECT_ROOT / "data" / "processed" FIGURES = PROJECT_ROOT / "figures" + +# Offline-tracking pipeline paths +VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos") +VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx" +INVENTORY_CSV = DATA_METADATA / "video_inventory.csv" +TARGETS_DIR = PROJECT_ROOT / "data" / "targets" +TRACKING_OUTPUT_DIR = PROJECT_ROOT / "data" / "tracked" +LOGS_DIR = PROJECT_ROOT / "data" / "logs" diff --git a/scripts/monitor_tracking.py b/scripts/monitor_tracking.py new file mode 100644 index 0000000..9ffa891 --- /dev/null +++ b/scripts/monitor_tracking.py @@ -0,0 +1,155 @@ +"""Live progress + ETA for the offline tracker batch. + +Counts ground-truth (DBs on disk) rather than parsing log lines, so it works +whether the batch is running fresh or was resumed after a crash. Errors are +parsed out of any *.log files in data/logs/. 
+ +Usage: + python monitor_tracking.py # one snapshot, exit + python monitor_tracking.py --watch # refresh every 10 s + python monitor_tracking.py --watch 30 # refresh every 30 s +""" + +from __future__ import annotations + +import argparse +import json +import re +import time +from datetime import datetime, timedelta +from pathlib import Path + +from config import LOGS_DIR, TARGETS_DIR, TRACKING_OUTPUT_DIR + + +def count_target_jsons() -> tuple[int, int, list[str]]: + """Return (n_pickable, n_unusable, unusable_video_stems).""" + pickable = 0 + unusable_stems: list[str] = [] + for j in TARGETS_DIR.glob("*.json"): + try: + d = json.loads(j.read_text()) + except Exception: + continue + if d.get("unusable"): + unusable_stems.append(j.stem) + elif d.get("reference_points"): + pickable += 1 + return pickable, len(unusable_stems), unusable_stems + + +def count_tracked_dbs() -> tuple[int, datetime | None, str | None]: + """Return (n_dbs, mtime_of_newest, name_of_newest).""" + dbs = list(TRACKING_OUTPUT_DIR.glob("*_tracking.db")) + if not dbs: + return 0, None, None + newest = max(dbs, key=lambda p: p.stat().st_mtime) + return len(dbs), datetime.fromtimestamp(newest.stat().st_mtime), newest.stem + + +def parse_recent_errors(log_dir: Path, tail_lines: int = 5000) -> list[str]: + """Scan the most recent *.log file for lines reporting errors.""" + if not log_dir.exists(): + return [] + logs = sorted(log_dir.glob("*.log"), key=lambda p: p.stat().st_mtime) + if not logs: + return [] + latest = logs[-1] + try: + with latest.open() as f: + tail = f.readlines()[-tail_lines:] + except Exception: + return [] + out = [] + for line in tail: + if re.search(r":\s*error\b", line) or " error: " in line.lower(): + out.append(line.rstrip()) + return out + + +def db_completion_history() -> list[float]: + """Return mtimes of all tracking DBs, sorted ascending. 
Used for rate.""" + return sorted(p.stat().st_mtime for p in TRACKING_OUTPUT_DIR.glob("*_tracking.db")) + + +def fmt_duration(seconds: float) -> str: + if seconds < 60: + return f"{int(seconds)} s" + if seconds < 3600: + return f"{int(seconds // 60)} min" + h = int(seconds // 3600) + m = int((seconds % 3600) // 60) + return f"{h} h {m} min" + + +def snapshot() -> str: + pickable, unusable, _ = count_target_jsons() + tracked, last_mtime, last_name = count_tracked_dbs() + history = db_completion_history() + errors = parse_recent_errors(LOGS_DIR) + + lines = [f"tracking progress @ {datetime.now():%Y-%m-%d %H:%M:%S}"] + lines.append(f" pickable JSONs: {pickable}") + lines.append(f" unusable JSONs: {unusable} (skipped by tracker)") + pct = (tracked / pickable * 100) if pickable else 0 + lines.append( + f" DBs on disk: {tracked} / {pickable} ({pct:.0f}%)" + ) + lines.append(f" errors in log: {len(errors)}") + + # Rate from the last 10 completions, when available. + if len(history) >= 2: + window = history[-min(10, len(history)) :] + span = window[-1] - window[0] + if span > 0: + rate_per_hour = (len(window) - 1) / span * 3600 + lines.append(f" rate (last {len(window) - 1}): {rate_per_hour:.1f} videos/hour") + remaining = max(0, pickable - tracked) + if rate_per_hour > 0 and remaining > 0: + eta_sec = remaining * 3600 / rate_per_hour + eta_at = datetime.now() + timedelta(seconds=eta_sec) + lines.append( + f" ETA remaining: {fmt_duration(eta_sec)} " + f"(done by {eta_at:%H:%M %a})" + ) + + if last_mtime is not None and last_name is not None: + ago = (datetime.now() - last_mtime).total_seconds() + lines.append( + f" most recent DB: {last_name[:60]}... 
({fmt_duration(ago)} ago)" + ) + + if errors: + lines.append("") + lines.append(f" recent errors ({min(5, len(errors))} of {len(errors)}):") + for e in errors[-5:]: + lines.append(f" {e[:120]}") + + return "\n".join(lines) + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--watch", nargs="?", type=int, const=10, default=None, + help="refresh every N seconds (default 10 if flag given without value)", + ) + args = parser.parse_args() + + if args.watch is None: + print(snapshot()) + return + + try: + while True: + # Clear screen and reprint + print("\033[2J\033[H", end="") + print(snapshot()) + print(f"\n(refreshing every {args.watch}s — Ctrl-C to exit)") + time.sleep(args.watch) + except KeyboardInterrupt: + print() + + +if __name__ == "__main__": + main() diff --git a/scripts/pick_targets.py b/scripts/pick_targets.py new file mode 100644 index 0000000..a5eea07 --- /dev/null +++ b/scripts/pick_targets.py @@ -0,0 +1,467 @@ +"""Interactive target picker for offline tracking (matplotlib/Tk GUI). + +Loops through videos that need tracking and lets the user click 3 reference +points per video in L-shape order: + + 1) TOP target (above the corner) + 2) CORNER target (the right-angle vertex) + 3) LEFT target (to the left of the corner) + +These three points are the same reference layout used by ethoscope's +`TargetGridROIBuilder`: dst_points = [(0, -1), (0, 0), (-1, 0)] in unit +coordinates. Saving them as a JSON sidecar lets the offline tracker build the +6-ROI HD mating arena grid without needing auto-target detection. 
+ +Output JSON sidecar: data/targets/.json + { + "video_path": "/mnt/.../*.mp4", + "frame_index": , + "reference_points": [[x0, y0], [x1, y1], [x2, y2]], + "order": ["top", "corner", "left"], + "picked_at": "" + } + +Keys (in the picker window): + LEFT-CLICK add a point (top → corner → left) + r reset clicks for current video + d skip this video for THIS run only (no JSON written) + u mark this video unusable (FOV wrong etc.); skipped forever + . / , advance / rewind by 25 frames (≈ 1 s @ 25 fps) + ] / [ advance / rewind by 5% of the video (~3 min in a 1 h video) + # jump to the middle of the video + enter save the 3 points and move on + q / ESC quit picker + +After the 3rd click, the 6 ROI rectangles are drawn over the frame so you +can sanity-check the geometry before pressing ENTER. + +With --redo, if a JSON sidecar exists, its points are pre-loaded so you can +nudge them rather than restart from scratch. + +Why matplotlib instead of cv2.imshow: + OpenCV's bundled GUI uses Qt, which needs XKeyboard + a fonts directory and + is fragile over SSH X11-forwarding. matplotlib's TkAgg backend uses pure + Tk/X11 and works out of the box on any DISPLAY (and gives free pan/zoom + via the toolbar — useful for clicking small targets precisely). +""" + +from __future__ import annotations + +import argparse +import datetime as dt +import json +import os +import sys +from pathlib import Path + +# Force TkAgg BEFORE importing matplotlib. We override even if MPLBACKEND is +# already set, because the script is unusable with a non-interactive backend. +os.environ["MPLBACKEND"] = "TkAgg" + +import cv2 # noqa: E402 +import matplotlib # noqa: E402 +import matplotlib.pyplot as plt # noqa: E402 +import numpy as np # noqa: E402 +import pandas as pd # noqa: E402 + +# matplotlib.backend_bases exposes the cursor identifiers under different +# names depending on version: `Cursors` enum on 3.5+, lowercase `cursors` +# instance on older releases. Both have the same integer attributes. 
+try: + from matplotlib.backend_bases import Cursors as _Cursors # 3.5+ +except ImportError: + try: + from matplotlib.backend_bases import cursors as _Cursors # older + except ImportError: + _Cursors = None + +# Verify we ended up on an interactive backend; bail loud (with a concrete +# explanation) if not. matplotlib silently falls back to 'agg' when its +# requested backend can't load, which is hard to debug without help. +_backend = matplotlib.get_backend() +if _backend.lower() in ("agg", "headless", "template", "pdf", "svg", "ps"): + diag = [] + try: + import tkinter as _tk + try: + _tk.Tk().destroy() + diag.append("tkinter import + Tk() instantiation: OK") + except Exception as e: + diag.append(f"tkinter imported but Tk() failed: {e!r}") + except Exception as e: + diag.append(f"tkinter import FAILED: {e!r}") + diag.append(" → on Manjaro/Arch, run: sudo pacman -S tk") + print( + f"ERROR: matplotlib loaded the non-interactive backend {_backend!r}.\n" + f" Expected 'TkAgg'. Diagnostic info:\n" + f" DISPLAY = {os.environ.get('DISPLAY')!r}\n" + f" MPLBACKEND = {os.environ.get('MPLBACKEND')!r}\n" + f" matplotlib ver = {matplotlib.__version__}\n" + + "\n".join(f" {d}" for d in diag), + file=sys.stderr, + ) + sys.exit(2) + +from config import INVENTORY_CSV, TARGETS_DIR # noqa: E402 +from tracking_geometry import compute_roi_polygons # noqa: E402 + +# Strip default matplotlib keybindings that would conflict with ours. +for k in ("keymap.home", "keymap.save", "keymap.quit", "keymap.fullscreen", + "keymap.pan", "keymap.zoom", "keymap.back", "keymap.forward"): + try: + plt.rcParams[k] = [] + except KeyError: + pass + +CLICK_LABELS = ("TOP", "CORNER", "LEFT") +CLICK_COLORS = ("red", "lime", "deepskyblue") + + +def grab_frame( + video_path: Path, frame_idx: int +) -> tuple[np.ndarray, int, int] | None: + """Return (RGB frame, actual_frame_idx, n_frames) from the video, or None. + + Clamps frame_idx to [0, n_frames-1] so callers can step blindly. 
+ """ + cap = cv2.VideoCapture(str(video_path)) + if not cap.isOpened(): + return None + n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) + if n > 0: + frame_idx = max(0, min(frame_idx, n - 1)) + cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx) + ok, frame = cap.read() + cap.release() + if not ok or frame is None: + return None + return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), frame_idx, n + + +def pick_one( + video_path: Path, + frame_idx: int, + status_prefix: str, + initial_points: list[tuple[float, float]] | None = None, +) -> dict | None: + """Show the picker UI for a single video; return the result dict or None.""" + grabbed = grab_frame(video_path, frame_idx) + if grabbed is None: + print(f" ! cannot read {video_path}") + return None + frame, frame_idx, n_frames = grabbed + # Big-step size for ] / [ : 5% of total length, ~3 min in a 1h video. + big_step = max(1, int(round(0.05 * n_frames))) if n_frames > 0 else 250 + + fig, ax = plt.subplots(figsize=(14, 8)) + try: + fig.canvas.manager.set_window_title("pick targets") + except Exception: + pass + # Use a crosshair cursor over the axes so it's obvious where the click + # will land. matplotlib's toolbar resets the cursor to POINTER (arrow) on + # every mouse-move when no tool is active, so we intercept set_cursor: + # whenever it asks for POINTER, we substitute SELECT_REGION (crosshair). + # Tool modes (zoom/pan) keep their native cursors. + if _Cursors is not None: + _orig_set_cursor = fig.canvas.set_cursor + + def _set_cursor_with_crosshair(cursor): + if cursor == _Cursors.POINTER: + cursor = _Cursors.SELECT_REGION + return _orig_set_cursor(cursor) + + fig.canvas.set_cursor = _set_cursor_with_crosshair + try: + fig.canvas.set_cursor(_Cursors.SELECT_REGION) + except Exception: + pass + else: + # Last-ditch: just set the Tk widget's cursor once and hope the + # toolbar doesn't immediately overwrite it. 
+ try: + fig.canvas.get_tk_widget().config(cursor="tcross") + except Exception: + pass + img_artist = ax.imshow(frame) + ax.set_axis_off() + fig.tight_layout() + + state = { + "points": list(initial_points) if initial_points else [], + "action": None, # 'save' | 'skip' | 'quit' | 'unusable' + "frame": frame, + "frame_idx": frame_idx, + "drawn": [], # artists drawn on top of the image + } + + def update_title(): + nb = len(state["points"]) + nxt = ( + f"click {CLICK_LABELS[nb]}" + if nb < 3 + else "ENTER=save | r=reset d=skip u=unusable q=quit | . , [ ] # = step frame" + ) + ax.set_title( + f'{status_prefix} frame {state["frame_idx"]} | {nxt}', + fontsize=10, + ) + + def redraw_points(): + for a in state["drawn"]: + try: + a.remove() + except Exception: + pass + state["drawn"].clear() + for i, (x, y) in enumerate(state["points"]): + color = CLICK_COLORS[i] + label = CLICK_LABELS[i] + (cross,) = ax.plot(x, y, marker="+", color=color, markersize=22, mew=2) + (ring,) = ax.plot( + x, y, marker="o", color=color, markersize=22, + fillstyle="none", mew=2, + ) + txt = ax.text( + x + 14, y - 14, label, + color=color, fontsize=10, weight="bold", + ) + state["drawn"].extend([cross, ring, txt]) + if len(state["points"]) >= 2: + (line1,) = ax.plot( + [state["points"][0][0], state["points"][1][0]], + [state["points"][0][1], state["points"][1][1]], + color="white", linewidth=0.7, alpha=0.6, + ) + state["drawn"].append(line1) + if len(state["points"]) == 3: + (line2,) = ax.plot( + [state["points"][1][0], state["points"][2][0]], + [state["points"][1][1], state["points"][2][1]], + color="white", linewidth=0.7, alpha=0.6, + ) + state["drawn"].append(line2) + # ROI overlay — draw the 6 computed rectangles on top of the frame + try: + polys = compute_roi_polygons(state["points"]) + except Exception as e: + polys = [] + print(f" (ROI preview failed: {e})") + for j, poly in enumerate(polys): + # Close the polygon by repeating the first point + xs = list(poly[:, 0]) + [poly[0, 0]] + ys = 
list(poly[:, 1]) + [poly[0, 1]] + (line,) = ax.plot( + xs, ys, color="yellow", linewidth=1.5, alpha=0.9, + ) + state["drawn"].append(line) + cx = float(np.mean(poly[:, 0])) + cy = float(np.mean(poly[:, 1])) + lbl = ax.text( + cx, cy, str(j + 1), + color="yellow", fontsize=14, weight="bold", + ha="center", va="center", + ) + state["drawn"].append(lbl) + update_title() + fig.canvas.draw_idle() + + def reload_frame(new_idx: int): + grabbed = grab_frame(video_path, new_idx) + if grabbed is None: + return + new_frame, new_idx, _ = grabbed + state["frame"] = new_frame + state["frame_idx"] = new_idx + img_artist.set_data(new_frame) + # Keep clicked targets + ROI overlay in place across frame-stepping — + # press 'r' to clear them explicitly. + redraw_points() + + def on_click(event): + if event.inaxes is not ax: + return + if event.button != 1: # left click only + return + if event.xdata is None or event.ydata is None: + return + # Skip clicks fired while the toolbar's pan/zoom is active. + toolbar = getattr(fig.canvas, "toolbar", None) + if toolbar is not None and getattr(toolbar, "mode", ""): + return + x, y = float(event.xdata), float(event.ydata) + if len(state["points"]) < 3: + state["points"].append((x, y)) + else: + # 3 points already there — replace the nearest one. Lets the user + # nudge pre-loaded targets in --redo mode, or correct a bad click. 
+ dists = [(x - px) ** 2 + (y - py) ** 2 for px, py in state["points"]] + i_nearest = min(range(3), key=dists.__getitem__) + state["points"][i_nearest] = (x, y) + redraw_points() + + def on_key(event): + k = event.key or "" + if k in ("escape", "q"): + state["action"] = "quit" + plt.close(fig) + elif k == "r": + state["points"].clear() + redraw_points() + elif k == "d": + state["action"] = "skip" + plt.close(fig) + elif k == "u": + state["action"] = "unusable" + plt.close(fig) + elif k == "enter": + if len(state["points"]) == 3: + state["action"] = "save" + plt.close(fig) + elif k == ".": + reload_frame(state["frame_idx"] + 25) + elif k == ",": + reload_frame(state["frame_idx"] - 25) + elif k == "]": + reload_frame(state["frame_idx"] + big_step) + elif k == "[": + reload_frame(state["frame_idx"] - big_step) + elif k == "#": + if n_frames > 0: + reload_frame(n_frames // 2) + + fig.canvas.mpl_connect("button_press_event", on_click) + fig.canvas.mpl_connect("key_press_event", on_key) + update_title() + plt.show() # blocks until the figure is closed + + if state["action"] == "save": + return { + "action": "save", + "frame_idx": state["frame_idx"], + "points": state["points"], + } + if state["action"] == "unusable": + return {"action": "unusable", "frame_idx": state["frame_idx"]} + if state["action"] in ("skip", "quit"): + return {"action": state["action"]} + # Window closed via the WM "X" button — treat as quit so the loop stops + return {"action": "quit"} + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--redo", action="store_true", + help="re-pick videos that already have JSON sidecars", + ) + parser.add_argument( + "--frame", type=int, default=125, + help="default frame index to display (default 125 ≈ 5 s @ 25 fps)", + ) + parser.add_argument( + "--limit", type=int, default=None, + help="only process the first N videos", + ) + args = parser.parse_args() + + if not INVENTORY_CSV.exists(): + sys.exit( + 
f"Inventory not found at {INVENTORY_CSV}. " + "Run build_video_inventory.py first." + ) + + inv = pd.read_csv(INVENTORY_CSV) + todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]].copy() + todo = todo.sort_values( + ["session_date", "machine_name", "session_time"] + ).reset_index(drop=True) + + TARGETS_DIR.mkdir(parents=True, exist_ok=True) + + def sidecar_for(mp4_path: str) -> Path: + return TARGETS_DIR / (Path(mp4_path).stem + ".json") + + if not args.redo: + todo = todo[ + ~todo["mp4_path"].apply(lambda p: sidecar_for(p).exists()) + ].reset_index(drop=True) + + if args.limit: + todo = todo.head(args.limit) + + n = len(todo) + if n == 0: + print("Nothing to pick. All eligible videos already have target JSONs.") + return + + print( + f"Picking targets for {n} videos. " + "Window keys: ENTER=save r=reset d=skip u=unusable q=quit " + ".,[]=step frame | pan/zoom via toolbar" + ) + saved = skipped = unusable = 0 + for i, row in todo.iterrows(): + mp4 = Path(row["mp4_path"]) + prefix = f"[{i + 1}/{n}] {row['machine_name']} {row['session_datetime']}" + print(f"\n{prefix}") + + # If --redo and a JSON sidecar exists, pre-load its points (only for + # regular saves — unusable sidecars are left as-is and shown empty). + initial_points = None + existing = sidecar_for(row["mp4_path"]) + if args.redo and existing.exists(): + try: + prev = json.loads(existing.read_text()) + if not prev.get("unusable") and prev.get("reference_points"): + initial_points = [tuple(p) for p in prev["reference_points"]] + print(f" pre-loaded {len(initial_points)} previous point(s)") + except Exception as e: + print(f" ! 
could not read previous sidecar: {e}") + + result = pick_one(mp4, args.frame, prefix, initial_points=initial_points) + if result is None or result.get("action") == "quit": + print(" quitting picker.") + break + if result["action"] == "skip": + skipped += 1 + print(" skipped (no JSON written, will be re-asked next run).") + continue + if result["action"] == "unusable": + try: + reason = input(" reason for marking unusable (Enter to skip): ").strip() + except EOFError: + reason = "" + payload = { + "video_path": str(mp4), + "unusable": True, + "reason": reason, + "marked_at": dt.datetime.now().isoformat(timespec="seconds"), + } + out_path = sidecar_for(row["mp4_path"]) + out_path.write_text(json.dumps(payload, indent=2)) + unusable += 1 + print(f" marked unusable → {out_path.name}") + continue + if result["action"] == "save": + payload = { + "video_path": str(mp4), + "frame_index": int(result["frame_idx"]), + "reference_points": [list(map(int, p)) for p in result["points"]], + "order": ["top", "corner", "left"], + "picked_at": dt.datetime.now().isoformat(timespec="seconds"), + } + out_path = sidecar_for(row["mp4_path"]) + out_path.write_text(json.dumps(payload, indent=2)) + saved += 1 + print(f" saved → {out_path.name}") + + remaining = n - saved - skipped - unusable + print( + f"\nDone. saved={saved} unusable={unusable} " + f"skipped(this run)={skipped} remaining={remaining}" + ) + + +if __name__ == "__main__": + main() diff --git a/scripts/track_videos.py b/scripts/track_videos.py new file mode 100644 index 0000000..d9bd197 --- /dev/null +++ b/scripts/track_videos.py @@ -0,0 +1,218 @@ +"""Headless offline tracker. + +Reads target JSONs produced by `pick_targets.py`, builds the 6 ROIs of the +HD mating arena from the L-shape reference points, runs ethoscope's +`MultiFlyTracker` against the merged.mp4 file via `MovieVirtualCamera`, and +writes a SQLite DB to `data/tracked/_tracking.db`. + +Idempotent: skips videos whose tracking DB already exists (unless --redo). 
+ +Usage: + python track_videos.py # process all videos with target JSON + python track_videos.py --redo # re-track even if DB exists + python track_videos.py --jobs 4 # run up to 4 videos in parallel + python track_videos.py --max-duration 1800 # cap each video at 30 min (sec) +""" + +from __future__ import annotations + +import argparse +import json +import logging +import os +import sys +import traceback +from concurrent.futures import ProcessPoolExecutor, as_completed +from pathlib import Path + +import numpy as np + +# Import ethoscope from the local source tree (no pip install). +ETHOSCOPE_SRC = Path("/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope") +sys.path.insert(0, str(ETHOSCOPE_SRC)) + +from config import TARGETS_DIR, TRACKING_OUTPUT_DIR # noqa: E402 +from tracking_geometry import HD_FG_DATA, compute_roi_polygons # noqa: E402 + + +def build_rois_from_targets(reference_points): + """Wrap the shared geometry into ethoscope `ROI` objects.""" + from ethoscope.core.roi import ROI + + polys = compute_roi_polygons(reference_points) + return [ROI(poly.reshape((1, 4, 2)), idx=i + 1) for i, poly in enumerate(polys)] + + +def track_one(json_path: Path, output_dir: Path, max_duration: float | None, + redo: bool) -> tuple[str, str]: + """Track a single video. Returns (status, message). Run in subprocess. + + Statuses: "ok", "skip", "error". + """ + # Re-import inside subprocess so each worker has its own ethoscope state. + import sys as _sys + _sys.path.insert(0, str(ETHOSCOPE_SRC)) + import cv2 + from ethoscope.core.monitor import Monitor + from ethoscope.hardware.input.cameras import MovieVirtualCamera + from ethoscope.io.sqlite import SQLiteResultWriter + from ethoscope.trackers.multi_fly_tracker import MultiFlyTracker + + class BGRMovieCamera(MovieVirtualCamera): + """MovieVirtualCamera variant that keeps BGR frames. 
+ + MultiFlyTracker calls cv2.cvtColor(img, COLOR_BGR2GRAY) without checking + whether img is already grayscale, so we must feed it 3-channel input. + """ + def _next_image(self): + ret, frame = self.capture.read() + if not ret or frame is None: + return None + return frame # BGR, untouched + + payload = json.loads(json_path.read_text()) + if payload.get("unusable"): + reason = payload.get("reason") or "no reason given" + return "skip", f"marked unusable: {reason}" + video_path = Path(payload["video_path"]) + if not video_path.exists(): + return "error", f"video missing: {video_path}" + + out_db = output_dir / f"{video_path.stem}_tracking.db" + if out_db.exists() and not redo: + return "skip", f"DB exists: {out_db.name}" + if out_db.exists(): + out_db.unlink() + + rois = build_rois_from_targets(payload["reference_points"]) + + cam_kwargs = {"use_wall_clock": False} + if max_duration is not None: + cam_kwargs["max_duration"] = max_duration + cam = BGRMovieCamera(str(video_path), **cam_kwargs) + + metadata = { + "machine_id": payload.get("machine_uuid", "unknown"), + "machine_name": payload.get("machine_name", "unknown"), + "date_time": int(payload.get("session_epoch", 0)), + "frame_width": cam.width, + "frame_height": cam.height, + "version": "offline-tracker-1", + "experimental_info": "{}", + "selected_options": json.dumps({ + "tracker": "MultiFlyTracker", + "template": "HD_Mating_Arena_6_ROIS", + "fg_data": HD_FG_DATA, + "maxN": 2, + }), + "hardware_info": "{}", + "reference_points": str([list(map(int, p)) for p in payload["reference_points"]]), + "backup_filename": out_db.name, + "result_writer_type": "SQLite3", + "sqlite_source_path": str(out_db), + } + + tracker_data = { + "maxN": 2, + "visualise": False, + "fg_data": HD_FG_DATA, + "adaptive_threshold": True, + "min_fg_threshold": 10, + "max_fg_threshold": 50, + } + + db_credentials = {"name": str(out_db)} + rw = SQLiteResultWriter( + db_credentials, rois, metadata=metadata, + make_dam_like_table=False, 
take_frame_shots=False, erase_old_db=True, + ) + + monit = Monitor( + cam, MultiFlyTracker, rois, + reference_points=payload["reference_points"], + data=tracker_data, + ) + + try: + with rw as result_writer: + monit.run(result_writer=result_writer, drawer=None, verbose=False) + except Exception: + return "error", traceback.format_exc(limit=5) + finally: + try: + cam._close() + except Exception: + pass + + if not out_db.exists(): + return "error", "tracking finished but DB was not created" + return "ok", str(out_db) + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--redo", action="store_true", help="re-track even if DB exists") + parser.add_argument("--jobs", type=int, default=1, help="parallel workers") + parser.add_argument( + "--max-duration", type=float, default=None, + help="cap each video at this many seconds (default: full video)", + ) + parser.add_argument("--limit", type=int, default=None, help="process only first N") + parser.add_argument("--video", type=str, default=None, + help="track a single video (mp4 path); requires its target JSON") + args = parser.parse_args() + + TRACKING_OUTPUT_DIR.mkdir(parents=True, exist_ok=True) + + if args.video: + stem = Path(args.video).stem + json_path = TARGETS_DIR / f"{stem}.json" + if not json_path.exists(): + sys.exit(f"No target JSON for {args.video}: expected {json_path}") + jsons = [json_path] + else: + jsons = sorted(TARGETS_DIR.glob("*.json")) + + if args.limit: + jsons = jsons[: args.limit] + + if not jsons: + print("No target JSONs found. 
Run pick_targets.py first.") + return + + print(f"Tracking {len(jsons)} videos (jobs={args.jobs}, redo={args.redo}).") + n_ok = n_skip = n_err = 0 + + if args.jobs <= 1: + for jp in jsons: + print(f" → {jp.name}", flush=True) + status, msg = track_one(jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo) + print(f" {status}: {msg.splitlines()[-1] if msg else ''}", flush=True) + n_ok += status == "ok" + n_skip += status == "skip" + n_err += status == "error" + else: + with ProcessPoolExecutor(max_workers=args.jobs) as ex: + futs = { + ex.submit(track_one, jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo): jp + for jp in jsons + } + for fut in as_completed(futs): + jp = futs[fut] + try: + status, msg = fut.result() + except Exception as e: + status, msg = "error", f"future raised: {e}" + print(f" {jp.name}: {status} — {msg.splitlines()[-1] if msg else ''}", + flush=True) + n_ok += status == "ok" + n_skip += status == "skip" + n_err += status == "error" + + print(f"\nDone. ok={n_ok} skipped={n_skip} errors={n_err}") + sys.exit(0 if n_err == 0 else 1) + + +if __name__ == "__main__": + logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s") + main() diff --git a/scripts/tracking_geometry.py b/scripts/tracking_geometry.py new file mode 100644 index 0000000..1f98918 --- /dev/null +++ b/scripts/tracking_geometry.py @@ -0,0 +1,71 @@ +"""Shared HD-mating-arena ROI geometry, used by both pick_targets.py +(for live overlay) and track_videos.py (for actual tracking). + +Pure numpy + cv2; no ethoscope dependency. 
+""" + +from __future__ import annotations + +import itertools + +import cv2 +import numpy as np + +# Layout from +# ethoscope/.../roi_builders/roi_templates/builtin/HD_Mating_Arena_6_ROIS.json +HD_MATING_ARENA = { + "n_rows": 2, + "n_cols": 3, + "top_margin": -0.21, + "bottom_margin": -0.13, + "left_margin": 0.05, + "right_margin": 0.05, + "horizontal_fill": 0.85, + "vertical_fill": 1.3, +} + +HD_FG_DATA = { + "sample_size": 400, + "normal_limits": [800, 2000], + "tolerance": 0.8, +} + + +def compute_roi_polygons(reference_points, layout=HD_MATING_ARENA): + """Map 3 L-shape reference points to 6 ROI polygons, in the order ROI 1..6. + + Reference points must be ordered: + [TOP, CORNER, LEFT] + matching ethoscope's dst_points = [(0, -1), (0, 0), (-1, 0)]. + + Returns: + list[np.ndarray] # 6 arrays, each shape (4, 2), int32, in image coords + """ + ref = np.asarray(reference_points, dtype=np.float32) + if ref.shape != (3, 2): + raise ValueError(f"reference_points must be 3x2, got shape {ref.shape}") + + dst_points = np.array([(0, -1), (0, 0), (-1, 0)], dtype=np.float32) + wrap_mat = cv2.getAffineTransform(dst_points, ref) + + n_col = layout["n_cols"] + n_row = layout["n_rows"] + tm, bm = layout["top_margin"], layout["bottom_margin"] + lm, rm = layout["left_margin"], layout["right_margin"] + hf, vf = layout["horizontal_fill"], layout["vertical_fill"] + + y_positions = (np.arange(n_row) * 2.0 + 1) * (1 - tm - bm) / (2 * n_row) + tm + x_positions = (np.arange(n_col) * 2.0 + 1) * (1 - lm - rm) / (2 * n_col) + lm + centres = [np.array([x, y]) for x, y in itertools.product(x_positions, y_positions)] + sign_mat = np.array([[-1, -1], [+1, -1], [+1, +1], [-1, +1]]) + xy_size = np.array([hf / float(n_col), vf / float(n_row)]) / 2.0 + rectangles = [sign_mat * xy_size + c for c in centres] + + shift = np.dot(wrap_mat, [1, 1, 0]) - ref[1] + + polys = [] + for r in rectangles: + r3 = np.append(r, np.zeros((4, 1)), axis=1) + mapped = np.dot(wrap_mat, r3.T).T - shift + 
polys.append(mapped.astype(np.int32)) + return polys diff --git a/tasks/todo.md b/tasks/todo.md index f5e8b3f..f86bd65 100644 --- a/tasks/todo.md +++ b/tasks/todo.md @@ -51,6 +51,68 @@ See `docs/bimodal_hypothesis.md` for detailed methodology. - [ ] Consider converting pixel distances to physical units (need calibration) - [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating +## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27) + +### Recap + +Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). The DBs in +`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template +`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video). + +The metadata file `../all_video_info_merged.xlsx` indexes a different set of +experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines, +63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked +sessions are in this xlsx — these are fresh recordings to track.** + +Inventory: see `data/metadata/video_inventory.csv` (built by +`scripts/build_video_inventory.py`). +- 1163 video sessions on disk under `/mnt/ethoscope_data/videos/` +- 63/63 xlsx (date, machine) sessions have video on disk +- 129 video instances need tracking (some (date, machine) have 2-4 recordings/day) + +### Plan + +The HD-mating-arena videos have no auto-detectable targets — the user must +manually click 3 reference points (L-shape: top, corner, left) per video. Once +all targets are picked, tracking can run in the background. + +- [x] **Step 1 — Inventory**: `scripts/build_video_inventory.py` → + `data/metadata/video_inventory.csv`. 63 (date,machine) sessions match + the xlsx, all videos found, 129 video instances need tracking. +- [x] **Step 2 — Manual target picker**: `scripts/pick_targets.py`. 
Loops over + videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows + a representative frame, captures 3 clicks (top, corner, left), saves + `data/targets/.json`. Skips videos already done. +- [x] **Step 3 — Background tracker**: `scripts/track_videos.py`. Reads target + JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs + `MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes + `data/tracked/_tracking.db`. Idempotent. Smoke-tested + end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band. +- [x] **Step 4 — Tracking deps**: `requirements-tracking.txt`. + +### Still TODO +- [ ] User to run `pick_targets.py` (interactive — needs DISPLAY) on the 129 + pending videos. +- [ ] Run `track_videos.py --jobs 4` against the resulting JSONs. +- [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that + DO have visible targets (saves clicks). Confirmed not useful on the + 2025-07-15 batch — these arenas don't have black target dots — but worth + trying on 2024 batches before falling back to manual. +- [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4 + recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4). + One of them is at lower resolution (1280x960) — likely an aborted take. + +### Open questions / risks + +- Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on + 2024-09-17). Need to figure out which is the real "test" video vs aborted + takes — possibly use video duration or filename pattern. +- One mismatched-resolution file: `1280x960@25fps-20q` instead of + `1920x1088@25fps-28q` — flag for inspection. +- The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo; + `MultiFlyTracker` is its likely successor. Validate output schema matches + what the existing analysis pipeline expects (`load_roi_data.py`, etc.). 
+ ## Discovered During Work (Add new items here as they come up during analysis) From f60a9d053015bd7408948a31e5323cf4d5383e4f Mon Sep 17 00:00:00 2001 From: Giorgio Gilestro Date: Thu, 30 Apr 2026 15:20:14 +0100 Subject: [PATCH 2/4] Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of ownCloud to avoid sync conflicts and bandwidth churn). config.py TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab mounts it world-readable for JupyterHub users. - New scripts/export_video_db_index.py joins all_video_info_merged.xlsx with the video inventory and the on-disk DBs, producing a TSV that has one row per fly/ROI plus training/testing video and DB paths. Handles approximate xlsx times, cross-day training/testing, the 12 AM/PM ambiguity, and date typos. - scripts/load_roi_data.py rewritten as a TSV-driven loader returning a single DataFrame with session and metadata columns. calculate_distances and the two flies_analysis notebooks migrated to use it; downstream trained/naive splits remain available via simple equality filters. - Metadata vocabulary canonicalized: {naïve, niave, untrained, test} all resolve to {trained, naive}. Normalization happens at the TSV-export boundary (idempotent); the xlsx and the 2025-07-15 legacy CSV were edited in place to remove the worst variants. - scripts/monitor_tracking.py rate calculation fixed: with N parallel workers, completions arrive in bursts; the old formula divided by burst width and reported nonsense rates. Now uses a 6 h window denominator. - scripts/track_videos.py: BGRMovieCamera retries cv2.read on transient NFS hiccups and a post-tracking completeness gate (≥ 90 % of expected duration via MAX(t) across all 6 ROIs) deletes silent partial DBs. 
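
Illustration of the completeness gate described in the last bullet (a minimal sketch, not the code in this patch). It assumes the tracker's SQLite output has one table per ROI named ROI_1 .. ROI_6 with a millisecond column `t`, and that "MAX(t) across all 6 ROIs" means the latest timestamp found in any ROI table. The function name `is_complete` and its parameters are hypothetical:

```python
import sqlite3


def is_complete(db_path, expected_ms, n_rois=6, min_fraction=0.9):
    """Return True if the tracked span covers >= min_fraction of expected_ms.

    Assumption: tables are named ROI_1..ROI_<n_rois> and store time in a
    column `t` (milliseconds since the start of the video).
    """
    last_t = 0
    con = sqlite3.connect(db_path)
    try:
        for i in range(1, n_rois + 1):
            try:
                row = con.execute(f"SELECT MAX(t) FROM ROI_{i}").fetchone()
            except sqlite3.OperationalError:
                # Table missing: treat this ROI as having no data.
                continue
            if row and row[0] is not None:
                last_t = max(last_t, row[0])
    finally:
        con.close()
    return last_t >= min_fraction * expected_ms
```

A caller would presumably delete the DB when this returns False, so that the next idempotent run re-tracks the video instead of skipping it.
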
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 7 +- README.md | 2 +- data/metadata/2025_07_15_metadata_fixed.csv | 38 ++-- data/processed/README.md | 58 ++++--- notebooks/flies_analysis.ipynb | 19 +- notebooks/flies_analysis_simple.ipynb | 19 +- scripts/calculate_distances.py | 164 ++++++++---------- scripts/config.py | 5 +- scripts/export_video_db_index.py | 181 ++++++++++++++++++++ scripts/load_roi_data.py | 171 ++++++++++-------- scripts/monitor_tracking.py | 35 +++- scripts/track_videos.py | 83 ++++++++- tasks/todo.md | 24 ++- 13 files changed, 569 insertions(+), 237 deletions(-) create mode 100644 scripts/export_video_db_index.py diff --git a/.gitignore b/.gitignore index 02d5434..07f3445 100644 --- a/.gitignore +++ b/.gitignore @@ -2,11 +2,8 @@ data/raw/*.db data/processed/*.csv -# Offline-tracking outputs (reproducible from videos + target JSONs) -data/tracked/*.db -data/tracked/*.db-wal -data/tracked/*.db-shm -data/tracked/*.db-journal +# Offline-tracking outputs (regenerable from videos + target JSONs) +# DBs live outside the repo at /mnt/data/projects/cupido/tracked/ data/targets/*.json data/metadata/video_inventory.csv data/logs/*.log diff --git a/README.md b/README.md index 9d9ff17..5644fea 100644 --- a/README.md +++ b/README.md @@ -66,7 +66,7 @@ python scripts/pick_targets.py --redo # re-pick already-picked videos # 3) batch tracking (idempotent, can run in background) python scripts/track_videos.py --jobs 4 # parallel -# output → data/tracked/*_tracking.db (SQLite, same schema as data/raw/) +# output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite, same schema as data/raw/) ``` See `tasks/todo.md` "Offline Tracking" section for the full plan, and diff --git a/data/metadata/2025_07_15_metadata_fixed.csv b/data/metadata/2025_07_15_metadata_fixed.csv index 36d07c5..bce7bcc 100644 --- a/data/metadata/2025_07_15_metadata_fixed.csv +++ b/data/metadata/2025_07_15_metadata_fixed.csv @@ -1,37 +1,37 @@ 
-date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb +date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb 15/07/2025,16-03-10,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 -15/07/2025,16-03-10,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 +15/07/2025,16-03-10,76,4,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 15/07/2025,16-03-10,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 -15/07/2025,16-03-10,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 +15/07/2025,16-03-10,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 15/07/2025,16-03-10,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 -15/07/2025,16-03-10,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 
+15/07/2025,16-03-10,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4 15/07/2025,16-31-34,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 15/07/2025,16-31-34,76,4,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 15/07/2025,16-31-34,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 -15/07/2025,16-31-34,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 -15/07/2025,16-31-34,76,3,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 -15/07/2025,16-31-34,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 +15/07/2025,16-31-34,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 +15/07/2025,16-31-34,76,3,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 
+15/07/2025,16-31-34,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98 15/07/2025,16-03-27,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 15/07/2025,16-03-27,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 15/07/2025,16-03-27,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 -15/07/2025,16-03-27,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 -15/07/2025,16-03-27,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 -15/07/2025,16-03-27,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 +15/07/2025,16-03-27,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 
+15/07/2025,16-03-27,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 +15/07/2025,16-03-27,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72 15/07/2025,16-31-41,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 15/07/2025,16-31-41,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 15/07/2025,16-31-41,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 -15/07/2025,16-31-41,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 -15/07/2025,16-31-41,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 -15/07/2025,16-31-41,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 +15/07/2025,16-31-41,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 
+15/07/2025,16-31-41,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 +15/07/2025,16-31-41,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9 15/07/2025,16-31-52,139,6,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 15/07/2025,16-31-52,139,4,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 15/07/2025,16-31-52,139,2,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 -15/07/2025,16-31-52,139,5,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 -15/07/2025,16-31-52,139,3,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 -15/07/2025,16-31-52,139,1,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 -15/07/2025,16-32-05,268,6,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 
-15/07/2025,16-32-05,268,4,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 -15/07/2025,16-32-05,268,2,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 +15/07/2025,16-31-52,139,5,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 +15/07/2025,16-31-52,139,3,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 +15/07/2025,16-31-52,139,1,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4 +15/07/2025,16-32-05,268,6,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 +15/07/2025,16-32-05,268,4,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 +15/07/2025,16-32-05,268,2,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 15/07/2025,16-32-05,268,5,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 
15/07/2025,16-32-05,268,3,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 15/07/2025,16-32-05,268,1,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72 diff --git a/data/processed/README.md b/data/processed/README.md index 97d2e82..d934460 100644 --- a/data/processed/README.md +++ b/data/processed/README.md @@ -1,39 +1,47 @@ # Processed Data -Large CSV files generated from the analysis pipeline. All files are gitignored (~370MB total) and can be regenerated. +CSVs derived from the tracking DBs (`/mnt/data/projects/cupido/tracked/`) +and the merged TSV (`../../all_video_info_merged.tsv`). All files are +gitignored and regenerable. ## Files and Regeneration | File | Description | Generated By | |------|-------------|--------------| -| `trained_roi_data.csv` | Raw tracking data for trained ROIs | `scripts/load_roi_data.py` or notebook step 1 | -| `untrained_roi_data.csv` | Raw tracking data for untrained ROIs | `scripts/load_roi_data.py` or notebook step 1 | -| `trained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` | -| `untrained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` | -| `trained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 | -| `untrained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 | -| `trained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 | -| `untrained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 | -| `trained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 | -| `untrained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 | +| `distances.csv` | 
Per-frame inter-fly distances for every (date, machine, ROI, session). Includes metadata columns to filter trained vs naïve, training phase, species, etc. | `scripts/calculate_distances.py` | +| `*_distances_aligned.csv` | (legacy, 2025-07-15 only) distances aligned to barrier opening | `notebooks/flies_analysis*.ipynb` | +| `*_tracked.csv` | (legacy) identity-tracked fly positions | `notebooks/flies_analysis_simple.ipynb` | +| `*_max_velocity.csv` | (legacy) max velocity over 10 s windows | `notebooks/flies_analysis_simple.ipynb` | -## To Regenerate All Data +## Loading the data -Run the full notebook `notebooks/flies_analysis_simple.ipynb` with: ```python -recalculate_distances = True -recalculate_tracking = True +import sys +sys.path.insert(0, "../scripts") +from load_roi_data import load_roi_data + +data = load_roi_data() # full batch as one DataFrame +# Or filter the metadata first: +import pandas as pd +tsv = pd.read_csv("../../all_video_info_merged.tsv", sep="\t") +data = load_roi_data(tsv[tsv.species.str.contains("Melanogaster")]) ``` -**Warning**: Identity tracking and velocity calculations take significant time (~30+ minutes). +The returned DataFrame has columns: +`id, t, x, y, w, h, phi, is_inferred, has_interacted, session, ROI, date, +machine_name, species, male, training_date_time, testing_date_time, +training_length_hr, consolidation_length_hr, memory, age`. -## Column Reference +`session` is `"training"` or `"testing"`; `male` is `"trained"` or +`"naive"` (canonical — variants like `"naïve"` and `"niave"` are normalized +at the TSV-export step). 
-### Distance CSVs (`*_distances_aligned.csv`) -- `machine_name`: Ethoscope machine ID (string) -- `ROI`: ROI number (1-6) -- `aligned_time`: Time in ms relative to barrier opening (0 = opening) -- `distance`: Euclidean distance between flies in pixels -- `n_flies`: Number of flies detected at this time point -- `area_fly1`, `area_fly2`: Bounding box areas (w*h) in pixels^2 -- `group`: "trained" or "untrained" +## Column Reference (`distances.csv`) + +- `date`, `machine_name`, `ROI`, `session`: identifies one fly trajectory +- `t`: time in ms within that session +- `distance`: Euclidean distance between the two flies in pixels +- `n_flies`: number of fly detections at this frame (1 or 2) +- `area_fly1`, `area_fly2`: bounding-box areas (`w * h`) in pixels² +- `male`: `trained` or `naive` (carried from the xlsx; normalized) +- `species`, `memory`, `age`: experimental metadata diff --git a/notebooks/flies_analysis.ipynb b/notebooks/flies_analysis.ipynb index d9c24e3..9bf3a30 100644 --- a/notebooks/flies_analysis.ipynb +++ b/notebooks/flies_analysis.ipynb @@ -28,7 +28,22 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": "def load_roi_data():\n \"\"\"Load ROI data from SQLite databases and group by trained/untrained\"\"\"\n metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')\n metadata['machine_name'] = metadata['machine_name'].astype(str)\n \n trained_rois = metadata[metadata['group'] == 'trained']\n untrained_rois = metadata[metadata['group'] == 'untrained']\n \n db_files = list(DATA_RAW.glob('*_tracking.db'))\n \n trained_df = pd.DataFrame()\n untrained_df = pd.DataFrame()\n \n for db_file in db_files:\n print(f\"Processing {db_file.name}\")\n \n pattern = r'_([0-9a-f]{32})__'\n match = re.search(pattern, db_file.name)\n \n if not match:\n print(f\"Could not extract UUID from {db_file.name}\")\n continue\n \n uuid = match.group(1)\n metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]\n \n if 
metadata_matches.empty:\n print(f\"No metadata matches found for UUID {uuid}\")\n continue\n \n machine_id = metadata_matches.iloc[0]['machine_name']\n print(f\"Matched to machine ID: {machine_id}\")\n \n conn = sqlite3.connect(str(db_file))\n \n machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]\n machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]\n \n for _, row in machine_trained.iterrows():\n roi = row['ROI']\n try:\n roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n roi_data['machine_name'] = machine_id\n roi_data['ROI'] = roi\n roi_data['group'] = 'trained'\n trained_df = pd.concat([trained_df, roi_data], ignore_index=True)\n except Exception as e:\n print(f\"Error loading ROI_{roi}: {e}\")\n \n for _, row in machine_untrained.iterrows():\n roi = row['ROI']\n try:\n roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n roi_data['machine_name'] = machine_id\n roi_data['ROI'] = roi\n roi_data['group'] = 'untrained'\n untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)\n except Exception as e:\n print(f\"Error loading ROI_{roi}: {e}\")\n \n conn.close()\n \n return trained_df, untrained_df\n\ntrained_data, untrained_data = load_roi_data()\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\n\ntrained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)\nuntrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)\nprint(\"Data saved to CSV files\")" + "source": [ + "# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n", + "# Reason: replaces the old data/raw + 2025_07_15_metadata_fixed.csv path with\n", + "# the TSV-based loader that covers the entire batch (2025-07-15 + 2024).\n", + "sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n", + "from load_roi_data import load_roi_data\n", + "\n", + "data = load_roi_data()\n", + "# Backwards-compat 
slices for the rest of the notebook.\n", + "trained_data = data[data['male'] == 'trained'].copy()\n", + "untrained_data = data[data['male'] == 'naive'].copy()\n", + "\n", + "print(f\"all data: {data.shape}\")\n", + "print(f\"trained: {trained_data.shape}\")\n", + "print(f\"naive: {untrained_data.shape}\")\n" + ] }, { "cell_type": "markdown", @@ -219,4 +234,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/notebooks/flies_analysis_simple.ipynb b/notebooks/flies_analysis_simple.ipynb index 1663b10..7072c73 100644 --- a/notebooks/flies_analysis_simple.ipynb +++ b/notebooks/flies_analysis_simple.ipynb @@ -28,7 +28,22 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": "# Load the pre-processed data\ntrained_data = pd.read_csv(DATA_PROCESSED / 'trained_roi_data.csv')\nuntrained_data = pd.read_csv(DATA_PROCESSED / 'untrained_roi_data.csv')\n\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\nprint(f\"Trained data columns: {list(trained_data.columns)}\")\nprint(f\"Untrained data columns: {list(untrained_data.columns)}\")" + "source": [ + "# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n", + "# Reason: replaces reads of trained_roi_data.csv / untrained_roi_data.csv with\n", + "# the live loader so the notebook always sees the current batch.\n", + "sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n", + "from load_roi_data import load_roi_data\n", + "\n", + "data = load_roi_data()\n", + "trained_data = data[data['male'] == 'trained'].copy()\n", + "untrained_data = data[data['male'] == 'naive'].copy()\n", + "\n", + "print(f\"all data shape: {data.shape}\")\n", + "print(f\"Trained data: {trained_data.shape}\")\n", + "print(f\"Naive data: {untrained_data.shape}\")\n", + "print(f\"Columns: {list(trained_data.columns)}\")\n" + ] }, { "cell_type": "markdown", @@ -418,4 +433,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No 
newline at end of file +} diff --git a/scripts/calculate_distances.py b/scripts/calculate_distances.py index 09eff9a..75e7a1a 100644 --- a/scripts/calculate_distances.py +++ b/scripts/calculate_distances.py @@ -1,117 +1,99 @@ -import pandas as pd +"""Compute per-frame inter-fly distances for every (date, machine, ROI, session). + +Reads tracking data via :func:`load_roi_data.load_roi_data` (which is driven +by ``all_video_info_merged.tsv``) and produces one distances DataFrame +spanning every fly/session in the batch. Group membership (``trained`` / +``naive``) is preserved from the ``male`` column. +""" + import numpy as np +import pandas as pd from scipy.spatial.distance import euclidean from config import DATA_PROCESSED +from load_roi_data import load_roi_data -def calculate_fly_distances(trained_file=None, untrained_file=None): - """Calculate distances between flies at each time point. +def calculate_fly_distances(data: pd.DataFrame | None = None) -> pd.DataFrame: + """Compute inter-fly distances over time for every fly/session. - For each time point: - - If two flies are detected: calculate Cartesian distance between them - - If one fly is detected: set distance to 0 if area > average area, otherwise NaN + For each time point inside one (date, machine, ROI, session) trajectory: + - 2+ flies detected: Euclidean distance between the first two by id + - 1 fly detected: distance = 0 if its bbox area exceeds the global + mean (likely a single blob containing both flies), else NaN Args: - trained_file (Path): Path to trained ROI data CSV. - untrained_file (Path): Path to untrained ROI data CSV. + data: optional pre-loaded DataFrame from :func:`load_roi_data`. If + None, the full batch is loaded. Returns: - tuple: (trained_distances, untrained_distances) DataFrames. 
+ DataFrame with one row per (track, time) pair, including ``distance``, + ``n_flies``, ``area_fly1``, ``area_fly2``, plus the metadata columns + propagated from the source row (``date``, ``machine_name``, ``ROI``, + ``session``, ``male``, ``species``, ``memory``, ``age``). """ - if trained_file is None: - trained_file = DATA_PROCESSED / 'trained_roi_data.csv' - if untrained_file is None: - untrained_file = DATA_PROCESSED / 'untrained_roi_data.csv' + if data is None: + data = load_roi_data() + if data.empty: + return pd.DataFrame() - trained_df = pd.read_csv(trained_file) - untrained_df = pd.read_csv(untrained_file) - - trained_df['area'] = trained_df['w'] * trained_df['h'] - untrained_df['area'] = untrained_df['w'] * untrained_df['h'] - - avg_area = np.mean([trained_df['area'].mean(), untrained_df['area'].mean()]) + data = data.copy() + data["area"] = data["w"] * data["h"] + avg_area = data["area"].mean() print(f"Average area across all data: {avg_area:.2f}") - trained_distances = process_distance_data(trained_df, avg_area) - untrained_distances = process_distance_data(untrained_df, avg_area) + # Carry these onto every output row (constant within a track). + keep_meta = ["date", "machine_name", "ROI", "session", "male", + "species", "memory", "age"] - return trained_distances, untrained_distances - - -def process_distance_data(df, avg_area): - """Process a DataFrame to calculate distances between flies at each time point. - - Args: - df (pd.DataFrame): Input tracking data. - avg_area (float): Average area threshold for single-fly detection. - - Returns: - pd.DataFrame: Distance data with columns for machine, ROI, time, distance. 
- """ - results = [] - - for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']): - for t, time_group in group.groupby('t'): - time_group = time_group.sort_values('id').reset_index(drop=True) + rows: list[dict] = [] + track_keys = ["date", "machine_name", "ROI", "session"] + for track, track_df in data.groupby(track_keys, sort=False): + meta_row = {k: v for k, v in zip(track_keys, track)} + # Carry the rest of the metadata from any sample (constant per track). + sample = track_df.iloc[0] + for col in keep_meta: + if col not in meta_row: + meta_row[col] = sample[col] + for t, time_group in track_df.groupby("t", sort=False): + time_group = time_group.sort_values("id").reset_index(drop=True) + row = dict(meta_row) + row["t"] = t if len(time_group) >= 2: - fly1 = time_group.iloc[0] - fly2 = time_group.iloc[1] - distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']]) + f1, f2 = time_group.iloc[0], time_group.iloc[1] + row["distance"] = euclidean([f1["x"], f1["y"]], [f2["x"], f2["y"]]) + row["n_flies"] = len(time_group) + row["area_fly1"] = f1["area"] + row["area_fly2"] = f2["area"] + else: + f = time_group.iloc[0] + row["distance"] = 0.0 if f["area"] > avg_area else np.nan + row["n_flies"] = 1 + row["area_fly1"] = f["area"] + row["area_fly2"] = np.nan + rows.append(row) - results.append({ - 'machine_name': machine_name, - 'ROI': roi, - 't': t, - 'distance': distance, - 'n_flies': len(time_group), - 'area_fly1': fly1['area'], - 'area_fly2': fly2['area'] - }) - elif len(time_group) == 1: - fly = time_group.iloc[0] - area = fly['area'] - - if area > avg_area: - distance = 0.0 - else: - distance = np.nan - - results.append({ - 'machine_name': machine_name, - 'ROI': roi, - 't': t, - 'distance': distance, - 'n_flies': 1, - 'area_fly1': area, - 'area_fly2': np.nan - }) - - return pd.DataFrame(results) + return pd.DataFrame(rows) -def main(): - """Run distance calculations and save results.""" - trained_distances, untrained_distances = 
calculate_fly_distances() +def main() -> None: + distances = calculate_fly_distances() - print(f"Trained data distance summary:") - print(f" Shape: {trained_distances.shape}") - print(f" Distance stats:") - print(f" Count: {trained_distances['distance'].count()}") - print(f" Mean: {trained_distances['distance'].mean():.2f}") - print(f" Std: {trained_distances['distance'].std():.2f}") + print("\nDistance summary:") + print(f" Shape: {distances.shape}") + if not distances.empty: + print(f" Distance count: {distances['distance'].count()}") + print(f" Distance mean: {distances['distance'].mean():.2f}") + print(f" Distance std: {distances['distance'].std():.2f}") + male = distances["male"] + print(f" Trained rows: {(male == 'trained').sum()}") + print(f" Naive rows: {(male == 'naive').sum()}") - print(f"\nUntrained data distance summary:") - print(f" Shape: {untrained_distances.shape}") - print(f" Distance stats:") - print(f" Count: {untrained_distances['distance'].count()}") - print(f" Mean: {untrained_distances['distance'].mean():.2f}") - print(f" Std: {untrained_distances['distance'].std():.2f}") - - trained_distances.to_csv(DATA_PROCESSED / 'trained_distances.csv', index=False) - untrained_distances.to_csv(DATA_PROCESSED / 'untrained_distances.csv', index=False) - print("\nDistance data saved") + DATA_PROCESSED.mkdir(parents=True, exist_ok=True) + out = DATA_PROCESSED / "distances.csv" + distances.to_csv(out, index=False) + print(f"\nSaved {out}") if __name__ == "__main__": diff --git a/scripts/config.py b/scripts/config.py index a3462b2..447cee3 100644 --- a/scripts/config.py +++ b/scripts/config.py @@ -13,5 +13,8 @@ VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos") VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx" INVENTORY_CSV = DATA_METADATA / "video_inventory.csv" TARGETS_DIR = PROJECT_ROOT / "data" / "targets" -TRACKING_OUTPUT_DIR = PROJECT_ROOT / "data" / "tracked" +# Reason: tracking DBs are large binary files that don't 
belong in +# 
ownCloud-synced storage (sync conflicts + bandwidth). They live on the +# local data volume instead. Regenerable from videos + target JSONs. +TRACKING_OUTPUT_DIR = Path("/mnt/data/projects/cupido/tracked") LOGS_DIR = PROJECT_ROOT / "data" / "logs" diff --git a/scripts/export_video_db_index.py b/scripts/export_video_db_index.py new file mode 100644 index 0000000..723108c --- /dev/null +++ b/scripts/export_video_db_index.py @@ -0,0 +1,181 @@ +"""Augment all_video_info_merged.xlsx with the input video + tracking DB paths. + +Each xlsx row represents one fly (date, machine_name, ROI), observed across a +training session and a testing session. We resolve those two sessions to the +on-disk video files (via the inventory CSV) and to their tracking DBs (under +TRACKING_OUTPUT_DIR), then write the result as TSV. + +Output columns added: + training_video_path, training_db_path, + testing_video_path, testing_db_path + +Empty values mean either no video matched (rare — implies missing inventory +entry) or no DB exists yet (e.g. the one video the completeness gate +rejected). + +Usage: + python export_video_db_index.py + python export_video_db_index.py --out path/to/output.tsv +""" + +from __future__ import annotations + +import argparse +import re +from pathlib import Path + +import pandas as pd + +from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX + + +_TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE) + + +def parse_xlsx_time(value: str) -> tuple[str, int] | None: + """Convert '20241021_11AM' / '20240918_1030AM' to (YYYY-MM-DD, minutes24). + + Resolution is hour-only when no minutes are given (e.g. '11AM' → 11:00). + Returns minutes-from-midnight so we can do nearest-neighbor matching. 
+ """ + if not isinstance(value, str): + return None + m = _TIME_RE.match(value.strip()) + if not m: + return None + ymd, hh, mm, ampm = m.groups() + date = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:8]}" + hour = int(hh) + minute = int(mm) if mm else 0 + if ampm.upper() == "PM" and hour != 12: + hour += 12 + if ampm.upper() == "AM" and hour == 12: + hour = 0 + return date, hour * 60 + minute + + +def build_session_index(inventory: pd.DataFrame) -> dict[tuple[str, str], list[dict]]: + """Index inventory rows by (date, machine_name) → list of session dicts.""" + idx: dict[tuple[str, str], list[dict]] = {} + for row in inventory.itertuples(index=False): + h, m, _s = (int(p) for p in str(row.session_time).split("-")) + key = (row.session_date, row.machine_name) + idx.setdefault(key, []).append({ + "mp4_path": row.mp4_path, + "session_datetime": row.session_datetime, + "minutes": h * 60 + m, + }) + return idx + + +def db_path_for_video(mp4_path: str) -> Path | None: + """Tracker writes _tracking.db under TRACKING_OUTPUT_DIR.""" + stem = Path(mp4_path).stem + db = TRACKING_OUTPUT_DIR / f"{stem}_tracking.db" + return db if db.exists() else None + + +_TIME_TOLERANCE_MIN = 90 # xlsx labels are approximate ("11AM" → 10:51 is fine) + + +def resolve_session( + machine_name: str, + when: str, + fallback_date: str | None, + index: dict[tuple[str, str], list[dict]], +) -> tuple[str, str]: + """Look up the video + db whose start time is closest to `when`. + + Match strategy: + 1. Use the date embedded in `when` (training/testing can fall on a + different calendar day from the row's ``date`` column). + 2. If no candidates exist for that date, fall back to ``fallback_date`` + (the xlsx row's ``date`` column). Reason: the xlsx contains + date typos like '20240110_11AM' for an Oct 1 experiment. + + Among candidates, pick the video whose start minute is closest to the + xlsx-claimed time, within ±_TIME_TOLERANCE_MIN. 
+ """ + parsed = parse_xlsx_time(when) + if parsed is None: + return "", "" + date, target_min = parsed + candidates = index.get((date, machine_name), []) + if not candidates and fallback_date: + candidates = index.get((fallback_date, machine_name), []) + if not candidates: + return "", "" + + def _gap(target: int, c: dict) -> int: + # Reason: xlsx times like '1230AM' are ambiguous (12 AM vs 12 PM). + # We try both the literal time AND a +12-hour shift, picking the + # interpretation that brings us closest to a real session. + return min(abs(c["minutes"] - target), abs(c["minutes"] - (target + 720) % 1440)) + + best = min(candidates, key=lambda c: _gap(target_min, c)) + if _gap(target_min, best) > _TIME_TOLERANCE_MIN: + return "", "" + db = db_path_for_video(best["mp4_path"]) + return best["mp4_path"], (str(db) if db else "") + + +# Variants of "naive" the xlsx has accumulated: 'naïve', 'niave', plus +# trailing whitespace. All collapse to a single canonical 'naive'. +_MALE_NAIVE_VARIANTS = {"naïve", "niave", "naive"} + + +def _normalize_metadata(df: pd.DataFrame) -> None: + """Strip whitespace and canonicalize the ``male`` column in place.""" + for col in df.select_dtypes(include=("object", "string")).columns: + df[col] = df[col].astype(str).str.strip() + df["male"] = df["male"].apply( + lambda v: "naive" if v.lower() in _MALE_NAIVE_VARIANTS else v + ) + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--out", + type=Path, + default=VIDEO_INFO_XLSX.with_suffix(".tsv"), + help="output TSV path (default: alongside the xlsx)", + ) + args = parser.parse_args() + + inv = pd.read_csv(INVENTORY_CSV) + inv = inv[inv["in_xlsx"]].copy() + index = build_session_index(inv) + + df = pd.read_excel(VIDEO_INFO_XLSX) + _normalize_metadata(df) + date_iso = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d") + + train_videos, train_dbs, test_videos, test_dbs = [], [], [], [] + for fallback, row in zip(date_iso, 
df.itertuples(index=False)): + tv, td = resolve_session(row.machine_name, row.training_date_time, fallback, index) + sv, sd = resolve_session(row.machine_name, row.testing_date_time, fallback, index) + train_videos.append(tv) + train_dbs.append(td) + test_videos.append(sv) + test_dbs.append(sd) + + df["training_video_path"] = train_videos + df["training_db_path"] = train_dbs + df["testing_video_path"] = test_videos + df["testing_db_path"] = test_dbs + + df.to_csv(args.out, sep="\t", index=False) + + n_rows = len(df) + n_train_video = sum(bool(v) for v in train_videos) + n_train_db = sum(bool(v) for v in train_dbs) + n_test_video = sum(bool(v) for v in test_videos) + n_test_db = sum(bool(v) for v in test_dbs) + print(f"wrote {args.out} ({n_rows} rows)") + print(f" training: {n_train_video} with video, {n_train_db} with DB") + print(f" testing: {n_test_video} with video, {n_test_db} with DB") + + +if __name__ == "__main__": + main() diff --git a/scripts/load_roi_data.py b/scripts/load_roi_data.py index 5cf3cc6..84b00eb 100644 --- a/scripts/load_roi_data.py +++ b/scripts/load_roi_data.py @@ -1,90 +1,113 @@ -import pandas as pd +"""Load ROI tracking data from all sessions into one DataFrame. + +Drives off the merged TSV (one row per ROI/fly across training + testing +phases). For each TSV row, opens the corresponding tracking DB and pulls +the matching ROI table, then attaches the experimental metadata. + +The TSV is the single source of truth for what data exists and how it +maps to flies and conditions. +""" + import sqlite3 -import re +from pathlib import Path -from config import DATA_RAW, DATA_METADATA, DATA_PROCESSED +import pandas as pd + +from config import VIDEO_INFO_XLSX -def load_roi_data(): - """Load ROI data from SQLite databases and group by trained/untrained. +# Metadata columns to copy onto every tracking sample. These are the xlsx +# fields that describe the experimental condition behind each fly/ROI. 
+# Reason: the ROI column is uppercase ("ROI") for backwards compatibility +# with the existing analysis pipeline (calculate_distances.py, notebooks). +_META_COLS = ( + "date", + "machine_name", + "species", + "male", + "training_date_time", + "testing_date_time", + "training_length_hr", + "consolidation_length_hr", + "memory", + "age", +) + + +def _open_ro(db_path: str, cache: dict) -> sqlite3.Connection | None: + """Cached read-only sqlite connection. Returns None on failure.""" + if not isinstance(db_path, str) or not db_path: + return None + if db_path not in cache: + try: + cache[db_path] = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) + except sqlite3.Error as e: + print(f"failed to open {Path(db_path).name}: {e}") + cache[db_path] = None + return cache[db_path] + + +def load_roi_data(meta: pd.DataFrame | None = None) -> pd.DataFrame: + """Load ROI tracking data joined with experimental metadata. + + For each row in ``meta``, reads the matching ROI table from both the + training DB and the testing DB (whichever exist), and stamps every + sample with the row's metadata plus a ``session`` column + (``"training"`` or ``"testing"``). Rows with empty DB paths (unusable + videos, or videos that didn't pass the completeness gate) are skipped. + + Args: + meta: optional DataFrame with the same schema as + ``all_video_info_merged.tsv``. Pass a filtered slice to load a + subset (e.g. ``meta[meta.species == 'Melanogaster/CS']``). + Defaults to the full TSV. Returns: - tuple: (trained_df, untrained_df) DataFrames with tracking data. + DataFrame with columns ``id, t, x, y, w, h, phi, is_inferred, + has_interacted, session, `` — one row per tracking + sample. Empty if nothing could be loaded. 
""" - metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv') - metadata['machine_name'] = metadata['machine_name'].astype(str) + if meta is None: + meta = pd.read_csv(VIDEO_INFO_XLSX.with_suffix(".tsv"), sep="\t") - trained_rois = metadata[metadata['group'] == 'trained'] - untrained_rois = metadata[metadata['group'] == 'untrained'] + db_cache: dict = {} + chunks: list[pd.DataFrame] = [] - db_files = list(DATA_RAW.glob('*_tracking.db')) - - trained_df = pd.DataFrame() - untrained_df = pd.DataFrame() - - for db_file in db_files: - print(f"Processing {db_file.name}") - - pattern = r'_([0-9a-f]{32})__' - match = re.search(pattern, db_file.name) - - if not match: - print(f"Could not extract UUID from {db_file.name}") - continue - - uuid = match.group(1) - metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)] - - if metadata_matches.empty: - print(f"No metadata matches found for UUID {uuid} from {db_file.name}") - continue - - machine_id = metadata_matches.iloc[0]['machine_name'] - print(f"Matched to machine ID: {machine_id}") - - conn = sqlite3.connect(str(db_file)) - - machine_trained = trained_rois[trained_rois['machine_name'] == machine_id] - machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id] - - for _, row in machine_trained.iterrows(): - roi = row['ROI'] + for row in meta.itertuples(index=False): + for session in ("training", "testing"): + conn = _open_ro(getattr(row, f"{session}_db_path"), db_cache) + if conn is None: + continue try: - query = f"SELECT * FROM ROI_{roi}" - roi_data = pd.read_sql_query(query, conn) - roi_data['machine_name'] = machine_id - roi_data['ROI'] = roi - roi_data['group'] = 'trained' - trained_df = pd.concat([trained_df, roi_data], ignore_index=True) + df = pd.read_sql_query( + f"SELECT * FROM ROI_{int(row.roi)}", conn + ) except Exception as e: - print(f"Error loading ROI_{roi} from {db_file.name}: {e}") + # Reason: a DB may be missing a ROI table if tracking was + # 
partial; skip rather than abort the whole batch. + print(f" ROI_{row.roi} from {session} DB: {e}") + continue + df["session"] = session + df["ROI"] = int(row.roi) + for col in _META_COLS: + df[col] = getattr(row, col) + chunks.append(df) - for _, row in machine_untrained.iterrows(): - roi = row['ROI'] - try: - query = f"SELECT * FROM ROI_{roi}" - roi_data = pd.read_sql_query(query, conn) - roi_data['machine_name'] = machine_id - roi_data['ROI'] = roi - roi_data['group'] = 'untrained' - untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True) - except Exception as e: - print(f"Error loading ROI_{roi} from {db_file.name}: {e}") + for conn in db_cache.values(): + if conn is not None: + conn.close() - conn.close() - - return trained_df, untrained_df + return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame() if __name__ == "__main__": - trained_data, untrained_data = load_roi_data() - print(f"Trained data shape: {trained_data.shape}") - print(f"Untrained data shape: {untrained_data.shape}") - if not trained_data.empty: - print("Trained data columns:", trained_data.columns.tolist()) - if not untrained_data.empty: - print("Untrained data columns:", untrained_data.columns.tolist()) - - trained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False) - untrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False) - print("Data saved to trained_roi_data.csv and untrained_roi_data.csv") + data = load_roi_data() + print(f"shape: {data.shape}") + if not data.empty: + print(f"columns: {list(data.columns)}") + print(f"sessions: {data['session'].value_counts().to_dict()}") + print(f"unique machines: {data['machine_name'].nunique()}") + print( + f"unique flies (date,machine,roi): " + f"{data.groupby(['date','machine_name','ROI']).ngroups}" + ) diff --git a/scripts/monitor_tracking.py b/scripts/monitor_tracking.py index 9ffa891..991798f 100644 --- a/scripts/monitor_tracking.py +++ b/scripts/monitor_tracking.py @@ -97,13 
+97,32 
@@ def snapshot() -> str: ) lines.append(f" errors in log: {len(errors)}") - # Rate from the last 10 completions, when available. - if len(history) >= 2: - window = history[-min(10, len(history)) :] - span = window[-1] - window[0] - if span > 0: - rate_per_hour = (len(window) - 1) / span * 3600 - lines.append(f" rate (last {len(window) - 1}): {rate_per_hour:.1f} videos/hour") + # Rate from completions in the last 6 h — robust to gaps from killed / + # restarted runs, while wide enough to span multiple parallel-worker + # completion bursts. Reason: with 8 workers all started together on + # multi-hour videos, completions arrive in tight bursts every ~video- + # length apart; a 30-min window catches one burst and overestimates by + # ~10×. 6 h spans at least one full burst cycle for typical videos. + now_ts = time.time() + window_secs = 6 * 3600 + recent = [t for t in history if t >= now_ts - window_secs] + if len(recent) >= 2: + # Reason: with N parallel workers, completions arrive in clumps + # (all workers finish near-simultaneously). Dividing N by the *burst* + # span gives nonsense rates. Use the full window as the denominator + # once the batch has been running long enough to fill it; otherwise + # use elapsed-since-first-DB. Detection: if every DB on disk also + # falls inside the window, the batch is younger than the window. 
+ if len(recent) == len(history): + elapsed = max(1.0, now_ts - history[0]) + else: + elapsed = float(window_secs) + if elapsed > 0: + rate_per_hour = len(recent) / elapsed * 3600 + lines.append( + f" rate (last {len(recent)} in {int(window_secs/3600)} h):" + f" {rate_per_hour:.1f} videos/hour" + ) remaining = max(0, pickable - tracked) if rate_per_hour > 0 and remaining > 0: eta_sec = remaining * 3600 / rate_per_hour @@ -112,6 +131,8 @@ def snapshot() -> str: f" ETA remaining: {fmt_duration(eta_sec)} " f"(done by {eta_at:%H:%M %a})" ) + else: + lines.append(" rate: (warming up — check again in a few min)") if last_mtime is not None and last_name is not None: ago = (datetime.now() - last_mtime).total_seconds() diff --git a/scripts/track_videos.py b/scripts/track_videos.py index d9bd197..cb65292 100644 --- a/scripts/track_videos.py +++ b/scripts/track_videos.py @@ -3,7 +3,7 @@ Reads target JSONs produced by `pick_targets.py`, builds the 6 ROIs of the HD mating arena from the L-shape reference points, runs ethoscope's `MultiFlyTracker` against the merged.mp4 file via `MovieVirtualCamera`, and -writes a SQLite DB to `data/tracked/_tracking.db`. +writes a SQLite DB to `TRACKING_OUTPUT_DIR/_tracking.db`. Idempotent: skips videos whose tracking DB already exists (unless --redo). @@ -58,17 +58,46 @@ def track_one(json_path: Path, output_dir: Path, max_duration: float | None, from ethoscope.io.sqlite import SQLiteResultWriter from ethoscope.trackers.multi_fly_tracker import MultiFlyTracker - class BGRMovieCamera(MovieVirtualCamera): - """MovieVirtualCamera variant that keeps BGR frames. + import time as _time - MultiFlyTracker calls cv2.cvtColor(img, COLOR_BGR2GRAY) without checking - whether img is already grayscale, so we must feed it 3-channel input. + class BGRMovieCamera(MovieVirtualCamera): + """MovieVirtualCamera that keeps BGR frames AND retries on transient + read failures. + + Two reasons for the override: + + 1. 
MultiFlyTracker calls cv2.cvtColor(img, COLOR_BGR2GRAY) without + checking whether img is already grayscale, so we must feed it + 3-channel input. + + 2. cv2.VideoCapture.read() can return False on transient I/O hiccups + (NFS contention when 8 workers pull big mp4s in parallel) without + the file actually being at EOF. A naive "False -> StopIteration" + handling makes the tracker silently exit mid-video and write a + short, lying DB. We retry a few times and only treat persistent + failures within the *interior* of the video as real EOF. """ + + _retry_count = 5 + _retry_backoff_s = 0.25 + _eof_safety_frames = 50 # near end-of-file, treat False as legitimate + def _next_image(self): - ret, frame = self.capture.read() - if not ret or frame is None: - return None - return frame # BGR, untouched + for attempt in range(self._retry_count): + ret, frame = self.capture.read() + if ret and frame is not None: + return frame # BGR, untouched + # If we're near the genuine end of the file, accept it. + if ( + self._has_end_of_file + and self._frame_idx >= self._total_n_frames - self._eof_safety_frames + ): + return None + # Otherwise, this is a suspected transient hiccup — back off + # and try again. The capture is still open; cv2 will pick up + # the next decoded frame. + _time.sleep(self._retry_backoff_s) + return None # truly persistent failure payload = json.loads(json_path.read_text()) if payload.get("unusable"): @@ -146,6 +175,42 @@ def track_one(json_path: Path, output_dir: Path, max_duration: float | None, if not out_db.exists(): return "error", "tracking finished but DB was not created" + + # Post-tracking sanity check: did we cover most of the source video? + # If not (cv2 retry exhausted, codec corruption, etc.), reject the DB so + # it doesn't get cached as "done" — better an explicit failure than a + # silent partial write. 
+ expected_ms = (cam._total_n_frames / 25.0) * 1000.0 + if max_duration is not None: + expected_ms = min(expected_ms, max_duration * 1000.0) + completeness_threshold = 0.90 # require ≥ 90 % of expected duration + + # Use MAX(t) across all ROIs — a single ROI can run dry early if its fly + # stops moving, so the latest detection anywhere in the arena is the + # better signal of how far the iterator actually got. + import sqlite3 as _sqlite3 + try: + _con = _sqlite3.connect(f"file:{out_db}?mode=ro", uri=True) + t_max = 0 + for _i in range(1, 7): + _v = _con.execute(f"SELECT MAX(t) FROM ROI_{_i}").fetchone()[0] + if _v and _v > t_max: + t_max = _v + _con.close() + except Exception: + t_max = 0 + + if expected_ms > 0 and t_max < expected_ms * completeness_threshold: + out_db.unlink() + for sidecar in (str(out_db) + "-wal", str(out_db) + "-shm"): + Path(sidecar).unlink(missing_ok=True) + ratio = t_max / expected_ms if expected_ms else 0 + return ( + "error", + f"short output: t_max={t_max} ms vs expected {int(expected_ms)} ms " + f"({ratio*100:.0f}%); DB removed", + ) + return "ok", str(out_db) diff --git a/tasks/todo.md b/tasks/todo.md index f86bd65..30b473c 100644 --- a/tasks/todo.md +++ b/tasks/todo.md @@ -115,4 +115,26 @@ all targets are picked, tracking can run in the background. ## Discovered During Work -(Add new items here as they come up during analysis) +### Barrier-opening annotation for the 2024 batch (added 2026-04-30) +The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening +event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file +covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch +(`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation +yet, so all post-alignment cells silently exclude that data. 
+ +- [ ] Build a small picker that lets the user scrub through each tracking + DB / video and mark the barrier-opening frame, writing a row to a new + `data/metadata/barrier_opening_2024.csv` (or extend the existing + file with a date column). +- [ ] Once the 2024 entries exist, update `align_to_opening_time` so it + pulls from a unified `barrier_opening` table keyed by + `(date, machine_name)` rather than `machine_name` alone. + +### Metadata vocabulary normalization (done 2026-04-30) +The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`, +`'untrained'` plus trailing whitespace). All sources now use a single +canonical `'naive'`. Normalization happens in +`scripts/export_video_db_index.py` so re-running it from the xlsx always +produces a clean TSV. The 2025-07-15 legacy CSV +(`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from +`'untrained'` → `'naive'`. From 7d095238405d13cf365e85b7f871b907cf9ce7e7 Mon Sep 17 00:00:00 2001 From: Giorgio Gilestro Date: Thu, 30 Apr 2026 17:13:55 +0100 Subject: [PATCH 3/4] Move TARGETS_DIR to /mnt/data/projects/cupido/targets Targets relocated alongside the tracking DBs (out of ownCloud sync) so the docker mount already covers them and ownCloud no longer churns on JSON sidecars. Updated config, fixed a stale docstring in pick_targets, and dropped the now-moot data/targets/*.json gitignore rule. 
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 3 +-- scripts/config.py | 4 +++- scripts/pick_targets.py | 2 +- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/.gitignore b/.gitignore index 07f3445..54331af 100644 --- a/.gitignore +++ b/.gitignore @@ -3,8 +3,7 @@ data/raw/*.db data/processed/*.csv # Offline-tracking outputs (regenerable from videos + target JSONs) -# DBs live outside the repo at /mnt/data/projects/cupido/tracked/ -data/targets/*.json +# DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/ data/metadata/video_inventory.csv data/logs/*.log diff --git a/scripts/config.py b/scripts/config.py index 447cee3..e2951f2 100644 --- a/scripts/config.py +++ b/scripts/config.py @@ -12,7 +12,9 @@ FIGURES = PROJECT_ROOT / "figures" VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos") VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx" INVENTORY_CSV = DATA_METADATA / "video_inventory.csv" -TARGETS_DIR = PROJECT_ROOT / "data" / "targets" +# Reason: kept on the local data volume alongside the tracking DBs (out of +# ownCloud sync). See TRACKING_OUTPUT_DIR comment below. +TARGETS_DIR = Path("/mnt/data/projects/cupido/targets") # Reason: tracking DBs are large binary files that don't belong in # ownCloud-synced storage (sync conflicts + bandwidth). They live on the # local data volume instead. Regenerable from videos + target JSONs. diff --git a/scripts/pick_targets.py b/scripts/pick_targets.py index a5eea07..73be53e 100644 --- a/scripts/pick_targets.py +++ b/scripts/pick_targets.py @@ -12,7 +12,7 @@ These three points are the same reference layout used by ethoscope's coordinates. Saving them as a JSON sidecar lets the offline tracker build the 6-ROI HD mating arena grid without needing auto-target detection. 
-Output JSON sidecar: data/targets/.json +Output JSON sidecar: TARGETS_DIR/.json { "video_path": "/mnt/.../*.mp4", "frame_index": , From ec56e51bf9eab5e19f74297479f4747412316de4 Mon Sep 17 00:00:00 2001 From: Giorgio Gilestro Date: Thu, 30 Apr 2026 18:14:17 +0100 Subject: [PATCH 4/4] Add beginner tutorial notebooks for incoming students MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts). Co-Authored-By: Claude Opus 4.7 --- notebooks/getting_started/00_welcome.ipynb | 255 +++++++++ .../01_python_pandas_basics.ipynb | 500 ++++++++++++++++++ .../02_explore_one_database.ipynb | 439 +++++++++++++++ .../03_compare_trained_vs_naive.ipynb | 398 ++++++++++++++ notebooks/getting_started/README.md | 15 + 5 files changed, 1607 insertions(+) create mode 100644 notebooks/getting_started/00_welcome.ipynb create mode 100644 notebooks/getting_started/01_python_pandas_basics.ipynb create mode 100644 notebooks/getting_started/02_explore_one_database.ipynb create mode 100644 notebooks/getting_started/03_compare_trained_vs_naive.ipynb create mode 100644 notebooks/getting_started/README.md diff --git a/notebooks/getting_started/00_welcome.ipynb b/notebooks/getting_started/00_welcome.ipynb new file mode 100644 index 0000000..e961555 --- /dev/null +++ b/notebooks/getting_started/00_welcome.ipynb @@ -0,0 +1,255 @@ +{ + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + 
}, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 00 \u00b7 Welcome to the Cupido fly-tracking project\n", + "\n", + "Hi! You're about to start working on a project that studies how *Drosophila*\n", + "(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n", + "flies behave differently from na\u00efve ones in their later courtship.\n", + "\n", + "**You don't need any prior experience with Python or data science to follow\n", + "along.** This series of notebooks will walk you through everything, one\n", + "small step at a time.\n", + "\n", + "> **How to read these notebooks**: each notebook is split into \"cells\".\n", + "> Some cells are explanations (like this one), others are code that you\n", + "> can **run** by clicking on the cell and pressing `Shift + Enter`. Try it\n", + "> on the next cell.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# This is a code cell. Click on it and press Shift+Enter to run it.\n", + "print(\"Hello, fly world!\")\n", + "1 + 1\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should have seen `Hello, fly world!` printed and the number `2`\n", + "appear underneath. If something else happened, ask Giorgio \u2014 that's a\n", + "sign the environment isn't set up right.\n", + "\n", + "If this is the very first time you're using JupyterLab, take 10 minutes\n", + "to read the [official \"Getting started with JupyterLab\"\n", + "guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).\n", + "The most important things to know are:\n", + "\n", + "- A notebook (`.ipynb` file) is a sequence of **cells**.\n", + "- Each cell is either **Markdown** (formatted text, like this) or **Code**\n", + " (Python that the computer runs).\n", + "- The **kernel** is the running Python process behind the notebook. 
It\n", + " remembers everything you've defined. If something gets weird, restart\n", + " the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n", + "- `Shift + Enter` runs a cell and moves to the next one.\n", + "- `Ctrl + Enter` runs a cell and stays put.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is the project about?\n", + "\n", + "Drosophila males court females with a stereotyped sequence (chasing,\n", + "wing-extension, tapping). When a male is rejected by a female (e.g.\n", + "because she's already mated), he **learns** to suppress his courtship \u2014\n", + "even toward new, receptive females, for a while. This is a textbook\n", + "example of *associative learning* in invertebrates ([review on\n", + "PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n", + "\n", + "The lab is interested in:\n", + "\n", + "- Does this learning **transfer across species**? (We have ~7 *Drosophila*\n", + " species recorded.)\n", + "- How long does the memory last? (training_length_hr,\n", + " consolidation_length_hr columns in the metadata.)\n", + "- Are there **individual differences** \u2014 do some males learn while others\n", + " don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n", + "\n", + "Your job, broadly, will be to **turn videos of flies into numbers and\n", + "plots that answer these questions.**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How an experiment works (the bird's-eye view)\n", + "\n", + "1. **Training**: a male fly is placed with a non-receptive (mated) female.\n", + " He courts, gets rejected, eventually gives up.\n", + "2. *Wait* for some hours (the \"consolidation\" period \u2014 gives memory time\n", + " to form).\n", + "3. 
**Testing**: same male is placed with a fresh receptive female.\n", + " Does he court her vigorously, or has he learned to give up easily?\n", + "\n", + "Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n", + "6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n", + "contains one couple (a male and a female). A camera films the whole arena\n", + "from above. So one **video** gives us 6 simultaneous experiments.\n", + "\n", + "The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n", + "behavioural recording boxes built in this lab. Each ethoscope is a\n", + "machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What does the data look like?\n", + "\n", + "For each video, the **tracker** (a piece of software that runs after the\n", + "recording) finds the flies frame-by-frame and writes their positions to a\n", + "**SQLite database** (a single file, ending in `.db`). One DB per video.\n", + "Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n", + "one per sub-arena. Each row of an ROI table is **one fly detection at one\n", + "moment in time** with these columns:\n", + "\n", + "| column | meaning |\n", + "|---|---|\n", + "| `id` | row number (auto-incremented) |\n", + "| `t` | time in **milliseconds** since the video started |\n", + "| `x`, `y` | fly position in **pixels** (top-left corner of the image is 0,0) |\n", + "| `w`, `h` | width and height of the bounding box around the fly, in pixels |\n", + "| `phi` | orientation angle of the fly |\n", + "| `is_inferred` | 1 if the position was guessed (not directly seen), 0 otherwise |\n", + "| `has_interacted` | (legacy column, mostly unused) |\n", + "\n", + "If a single ROI has two flies that the tracker can see, you'll get **two\n", + "rows with the same `t`** \u2014 one for each fly. 
If only one fly is detected\n", + "(maybe they're on top of each other), you'll get one row.\n", + "\n", + "That's the heart of the data. Everything else (distances, velocities,\n", + "group comparisons) is computed from these (t, x, y) traces.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Where everything lives\n", + "\n", + "Take a moment to memorize these locations \u2014 you'll come back to them often.\n", + "\n", + "| what | where |\n", + "|---|---|\n", + "| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n", + "| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n", + "| Source video files | `/mnt/ethoscope_data/videos/` |\n", + "| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n", + "| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n", + "| Your notebooks | `notebooks/getting_started/` (this folder) |\n", + "\n", + "Let's verify a couple of these from inside Python:\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n", + "targets = Path(\"/mnt/data/projects/cupido/targets\")\n", + "\n", + "n_dbs = len(list(tracked.glob(\"*_tracking.db\")))\n", + "n_jsons = len(list(targets.glob(\"*.json\")))\n", + "\n", + "print(f\"Tracking DBs available: {n_dbs}\")\n", + "print(f\"Target JSONs available: {n_jsons}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should see roughly 113 tracking DBs and 130 target JSONs. If those\n", + "numbers are zero, the storage volume isn't mounted \u2014 ask Giorgio.\n", + "\n", + "> **Note**: the tracking DBs are read-only inside the JupyterLab\n", + "> container. You can read them but not modify or delete them. 
That's a\n", + "> deliberate safety measure \u2014 we don't want analysis code accidentally\n", + "> corrupting the source data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Glossary (refer back as needed)\n", + "\n", + "- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n", + " arena. There are 6 ROIs per video, numbered 1\u20136.\n", + "- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n", + " same ROI at the same time = two rows with the same `t`.\n", + "- **trained** \u2014 the male had a training session before testing.\n", + "- **naive** \u2014 the male is a control (no training).\n", + "- **training session** \u2014 the recording where the male meets the\n", + " non-receptive female (he gets rejected).\n", + "- **testing session** \u2014 the recording where the male meets a fresh\n", + " receptive female (we measure his courtship).\n", + "- **t (milliseconds)** \u2014 time within one session, starting at 0.\n", + "- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n", + " grows to the right, y grows **downward** (this is the image-coordinate\n", + " convention, opposite of math class).\n", + "- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n", + " `ETHOSCOPE_076`.\n", + "- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n", + " `Erecta`, `Willistoni`, or `CS`.\n", + "\n", + "If you bump into other terms in the code, ask. Don't guess \u2014 biology\n", + "codebases pick up jargon over the years.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's next\n", + "\n", + "When you're ready, open these notebooks **in order**:\n", + "\n", + "1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n", + " read and manipulate tabular data.\n", + "2. 
`02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n", + " trajectory, see what the numbers actually look like.\n", + "3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n", + " comparing groups of flies.\n", + "\n", + "After those, the notebooks one level up (`flies_analysis.ipynb`,\n", + "`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n", + "previous student built \u2014 those will make sense once you've worked\n", + "through the tutorials.\n", + "\n", + "Don't try to power through all of them in one sitting. Run a few cells,\n", + "read the explanation, **change a number** to see what happens, **break\n", + "something on purpose** to see the error message. That's how you learn.\n" + ] + } + ] +} diff --git a/notebooks/getting_started/01_python_pandas_basics.ipynb b/notebooks/getting_started/01_python_pandas_basics.ipynb new file mode 100644 index 0000000..310429a --- /dev/null +++ b/notebooks/getting_started/01_python_pandas_basics.ipynb @@ -0,0 +1,500 @@ +{ + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n", + "\n", + "This notebook teaches the **minimum** Python and `pandas` you need to read\n", + "the rest of the project's code and write your own analyses.\n", + "\n", + "If you've never programmed before, don't try to memorize the syntax.\n", + "Just run each cell, read what it does, and come back when you're stuck on\n", + "something specific. 
The cheat sheet at the end is the only thing worth\n", + "keeping handy.\n", + "\n", + "External resources, in order of how much time they take:\n", + "\n", + "- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n", + "- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n", + "- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n", + "- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Variables\n", + "\n", + "A variable is a named box you put a value into. The `=` is **assignment**,\n", + "not equality. Read it as \"make `name` refer to `value`\".\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "x = 5\n", + "y = 3\n", + "total = x + y\n", + "print(total)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Re-running the cell after changing `x = 5` to `x = 50` gives a different\n", + "answer. Try it.\n", + "\n", + "Variable names: lowercase letters, digits, and underscores. They can't\n", + "start with a digit. Convention is `snake_case`: `mean_distance`, not\n", + "`meanDistance` or `MeanDistance`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Strings and numbers\n", + "\n", + "A **string** is text in quotes. You can join strings with `+`. 
You can\n", + "turn a number into a string with `str()`, and vice-versa with `int()` /\n", + "`float()`.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "species = \"Drosophila melanogaster\"\n", + "n_flies = 12\n", + "message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n", + "print(message)\n", + "\n", + "# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n", + "print(f\"We tracked {n_flies} {species} males.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Lists\n", + "\n", + "A list is an ordered collection of things. Square brackets, items\n", + "separated by commas. You can mix types (but usually shouldn't).\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n", + "print(machines[0]) # first item \u2014 Python counts from 0!\n", + "print(machines[-1]) # last item\n", + "print(len(machines)) # how many items\n", + "print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Dictionaries\n", + "\n", + "A dictionary maps **keys** to **values**. Curly braces, `key: value`\n", + "pairs.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n", + "print(fly[\"species\"])\n", + "print(fly[\"age_days\"])\n", + "fly[\"alive\"] = False # add a new key\n", + "print(fly)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Conditions: if / elif / else\n", + "\n", + "Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n", + "Combine with `and`, `or`, `not`.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "distance_px = 42\n", + "\n", + "if distance_px < 50:\n", + " label = \"close\"\n", + "elif distance_px < 200:\n", + " label = \"medium\"\n", + "else:\n", + " label = \"far\"\n", + "\n", + "print(label)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Loops\n", + "\n", + "`for x in collection:` runs the indented block once per item.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "for m in machines:\n", + " print(f\"Looking at machine {m}\")\n", + "\n", + "# Looping with an index, when you need it:\n", + "for i, m in enumerate(machines):\n", + " print(f\"{i}: {m}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Functions\n", + "\n", + "A function is a named, reusable chunk of code. `def` declares it. `return`\n", + "sends a value back to whoever called it.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "def fly_age_in_weeks(days):\n", + " \"\"\"Return age in weeks given age in days.\"\"\"\n", + " return days / 7\n", + "\n", + "print(fly_age_in_weeks(14)) # 2.0\n", + "print(fly_age_in_weeks(5)) # 0.714\u2026\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Importing libraries\n", + "\n", + "A library is somebody else's code. We use `import` to pull it into our\n", + "notebook.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import math\n", + "print(math.sqrt(16)) # 4.0\n", + "print(math.pi)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. 
Meet pandas\n", + "\n", + "Real data is rarely a single number \u2014 it's a **table** with rows and\n", + "columns (think Excel). `pandas` is the library that handles tables in\n", + "Python. The two main objects are:\n", + "\n", + "- **`Series`** \u2014 a single column with a name.\n", + "- **`DataFrame`** \u2014 a whole table.\n", + "\n", + "By convention we import pandas as `pd`. Always.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Read the project's metadata TSV (Tab-Separated Values).\n", + "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n", + "df = pd.read_csv(tsv_path, sep=\"\\t\")\n", + "\n", + "# How big is it?\n", + "print(f\"Rows: {len(df)}\")\n", + "print(f\"Columns: {df.shape[1]}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10. Looking at the table\n", + "\n", + "`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n", + "column names. `.dtypes` shows the type of each column.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "df.head(3)\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "print(\"Column names:\")\n", + "for c in df.columns:\n", + " print(f\" {c}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11. Selecting columns\n", + "\n", + "Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n", + "attribute access (`df.name`). 
The first works for any column name; the\n", + "second only works if the name has no spaces or weird characters.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "df[\"species\"].head()\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "df.species.value_counts() # how many rows per species\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12. Selecting multiple columns\n", + "\n", + "Pass a **list** of names inside the brackets:\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 13. Filtering rows\n", + "\n", + "The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n", + "Pandas keeps the rows where it's `True`.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "trained = df[df[\"male\"] == \"trained\"]\n", + "print(f\"trained rows: {len(trained)}\")\n", + "\n", + "mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n", + "print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n", + "\n", + "# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n", + "trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n", + "print(f\"trained Mel rows: {len(trained_mel)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 14. 
Grouping and counting\n", + "\n", + "`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n", + "splits the table by the values in that column and computes something per\n", + "group.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# How many ROIs per (species, training condition)?\n", + "df.groupby([\"species\", \"male\"]).size()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 15. Quick plots\n", + "\n", + "DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# How many rows per machine?\n", + "df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n", + "plt.title(\"Number of fly-rows per ethoscope machine\")\n", + "plt.ylabel(\"rows\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 16. Exercises\n", + "\n", + "Don't skip these. They're how you find out what you actually understood.\n", + "\n", + "1. How many rows does `df` have where `age` equals `'5-7'`?\n", + "2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n", + "3. How many distinct `(date, machine_name)` pairs are in the dataset?\n", + " (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n", + "4. Make a bar plot of `species` counts. 
Which species has the most rows?\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Try exercise 1 here\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Try exercise 2 here\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Try exercise 3 here\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Try exercise 4 here\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cheat sheet\n", + "\n", + "```python\n", + "import pandas as pd\n", + "df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n", + "df.head(); df.tail(); df.shape; df.columns # peek\n", + "df[\"col\"]; df[[\"a\", \"b\"]] # select\n", + "df[df[\"col\"] == \"value\"] # filter\n", + "df.groupby(\"col\").size() # count per group\n", + "df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n", + "df[\"col\"].value_counts() # quick counts\n", + "df[\"col\"].unique() # unique values\n", + "df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n", + "df.sort_values(\"col\", ascending=False) # sort\n", + "df.plot(...) # quick plot\n", + "```\n", + "\n", + "Keep this list open when reading other people's code. Most of pandas is\n", + "just combinations of these primitives. 
When you need more, the official\n", + "[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n", + "is excellent.\n" + ] + } + ] +} diff --git a/notebooks/getting_started/02_explore_one_database.ipynb b/notebooks/getting_started/02_explore_one_database.ipynb new file mode 100644 index 0000000..db3a242 --- /dev/null +++ b/notebooks/getting_started/02_explore_one_database.ipynb @@ -0,0 +1,439 @@ +{ + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 02 \u00b7 A first look at one tracking database\n", + "\n", + "In this notebook we open **one** of the SQLite databases that the tracker\n", + "produced and look at what's actually inside. By the end you'll be able to:\n", + "\n", + "- list the tables in a `.db` file\n", + "- read one ROI's tracking trace into a DataFrame\n", + "- plot a fly's path through the arena\n", + "- count how many flies are visible at each moment\n", + "- compute a simple distance between the two flies in a ROI\n", + "\n", + "If you're curious how SQLite works, the\n", + "[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n", + "worth reading. For our purposes, **SQLite is just a file that contains\n", + "several tables you can query like a DataFrame**.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "We import the libraries we need. 
`sqlite3` is part of Python's standard\n", + "library \u2014 no install needed.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import sqlite3\n", + "from pathlib import Path\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Find the databases\n", + "\n", + "The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n", + "db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n", + "\n", + "print(f\"Found {len(db_files)} tracking DBs.\")\n", + "print(\"\\nFirst 5 by name:\")\n", + "for db in db_files[:5]:\n", + " print(f\" {db.name}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The filename encodes the date, time, machine UUID, video resolution, and\n", + "the suffix `_tracking.db`. For example:\n", + "\n", + "```\n", + "2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n", + "\u2514\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u252c\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + " date time machine UUID video format\n", + "```\n", + "\n", + "Pick one to explore. 
Feel free to change the index.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "db_path = db_files[0]\n", + "print(\"Working with:\", db_path.name)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Open the database\n", + "\n", + "We open it **read-only** as a safety measure. The `?mode=ro` flag is\n", + "SQLite's read-only switch.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What tables are inside?\n", + "\n", + "Every SQLite database has a system table called `sqlite_master` that\n", + "lists everything. We can query it like any other table.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "tables = pd.read_sql_query(\n", + " \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n", + ")\n", + "tables\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n", + "sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n", + "`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n", + "\n", + "## Read one ROI\n", + "\n", + "`pd.read_sql_query()` runs an SQL query against the connection and\n", + "returns a DataFrame. 
The query `SELECT * FROM ROI_1` means *\"give me all\n", + "columns and all rows from the table called ROI_1\"*.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n", + "print(f\"shape: {roi1.shape}\") # (rows, columns)\n", + "roi1.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Understanding the columns\n", + "\n", + "Refer back to notebook `00_welcome` for the full column reference. Quick\n", + "recap of the important ones:\n", + "\n", + "- `t`: time in **milliseconds** since the video started.\n", + "- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n", + " **top-left** corner. y grows downward.\n", + "- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n", + " rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n", + " frames where the tracker merged two flies into one big detection.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Quick descriptive stats\n", + "roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The minimum `t` should be 0 (start of the video). The maximum tells you\n", + "how long the recording was. Convert ms to minutes by dividing by 60000:\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "duration_min = roi1[\"t\"].max() / 60_000\n", + "print(f\"Session length: {duration_min:.1f} minutes\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How many flies per frame?\n", + "\n", + "If two flies are visible in this ROI, we get **two rows per `t`**. 
Let's\n", + "check.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "flies_per_frame = roi1.groupby(\"t\").size()\n", + "print(flies_per_frame.value_counts().sort_index())\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n", + "had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n", + "overlapping or one is occluded \u2014 that's something we'll handle properly\n", + "in the next notebook.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Plot one fly's trajectory\n", + "\n", + "We'll plot the position over the first 5 minutes (300 000 ms). At each\n", + "time point we take the **first** detection (sorted by `id`) as \"fly 1\",\n", + "whether one or two flies were detected there. This is a rough\n", + "heuristic; identity tracking is harder than it sounds.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Filter to the first 5 minutes\n", + "sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n", + "\n", + "# Pick \"fly 1\" by taking the first row at each time point\n", + "fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n", + "\n", + "plt.figure(figsize=(6, 5))\n", + "plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n", + "plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n", + "plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n", + "plt.gca().invert_yaxis() # because pixel y grows downward\n", + "plt.xlabel(\"x (pixels)\")\n", + "plt.ylabel(\"y (pixels)\")\n", + "plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n", + "plt.legend()\n", + "plt.axis(\"equal\")\n", + "plt.show()\n" + ] + }, + { +
"cell_type": "markdown", + "metadata": {}, + "source": [ + "You should see a tangle of lines confined to a roughly rectangular ROI.\n", + "That tangle is the fly walking around its sub-arena.\n", + "\n", + "Notice we did `plt.gca().invert_yaxis()` \u2014 that's because in image\n", + "coordinates y grows downward, but humans expect plots where y grows\n", + "upward. Without it the plot would be vertically flipped.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Plot position over time\n", + "\n", + "A trajectory plot collapses time into \"shape on a page\". To see *when*\n", + "things happen we need time on the x-axis.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n", + "\n", + "axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n", + "axes[0].set_ylabel(\"x (px)\")\n", + "axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n", + "\n", + "axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n", + "axes[1].set_ylabel(\"y (px)\")\n", + "axes[1].set_xlabel(\"time (s)\")\n", + "axes[1].invert_yaxis()\n", + "\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Bursts of variation = active fly. Long flat stretches = the fly is sitting\n", + "still. 
You'll come to recognize courtship vs idling by eye after a while.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distance between the two flies\n", + "\n", + "Whenever the ROI has 2 detections at the same `t`, we can compute the\n", + "Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "two_fly_frames = roi1.groupby(\"t\").filter(lambda g: len(g) == 2)\n", + "two_fly_frames = two_fly_frames.sort_values([\"t\", \"id\"])\n", + "\n", + "# Pivot so each row is one timepoint with x1, y1, x2, y2\n", + "def pair_up(g):\n", + " g = g.reset_index(drop=True)\n", + " return pd.Series({\n", + " \"x1\": g.loc[0, \"x\"], \"y1\": g.loc[0, \"y\"],\n", + " \"x2\": g.loc[1, \"x\"], \"y2\": g.loc[1, \"y\"],\n", + " })\n", + "\n", + "paired = two_fly_frames.groupby(\"t\").apply(pair_up).reset_index()\n", + "paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n", + "paired.head()\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "plt.figure(figsize=(12, 4))\n", + "plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n", + "plt.xlabel(\"time (s)\")\n", + "plt.ylabel(\"inter-fly distance (px)\")\n", + "plt.title(\"Distance between the two flies in ROI 1\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the kind of trace that drives the rest of the analysis: a male\n", + "courting a female stays close (small distance); a male giving up wanders\n", + "off (large distance). The shape of this curve is the behavioural readout.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Don't forget to close the connection\n", + "\n", + "If you opened a connection, close it when you're done. 
(Not strictly\n", + "necessary in a notebook \u2014 Python tidies up \u2014 but a good habit.)\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "conn.close()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n", + " and re-run the trajectory plot. Is the arena bigger / smaller? Why\n", + " might that be? (Hint: look at the resolution part of the filename.)\n", + "2. Plot the distance trace for **ROI 4** instead of ROI 1.\n", + "3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n", + "4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n", + " for fly 1 \u2014 when does the bounding box get unusually large?\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Exercise space\n" + ] + } + ] +} diff --git a/notebooks/getting_started/03_compare_trained_vs_naive.ipynb b/notebooks/getting_started/03_compare_trained_vs_naive.ipynb new file mode 100644 index 0000000..91041ae --- /dev/null +++ b/notebooks/getting_started/03_compare_trained_vs_naive.ipynb @@ -0,0 +1,398 @@ +{ + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 03 \u00b7 Your first real analysis: trained vs naive\n", + "\n", + "In notebook 02 we explored a single database. 
Now we'll work with **all\n", + "of them at once**, compute a simple per-fly metric, and ask the central\n", + "question of the project:\n", + "\n", + "> **Do trained males behave differently from na\u00efve males in the testing\n", + "> session?**\n", + "\n", + "By the end you'll have:\n", + "\n", + "- loaded every (fly, session) trace into one big DataFrame using the\n", + " project's helper function;\n", + "- reduced each trace to one number per fly (the *median inter-fly\n", + " distance*);\n", + "- compared the trained group against the na\u00efve group with a histogram\n", + " and a non-parametric statistical test;\n", + "- learnt enough to start asking your own questions.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import sys\n", + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from scipy import stats\n", + "\n", + "# Tell Python where to find the project's helper modules.\n", + "PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n", + "sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n", + "\n", + "from load_roi_data import load_roi_data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading everything at once \u2014 but carefully\n", + "\n", + "`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n", + "and returns one big DataFrame. **It can be slow and memory-hungry**\n", + "(the full batch is ~200 million rows). 
Always start small.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Load the metadata TSV first \u2014 it's small and fast.\n", + "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n", + "meta = pd.read_csv(tsv_path, sep=\"\\t\")\n", + "print(f\"metadata rows: {len(meta)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n", + "with **just one species and just the testing sessions**, because:\n", + "\n", + "1. mixing species is a confound (different species behave differently);\n", + "2. the question is about behaviour after training, so the testing session\n", + " is the relevant one;\n", + "3. starting small means we can iterate quickly.\n", + "\n", + "You can come back later and broaden this filter.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n", + "sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n", + "\n", + "# We're loading every session for these flies, but the loader stamps each\n", + "# row with a 'session' column so we can filter to testing afterwards.\n", + "print(f\"selected metadata rows: {len(sub)}\")\n", + "print(sub[\"male\"].value_counts())\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# This will take a minute or two and use a chunk of RAM. 
Be patient.\n", + "data = load_roi_data(sub)\n", + "print(f\"loaded shape: {data.shape}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What did we get?\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "data.head(3)\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# How big is each session, in tracking samples?\n", + "data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Restrict to the testing session\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "testing = data[data[\"session\"] == \"testing\"].copy()\n", + "print(f\"testing samples: {len(testing):,}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reduce each trace to one number\n", + "\n", + "Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n", + "We can't compare distributions of millions of points across two groups\n", + "in any meaningful way. So we **collapse each (date, machine_name, ROI)\n", + "trace into a single summary number** \u2014 here, the median distance between\n", + "the two flies during testing.\n", + "\n", + "Why median rather than mean? 
Because tracker glitches (one fly\n", + "temporarily lost) can produce huge spikes that barely affect the median.\n", + "[Why medians beat means in noisy data\n", + "(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Step 1 \u2014 per-frame distance.\n", + "# Take only frames with exactly 2 flies (so we have a real distance).\n", + "two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n", + "\n", + "# For each (track, t), compute the distance between the two rows.\n", + "def distance_for_frame(g):\n", + " g = g.sort_values(\"id\").reset_index(drop=True)\n", + " return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n", + "\n", + "# This is the slow step. With ~3 M frames it takes a while.\n", + "per_frame = (\n", + " two_fly\n", + " .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n", + " .apply(distance_for_frame)\n", + " .reset_index(name=\"distance_px\")\n", + ")\n", + "print(f\"per-frame distance rows: {len(per_frame):,}\")\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "# Step 2 \u2014 one number per (date, machine_name, ROI).\n", + "per_fly = (\n", + " per_frame\n", + " .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n", + " .median()\n", + " .reset_index(name=\"median_distance_px\")\n", + ")\n", + "\n", + "# Each row now is \"one fly during testing\", with its median distance.\n", + "print(per_fly.shape)\n", + "per_fly.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sanity check: how many flies per group?\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "per_fly[\"male\"].value_counts()\n" + ] + }, + { + "cell_type": "markdown", +
"metadata": {}, + "source": [ + "If the group sizes are very different, the smaller group limits the\n", + "statistical power of the comparison. Note them down.\n", + "\n", + "## Plot the distributions\n", + "\n", + "The first thing to do with two groups is to **look at them**. Don't trust\n", + "a p-value before you've seen the histogram.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(10, 5))\n", + "\n", + "bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n", + "\n", + "for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n", + " vals = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n", + " ax.hist(vals, bins=bins, alpha=0.6, label=f\"{label} (n={len(vals)})\", color=color)\n", + "\n", + "ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n", + "ax.set_ylabel(\"number of flies\")\n", + "ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n", + "ax.legend()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**What you might see:**\n", + "\n", + "- If the trained group's distribution is shifted to **higher** distances,\n", + " trained males are spending less time near the female (i.e. they\n", + " learned to give up).\n", + "- If the two distributions look identical, no learning effect was\n", + " measurable with this metric \u2014 but that doesn't mean there's no effect,\n", + " just that this particular summary didn't capture it.\n", + "- A **bimodal** trained distribution (two humps) would suggest some males\n", + " learned and others didn't \u2014 the \"individual differences\" story in\n", + " `docs/bimodal_hypothesis.md`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Add a stat test\n", + "\n", + "A formal comparison. 
Because group sizes are small and we don't know if\n", + "the data are normally distributed, the\n", + "[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n", + "is a safer default than the classic t-test.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n", + "naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n", + "\n", + "stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n", + "\n", + "print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n", + "print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n", + "print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**How to read this**: the p-value is the probability of seeing a\n", + "difference at least this big *if there were really no difference*. By\n", + "convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n", + "But never trust a p-value without:\n", + "\n", + "1. eyeballing the histogram first (you did);\n", + "2. reporting the **effect size**, not just the p-value (e.g. the\n", + " difference of medians);\n", + "3. understanding that p-values\n", + " [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's next?\n", + "\n", + "- **Pick a different metric**: instead of median distance, try fraction\n", + " of time the flies were within 50 px (a \"close-proximity\" metric), or\n", + " the maximum velocity per fly. 
(Velocity needs identity tracking, which\n", + " is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n", + "- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n", + " compare. Does the effect generalize? Where is it strongest?\n", + "- **Look at the bimodality**: a kernel density plot\n", + " ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n", + " will show humps better than a histogram.\n", + "- **Time inside the session**: maybe the difference only shows up in the\n", + " first few minutes (right after the female is introduced). Slice\n", + " `per_frame` by `t` before aggregating.\n", + "- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n", + " testing the \"some flies learn, others don't\" hypothesis.\n", + "\n", + "When you write your own analysis, **save it as a new notebook** (don't\n", + "edit this one). Copy the setup cells, change the question, change the\n", + "plot. That's how analysis projects grow.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A note on iteration speed\n", + "\n", + "The pipeline above is correct but **slow** because we apply a Python\n", + "function to every (track, t) group. If you find yourself re-running the\n", + "same expensive computation a lot, save the intermediate result to disk:\n", + "\n", + "```python\n", + "per_frame.to_parquet(\"per_frame_distance.parquet\")\n", + "# next time:\n", + "per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n", + "```\n", + "\n", + "`parquet` is a fast columnar format. `pip install pyarrow` if your\n", + "environment doesn't have it.\n", + "\n", + "There are also vectorized ways to compute these distances ~100\u00d7 faster\n", + "that avoid `groupby().apply()`. 
Don't worry about that yet \u2014 get a\n", + "correct answer first, optimize only if you find yourself waiting.\n" + ] + } + ] +} diff --git a/notebooks/getting_started/README.md b/notebooks/getting_started/README.md new file mode 100644 index 0000000..a74649a --- /dev/null +++ b/notebooks/getting_started/README.md @@ -0,0 +1,15 @@ +# Tutorial notebooks + +Read these in order: + +1. **`00_welcome.ipynb`** — what's the project, where the data lives, + how to use a Jupyter notebook. +2. **`01_python_pandas_basics.ipynb`** — minimum Python and pandas you + need to read project code. +3. **`02_explore_one_database.ipynb`** — open one tracking DB, plot a + trajectory, compute a single distance. +4. **`03_compare_trained_vs_naive.ipynb`** — first real analysis, + comparing groups. + +After these, the notebooks one level up (`flies_analysis*.ipynb`) walk +through the full analysis pipeline that the previous student built.