cupido/tasks/todo.md
Giorgio Gilestro 23050360ea Remove data/raw/ entirely — all bulky data now under /mnt/data/projects/cupido/
Deleted the 5 stale pre-pipeline tracking DBs and the data/raw/ directory.
Dropped DATA_RAW from config.py; build_video_inventory now scans
TRACKING_OUTPUT_DIR for already-tracked sessions. Notebooks no longer
import DATA_RAW. README, PLANNING and todo updated to reflect that the
repo holds only code + small curated metadata, never bulky DBs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 09:20:25 +01:00

Task List

Completed Work

  • Extract ROI data from SQLite databases grouped by trained/untrained
  • Calculate inter-fly distances at each time point
  • Align data to barrier opening time (t=0)
  • Plot average distance over time (entire experiment + 300s window)
  • Track fly identities across frames (Hungarian algorithm)
  • Calculate max velocity over 10-second moving windows
  • Statistical tests (t-tests, Cohen's d) comparing groups
  • ML classification attempt (Logistic Regression, Random Forest)
  • Clustering analysis (K-means)
  • Organize project structure for student handoff

Priority: Bimodal Hypothesis Analysis

See docs/bimodal_hypothesis.md for detailed methodology.

Phase 1: Per-ROI Feature Extraction

  • Compute per-ROI summary statistics from aligned distance data
    • Mean distance post-opening (0-300s)
    • Median distance post-opening
    • Fraction of time at distance < 50px ("close proximity")
    • Mean max velocity post-opening
  • Create a summary DataFrame with N=18 trained + N=18 untrained rows
  • Note: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)
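
The distance-based features above could be computed per ROI roughly as follows. This is a minimal stdlib-only sketch, not the project's implementation: `summarize_roi` and the `(t_aligned_s, distance_px)` input layout are hypothetical, and the velocity feature is left out because its input format is not described here.

```python
from statistics import mean, median

CLOSE_PX = 50        # "close proximity" threshold from the task list
WINDOW_S = (0, 300)  # post-opening window, in seconds after barrier opening


def summarize_roi(samples, window=WINDOW_S, close_px=CLOSE_PX):
    """Summarize one ROI from (t_aligned_s, distance_px) pairs.

    t_aligned_s is time relative to barrier opening (t=0).
    Returns None when the ROI has no samples inside the window.
    """
    start, end = window
    dists = [d for t, d in samples if start <= t <= end]
    if not dists:
        return None
    return {
        "mean_distance": mean(dists),
        "median_distance": median(dists),
        # Fraction of in-window samples closer than the threshold.
        "frac_close": sum(d < close_px for d in dists) / len(dists),
        "n_samples": len(dists),
    }


# Made-up samples for illustration only.
samples = [(-10, 200.0), (0, 120.0), (100, 40.0),
           (200, 30.0), (300, 90.0), (400, 10.0)]
summary = summarize_roi(samples)
```

One such dictionary per ROI, plus a group label, gives the rows of the summary DataFrame; ROIs that return None (for example the Machine 139 ones) simply contribute no row.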

Phase 2: Distribution Visualization

  • Plot histograms/KDE of per-ROI metrics for each group
  • Look for bimodality in trained group vs unimodality in untrained

Phase 3: Formal Bimodality Testing

  • Hartigan's dip test on trained per-ROI distributions
  • Fit Gaussian Mixture Models (1 vs 2 components) to trained data
  • Compare BIC scores to determine optimal number of components
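
To make the 1-vs-2-component comparison concrete, here is a small stdlib-only sketch of a 1-D Gaussian mixture fitted by EM, with BIC computed as k·ln(n) − 2·ln(L). The function names and the example values are hypothetical, and the dip test is not covered. The posteriors it returns are what Phase 4 would threshold.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, weights, mus, variances):
    total = 0.0
    for x in xs:
        p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))
        total += math.log(max(p, 1e-300))  # guard against underflow to 0
    return total


def fit_gmm_1d(xs, k, n_iter=200, var_floor=1e-6):
    """Fit a k-component 1-D GMM by EM. Returns (bic, posteriors).

    posteriors[i][j] is the probability that xs[i] belongs to component j.
    Requires len(xs) >= k.
    """
    n = len(xs)
    ordered = sorted(xs)
    # Initialise means from k contiguous chunks of the sorted data.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    mus = [sum(c) / len(c) for c in chunks]
    grand_mean = sum(xs) / n
    var0 = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [var0] * k
    weights = [1.0 / k] * k
    posteriors = []
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        posteriors = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(p) or 1e-300
            posteriors.append([pi / total for pi in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in posteriors)
            if nj < 1e-12:
                continue
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(posteriors, xs)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(posteriors, xs)) / nj
            variances[j] = max(var_j, var_floor)  # floor avoids variance collapse
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    bic = n_params * math.log(n) - 2 * log_likelihood(xs, weights, mus, variances)
    return bic, posteriors


# Made-up, clearly bimodal values for illustration only.
xs = [10, 11, 12, 13, 14, 50, 51, 52, 53, 54]
bic1, _ = fit_gmm_1d(xs, k=1)
bic2, post = fit_gmm_1d(xs, k=2)
prefers_two_components = bic2 < bic1
```

A lower BIC favours that model. For the real analysis, scikit-learn's `GaussianMixture` (which exposes a `bic` method) is the more usual choice; the hand-rolled EM here only shows what is being compared.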

Phase 4: Subgroup Identification

  • If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
  • Compare learner subgroup vs untrained group (expect larger effect size)

Phase 5: Effect Size Re-estimation

  • Mann-Whitney U test (appropriate for small N)
  • Bootstrap confidence intervals for effect sizes
  • Account for session as random effect
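
The first two items can be sketched with the standard library alone. This is an illustrative version under stated assumptions: the function names are hypothetical, the bootstrap uses a simple percentile interval, and it resamples ROIs independently, so it does not yet account for session as a random effect.

```python
import random
from statistics import mean


def mann_whitney_u(a, b):
    """U statistic for sample a: number of (x, y) pairs with x > y, ties count 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


def cohens_d(a, b):
    """Cohen's d for b relative to a, using the pooled standard deviation."""
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((y - mb) ** 2 for y in b)
    pooled_sd = (ss / (len(a) + len(b) - 2)) ** 0.5
    return (mb - ma) / pooled_sd


def bootstrap_ci(a, b, stat=cohens_d, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(a, b), resampling each group with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        try:
            stats.append(stat(ra, rb))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Made-up per-ROI values for illustration only.
group_a = [1, 2, 3, 4, 5, 6]
group_b = [11, 12, 13, 14, 15, 16]
u = mann_whitney_u(group_a, group_b)
d = cohens_d(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a, group_b)
```

One possible way to respect the session structure would be to resample whole sessions instead of individual ROIs inside `bootstrap_ci`; a mixed-effects model is the other obvious route.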

Maintenance Items

  • Investigate missing Machine 139 data (has metadata but no tracking DB)
  • Add diptest to requirements.txt when starting bimodal analysis
  • Consider converting pixel distances to physical units (need calibration)
  • The second notebook (flies_analysis.ipynb) re-runs from DB extraction - consider deprecating

Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)

Recap

Tracked so far: 5 sessions, all from 2025-07-15 (machines 076/145/268). These were re-tracked through the unified pipeline and now live at /mnt/data/projects/cupido/tracked/. There is no separate data/raw/ anymore; the old pre-pipeline copies were deleted on 2026-05-01.

The metadata file /mnt/data/projects/cupido/all_video_info_merged.xlsx indexes a different set of experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines, 63 unique (date, machine) sessions = 484 ROI-rows.

Inventory: see data/metadata/video_inventory.csv (built by scripts/build_video_inventory.py).

  • 1163 video sessions on disk under /mnt/ethoscope_data/videos/
  • 63/63 xlsx (date, machine) sessions have video on disk
  • 129 video instances need tracking (some (date, machine) sessions have 2-4 recordings/day)

Plan

The HD-mating-arena videos have no auto-detectable targets — the user must manually click 3 reference points (L-shape: top, corner, left) per video. Once all targets are picked, tracking can run in the background.

  • Step 1 — Inventory: scripts/build_video_inventory.py → data/metadata/video_inventory.csv. 63 (date,machine) sessions match the xlsx, all videos found, 129 video instances need tracking.
  • Step 2 — Manual target picker: scripts/pick_targets.py. Loops over videos with in_xlsx & ~already_tracked & no JSON yet; per video, shows a representative frame, captures 3 clicks (top, corner, left), saves data/targets/<video_basename>.json. Skips videos already done.
  • Step 3 — Background tracker: scripts/track_videos.py. Reads target JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs MovieVirtualCamera + MultiFlyTracker + SQLiteResultWriter, writes data/tracked/<basename>_tracking.db. Idempotent. Smoke-tested end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band.
  • Step 4 — Tracking deps: requirements-tracking.txt.
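
The selection rule in Step 2 (in_xlsx & ~already_tracked & no JSON yet) is what makes the picker safe to re-run. A minimal sketch of that filter, assuming the inventory CSV has `in_xlsx` and `already_tracked` columns plus a `video_path` column (the last name is hypothetical), and that the JSON basename is the video filename without its extension:

```python
import csv
from pathlib import Path


def _is_true(value):
    """Interpret a CSV cell as a boolean flag."""
    return str(value).strip().lower() in {"1", "true", "yes"}


def pending_videos(inventory_csv, targets_dir):
    """Return video paths that still need manual target picking.

    A video is pending when it is listed in the xlsx, has not been
    tracked yet, and has no <basename>.json in targets_dir.
    """
    targets_dir = Path(targets_dir)
    pending = []
    with open(inventory_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if not _is_true(row["in_xlsx"]) or _is_true(row["already_tracked"]):
                continue
            basename = Path(row["video_path"]).stem
            if (targets_dir / f"{basename}.json").exists():
                continue  # targets already picked: skip, keeps re-runs idempotent
            pending.append(row["video_path"])
    return pending
```

Calling `pending_videos("data/metadata/video_inventory.csv", "data/targets")` after an interrupted picking session would then return only the videos that are still missing a target file.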

Still TODO

  • User to run pick_targets.py (interactive — needs DISPLAY) on the 129 pending videos.
  • Run track_videos.py --jobs 4 against the resulting JSONs.
  • (Optional) auto_detect_targets.py exists as a fallback for videos that DO have visible targets (saves clicks). Confirmed not useful on the 2025-07-15 batch — these arenas don't have black target dots — but worth trying on 2024 batches before falling back to manual.
  • Decide what to do with the 4 (date, machine) sessions that have 3-4 recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4). One of them is at lower resolution (1280x960) — likely an aborted take.

Open questions / risks

  • Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on 2024-09-17). Need to figure out which is the real "test" video vs aborted takes — possibly use video duration or filename pattern.
  • One mismatched-resolution file: 1280x960@25fps-20q instead of 1920x1088@25fps-28q — flag for inspection.
  • The original ConstrainedMultiFlyTracker is no longer in the ethoscope repo; MultiFlyTracker is its likely successor. Validate output schema matches what the existing analysis pipeline expects (load_roi_data.py, etc.).
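
Flagging the mismatched-resolution recording can be automated if the encoding string is part of the video filename, which is an assumption here. The sketch below parses strings of the form 1920x1088@25fps-28q and lists any file that deviates from the expected encoding; the function names and example filenames are hypothetical.

```python
import re

# Matches e.g. "1920x1088@25fps-28q": width, height, fps, quality.
ENCODING_RE = re.compile(r"(\d+)x(\d+)@(\d+)fps-(\d+)q")
EXPECTED = (1920, 1088, 25, 28)


def encoding_of(name):
    """Return (width, height, fps, quality) parsed from name, or None."""
    match = ENCODING_RE.search(name)
    return tuple(int(g) for g in match.groups()) if match else None


def flag_mismatched(names, expected=EXPECTED):
    """Return the names whose encoding is missing or differs from expected."""
    return [n for n in names if encoding_of(n) != expected]


# Hypothetical filenames for illustration only.
names = [
    "take1_1920x1088@25fps-28q.mp4",
    "take2_1280x960@25fps-20q.mp4",
]
suspects = flag_mismatched(names)
```

The same list could feed the decision about which of the 3-4 recordings per day is the real test video, alongside video duration.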

Discovered During Work

Barrier-opening annotation for the 2024 batch (added 2026-04-30)

The current flies_analysis*.ipynb aligns trajectories to a barrier-opening event sourced from data/metadata/2025_07_15_barrier_opening.csv. That file covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch (/mnt/data/projects/cupido/tracked/, 113 DBs) has no equivalent annotation yet, so all post-alignment cells silently exclude that data.

  • Build a small picker that lets the user scrub through each tracking DB / video and mark the barrier-opening frame, writing a row to a new data/metadata/barrier_opening_2024.csv (or extend the existing file with a date column).
  • Once the 2024 entries exist, update align_to_opening_time so it pulls from a unified barrier_opening table keyed by (date, machine_name) rather than machine_name alone.
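
A unified table keyed by (date, machine_name) could look like the sketch below. The column names `date`, `machine_name` and `opening_time_s`, the helper names, and the example values are all hypothetical. Raising on a missing key is a deliberate choice, so that unannotated sessions fail loudly instead of being silently dropped.

```python
import csv
import io


def load_barrier_openings(fh):
    """Read barrier-opening rows into a dict keyed by (date, machine_name)."""
    table = {}
    for row in csv.DictReader(fh):
        table[(row["date"], row["machine_name"])] = float(row["opening_time_s"])
    return table


def align_times(times_s, date, machine_name, table):
    """Shift times so that barrier opening is t=0 for this session."""
    key = (date, machine_name)
    if key not in table:
        raise KeyError(f"no barrier-opening annotation for {key}")
    t0 = table[key]
    return [t - t0 for t in times_s]


# Made-up annotations: the same machine on two dates has different openings.
csv_text = (
    "date,machine_name,opening_time_s\n"
    "2024-09-17,ETHOSCOPE_086,120\n"
    "2024-10-21,ETHOSCOPE_086,95.5\n"
)
table = load_barrier_openings(io.StringIO(csv_text))
aligned = align_times([100.0, 120.0, 420.0], "2024-09-17", "ETHOSCOPE_086", table)
```

Keying by machine_name alone would let the second row overwrite the first, which is why the date has to be part of the key once more than one experiment day is involved.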

Metadata vocabulary normalization (done 2026-04-30)

The xlsx had inconsistent labels for control flies ('naïve', 'niave', 'untrained' plus trailing whitespace). All sources now use a single canonical 'naive'. Normalization happens in scripts/export_video_db_index.py so re-running it from the xlsx always produces a clean TSV. The 2025-07-15 legacy CSV (data/metadata/2025_07_15_metadata_fixed.csv) was edited in place from 'untrained' → 'naive'.
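
For reference, a normalization of this kind can be sketched as follows. This is not the code in scripts/export_video_db_index.py: `normalize_label` and `CONTROL_ALIASES` are hypothetical names, and lowercasing is an extra assumption.

```python
import unicodedata

# Spellings observed for control flies, after accent stripping and trimming.
CONTROL_ALIASES = {"naive", "niave", "untrained"}


def normalize_label(label):
    """Map any control-fly spelling to the canonical 'naive'."""
    # Decompose accented characters (ï becomes i + combining diaeresis) ...
    text = unicodedata.normalize("NFKD", label)
    # ... then drop the combining marks and surrounding whitespace.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    return "naive" if text in CONTROL_ALIASES else text


cleaned = [normalize_label(s) for s in ["naïve", "niave ", " untrained", "trained"]]
```

Labels outside the alias set pass through unchanged apart from trimming and lowercasing, so an unexpected spelling stays visible in the output instead of being silently merged.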