# Task List

## Completed Work

- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff

## Priority: Bimodal Hypothesis Analysis

See `docs/bimodal_hypothesis.md` for detailed methodology.

### Phase 1: Per-ROI Feature Extraction

- [ ] Compute per-ROI summary statistics from aligned distance data
  - Mean distance post-opening (0-300s)
  - Median distance post-opening
  - Fraction of time at distance < 50px ("close proximity")
  - Mean max velocity post-opening
- [ ] Create a summary DataFrame with N=18 trained + N=18 untrained rows
- [ ] **Note**: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)

### Phase 2: Distribution Visualization

- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained

### Phase 3: Formal Bimodality Testing

- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components

### Phase 4: Subgroup Identification

- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)

### Phase 5: Effect Size Re-estimation

- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as random effect

## Maintenance Items

- [ ] Investigate missing Machine 139 data
  (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating

## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)

### Recap

Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). The DBs in `data/raw/` use tracker `ConstrainedMultiFlyTracker` and template `HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).

The metadata file `../all_video_info_merged.xlsx` indexes a different set of experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines, 63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked sessions are in this xlsx - these are fresh recordings to track.**

Inventory: see `data/metadata/video_inventory.csv` (built by `scripts/build_video_inventory.py`).

- 1163 video sessions on disk under `/mnt/ethoscope_data/videos/`
- 63/63 xlsx (date, machine) sessions have video on disk
- 129 video instances need tracking (some (date, machine) have 2-4 recordings/day)

### Plan

The HD-mating-arena videos have no auto-detectable targets, so the user must manually click 3 reference points (L-shape: top, corner, left) per video. Once all targets are picked, tracking can run in the background.

- [x] **Step 1 - Inventory**: `scripts/build_video_inventory.py` → `data/metadata/video_inventory.csv`. 63 (date,machine) sessions match the xlsx, all videos found, 129 video instances need tracking.
- [x] **Step 2 - Manual target picker**: `scripts/pick_targets.py`. Loops over videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows a representative frame, captures 3 clicks (top, corner, left), saves `data/targets/.json`. Skips videos already done.
- [x] **Step 3 - Background tracker**: `scripts/track_videos.py`.
  Reads target JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs `MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes `data/tracked/_tracking.db`. Idempotent. Smoke-tested end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band.
- [x] **Step 4 - Tracking deps**: `requirements-tracking.txt`.

### Still TODO

- [ ] User to run `pick_targets.py` (interactive, needs DISPLAY) on the 129 pending videos.
- [ ] Run `track_videos.py --jobs 4` against the resulting JSONs.
- [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that DO have visible targets (saves clicks). Confirmed not useful on the 2025-07-15 batch (these arenas don't have black target dots), but worth trying on 2024 batches before falling back to manual.
- [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4 recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4). One of them is at lower resolution (1280x960), likely an aborted take.

### Open questions / risks

- Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on 2024-09-17). Need to figure out which is the real "test" video vs aborted takes, possibly using video duration or filename pattern.
- One mismatched-resolution file: `1280x960@25fps-20q` instead of `1920x1088@25fps-28q`. Flag for inspection.
- The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo; `MultiFlyTracker` is its likely successor. Validate output schema matches what the existing analysis pipeline expects (`load_roi_data.py`, etc.).

## Discovered During Work

### Barrier-opening annotation for the 2024 batch (added 2026-04-30)

The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file covers only the 5 machines in the 2025-07-15 experiment.
The 2024 batch (`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation yet, so all post-alignment cells silently exclude that data.

- [ ] Build a small picker that lets the user scrub through each tracking DB / video and mark the barrier-opening frame, writing a row to a new `data/metadata/barrier_opening_2024.csv` (or extend the existing file with a date column).
- [ ] Once the 2024 entries exist, update `align_to_opening_time` so it pulls from a unified `barrier_opening` table keyed by `(date, machine_name)` rather than `machine_name` alone.

### Metadata vocabulary normalization (done 2026-04-30)

The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`, `'untrained'` plus trailing whitespace). All sources now use a single canonical `'naive'`. Normalization happens in `scripts/export_video_db_index.py` so re-running it from the xlsx always produces a clean TSV. The 2025-07-15 legacy CSV (`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from `'untrained'` → `'naive'`.
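
For reference, the label normalization described above could look roughly like the sketch below. This is not the code in `scripts/export_video_db_index.py`, which is not reproduced here. The names `normalize_label` and `_ALIASES` are hypothetical, and passing unknown labels (e.g. `'trained'`) through unchanged apart from trimming, lowercasing and accent removal is an assumption.

```python
import unicodedata

# Hypothetical alias table: every known spelling of the control label
# maps to the canonical 'naive'. Extend if new variants turn up.
_ALIASES = {
    "naive": "naive",
    "niave": "naive",      # common misspelling seen in the xlsx
    "untrained": "naive",  # older vocabulary for the same group
}


def normalize_label(raw: str) -> str:
    """Return a canonical condition label for one metadata cell."""
    # Trim surrounding whitespace and lowercase before matching.
    text = raw.strip().lower()
    # Decompose accented characters (e.g. 'ï' -> 'i' + combining mark),
    # then drop the combining marks so 'naïve' becomes 'naive'.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: labels not in the alias table pass through in cleaned form.
    return _ALIASES.get(text, text)


if __name__ == "__main__":
    for label in ["naïve", "niave ", " untrained", "trained"]:
        print(repr(label), "->", repr(normalize_label(label)))
```

Keeping the mapping in one function means the same cleanup can be applied to the xlsx export and to any legacy CSV, instead of editing files by hand.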