cupido/tasks/todo.md
Giorgio Gilestro f60a9d0530 Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync
- Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of
  ownCloud to avoid sync conflicts and bandwidth churn). config.py
  TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab
  mounts it world-readable for JupyterHub users.
- New scripts/export_video_db_index.py joins all_video_info_merged.xlsx
  with the video inventory and the on-disk DBs, producing a TSV that has
  one row per fly/ROI plus training/testing video and DB paths. Handles
  approximate xlsx times, cross-day training/testing, the 12 AM/PM
  ambiguity, and date typos.
- scripts/load_roi_data.py rewritten as a TSV-driven loader returning a
  single DataFrame with session and metadata columns. calculate_distances
  and the two flies_analysis notebooks migrated to use it; downstream
  trained/naive splits remain available via simple equality filters.
- Metadata vocabulary canonicalized: {naïve, niave, untrained, test} all
  resolve to {trained, naive}. Normalization happens at the TSV-export
  boundary (idempotent); the xlsx and the 2025-07-15 legacy CSV were
  edited in place to remove the worst variants.
- scripts/monitor_tracking.py rate calculation fixed: with N parallel
  workers, completions arrive in bursts; the old formula divided by burst
  width and reported nonsense rates. Now uses a 6 h window denominator.
- scripts/track_videos.py: BGRMovieCamera retries cv2.read on transient
  NFS hiccups and a post-tracking completeness gate (≥ 90 % of expected
  duration via MAX(t) across all 6 ROIs) deletes silent partial DBs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:20:14 +01:00

# Task List
## Completed Work
- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff
## Priority: Bimodal Hypothesis Analysis
See `docs/bimodal_hypothesis.md` for detailed methodology.
### Phase 1: Per-ROI Feature Extraction
- [ ] Compute per-ROI summary statistics from aligned distance data
  - Mean distance post-opening (0-300s)
  - Median distance post-opening
  - Fraction of time at distance < 50px ("close proximity")
  - Mean max velocity post-opening
- [ ] Create a summary DataFrame with N=18 trained + N=18 untrained rows
- [ ] **Note**: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)
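A minimal sketch of the Phase 1 summary, assuming the aligned data is a long-format pandas DataFrame. The column names (`group`, `machine_name`, `roi`, `t` in seconds relative to barrier opening, `distance` in px, `max_velocity`) are placeholders for illustration, not the actual schema returned by `load_roi_data.py`.

```python
import pandas as pd

def summarize_rois(aligned, window=(0.0, 300.0), close_px=50.0):
    """Return one row per (group, machine_name, roi) with post-opening metrics.

    Column names are hypothetical; adapt them to the real loader output.
    """
    t0, t1 = window
    # Keep only the post-opening window and mark "close proximity" samples.
    post = aligned[(aligned["t"] >= t0) & (aligned["t"] <= t1)]
    post = post.assign(is_close=post["distance"] < close_px)
    summary = post.groupby(["group", "machine_name", "roi"]).agg(
        mean_distance=("distance", "mean"),
        median_distance=("distance", "median"),
        frac_close=("is_close", "mean"),  # mean of booleans = fraction of time
        mean_max_velocity=("max_velocity", "mean"),
    )
    return summary.reset_index()
```

Grouping on `machine_name` as well as `roi` keeps ROIs from different machines apart; for the 2024 batch a `date` key would probably be needed too.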
### Phase 2: Distribution Visualization
- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained
### Phase 3: Formal Bimodality Testing
- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components
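The 1-vs-2 component comparison could look roughly like this, assuming scikit-learn's `GaussianMixture` is the tool of choice (the task list does not name a library). `compare_gmm_bic` is a hypothetical helper name; it takes one per-ROI metric as a 1-D array.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def compare_gmm_bic(values, max_components=2, seed=0):
    """Fit 1..max_components Gaussian mixtures and return {k: BIC}.

    Lower BIC is better; BIC(2) < BIC(1) is evidence for two components.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)  # sklearn expects 2-D input
    bic = {}
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(x)
        bic[k] = gmm.bic(x)
    return bic
```

With roughly 18 ROIs per group the BIC difference will be noisy, so it is probably worth reading alongside the dip test rather than on its own. If two components win, `GaussianMixture.predict_proba` gives the posteriors needed for Phase 4.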
### Phase 4: Subgroup Identification
- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)
### Phase 5: Effect Size Re-estimation
- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as random effect
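A stdlib-only sketch of the bootstrap step, using Cohen's d as the effect size and a percentile interval. It resamples ROIs independently within each group, so it does not yet account for session; resampling whole sessions would be one way to respect that structure. The function names are illustrative.

```python
import random
import statistics

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5

def bootstrap_ci(a, b, stat=cohens_d, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(a, b), resampling within each group."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        ra = rng.choices(a, k=len(a))  # sample with replacement
        rb = rng.choices(b, k=len(b))
        try:
            draws.append(stat(ra, rb))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    if not draws:
        raise ValueError("no valid bootstrap draws")
    draws.sort()
    n = len(draws)
    lo = draws[int((alpha / 2) * n)]
    hi = draws[min(n - 1, int((1 - alpha / 2) * n))]
    return lo, hi
```

The same `bootstrap_ci` works for other statistics (for example a difference of medians) by passing a different `stat`.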
## Maintenance Items
- [ ] Investigate missing Machine 139 data (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating
## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)
### Recap
Tracked so far: 5 sessions, all from 2025-07-15 (machines 076/145/268). The DBs in
`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template
`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).

The metadata file `../all_video_info_merged.xlsx` indexes a different set of
experiments: 7 dates from 2024-09-17 to 2024-10-21, 16 ethoscope machines,
63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked
sessions are in this xlsx; these are fresh recordings to track.**

Inventory: see `data/metadata/video_inventory.csv` (built by
`scripts/build_video_inventory.py`).
- 1163 video sessions on disk under `/mnt/ethoscope_data/videos/`
- 63/63 xlsx (date, machine) sessions have video on disk
- 129 video instances need tracking (some (date, machine) have 2-4 recordings/day)
### Plan
The HD-mating-arena videos have no auto-detectable targets, so the user must
manually click 3 reference points (L-shape: top, corner, left) per video. Once
all targets are picked, tracking can run in the background.
- [x] **Step 1 — Inventory**: `scripts/build_video_inventory.py`
writes `data/metadata/video_inventory.csv`. 63 (date, machine) sessions match
the xlsx, all videos found, 129 video instances need tracking.
- [x] **Step 2 — Manual target picker**: `scripts/pick_targets.py`. Loops over
videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows
a representative frame, captures 3 clicks (top, corner, left), saves
`data/targets/<video_basename>.json`. Skips videos already done.
- [x] **Step 3 — Background tracker**: `scripts/track_videos.py`. Reads target
JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs
`MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes
`data/tracked/<basename>_tracking.db`. Idempotent. Smoke-tested
end-to-end: 90s of video yields ~3000 rows/ROI, with areas in the 800-2000 band.
- [x] **Step 4 — Tracking deps**: `requirements-tracking.txt`.
### Still TODO
- [ ] User to run `pick_targets.py` (interactive; needs DISPLAY) on the 129
pending videos.
- [ ] Run `track_videos.py --jobs 4` against the resulting JSONs.
- [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that
DO have visible targets (saves clicks). Confirmed not useful on the
2025-07-15 batch (these arenas don't have black target dots), but worth
trying on 2024 batches before falling back to manual.
- [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4
recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4).
One of them is at lower resolution (1280x960), likely an aborted take.
### Open questions / risks
- Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on
2024-09-17). Need to figure out which is the real "test" video vs aborted
takes, possibly using video duration or filename pattern.
- One mismatched-resolution file: `1280x960@25fps-20q` instead of
`1920x1088@25fps-28q`; flag for inspection.
- The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo;
`MultiFlyTracker` is its likely successor. Validate output schema matches
what the existing analysis pipeline expects (`load_roi_data.py`, etc.).
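The resolution check mentioned above could be automated from the encoding tag in the filename. A rough sketch, assuming every video name embeds a tag of the form `1920x1088@25fps-28q`; the surrounding filename layout and the helper names are hypothetical.

```python
import re

# Matches the encoding tag quoted above, e.g. "1920x1088@25fps-28q".
SPEC_RE = re.compile(r"(\d+)x(\d+)@(\d+)fps-(\d+)q")
EXPECTED_RESOLUTION = (1920, 1088)

def parse_spec(filename):
    """Return width/height/fps/q parsed from the filename, or None if absent."""
    m = SPEC_RE.search(filename)
    if m is None:
        return None
    width, height, fps, q = map(int, m.groups())
    return {"width": width, "height": height, "fps": fps, "q": q}

def flag_odd_resolution(filenames, expected=EXPECTED_RESOLUTION):
    """List files whose tag is missing or whose resolution differs from expected."""
    flagged = []
    for name in filenames:
        spec = parse_spec(name)
        if spec is None or (spec["width"], spec["height"]) != expected:
            flagged.append(name)
    return flagged
```

This only flags candidates for inspection; deciding which recording is the real "test" video still needs duration or manual review.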
## Discovered During Work
### Barrier-opening annotation for the 2024 batch (added 2026-04-30)
The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening
event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file
covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch
(`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation
yet, so all post-alignment cells silently exclude that data.
- [ ] Build a small picker that lets the user scrub through each tracking
DB / video and mark the barrier-opening frame, writing a row to a new
`data/metadata/barrier_opening_2024.csv` (or extend the existing
file with a date column).
- [ ] Once the 2024 entries exist, update `align_to_opening_time` so it
pulls from a unified `barrier_opening` table keyed by
`(date, machine_name)` rather than `machine_name` alone.
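One possible shape for the unified lookup, sketched with a pandas merge on the composite key. The column names `t` and `opening_t` and the function name `align_to_opening` are assumptions for illustration; the real `align_to_opening_time` signature may differ. Sessions without an annotation are returned explicitly rather than dropped silently.

```python
import pandas as pd

KEY = ["date", "machine_name"]

def align_to_opening(tracks, openings):
    """Shift `t` so that barrier opening is t=0, keyed by (date, machine_name).

    Returns (aligned, missing) where `missing` lists sessions with no annotation.
    """
    merged = tracks.merge(
        openings,
        on=KEY,
        how="left",
        validate="many_to_one",  # at most one opening row per session
        indicator=True,
    )
    unmatched = merged["_merge"] == "left_only"
    missing = merged.loc[unmatched, KEY].drop_duplicates().reset_index(drop=True)
    aligned = (
        merged.loc[~unmatched]
        .drop(columns="_merge")
        .assign(t_aligned=lambda d: d["t"] - d["opening_t"])
        .reset_index(drop=True)
    )
    return aligned, missing
```

`validate="many_to_one"` makes the merge fail loudly if the barrier table ever holds two rows for the same session, which is the failure mode a `machine_name`-only key would hide.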
### Metadata vocabulary normalization (done 2026-04-30)
The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`,
`'untrained'`, plus trailing whitespace). All sources now use a single
canonical `'naive'`. Normalization happens in
`scripts/export_video_db_index.py` so re-running it from the xlsx always
produces a clean TSV. The 2025-07-15 legacy CSV
(`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from
`'untrained'` to `'naive'`.
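For reference, an idempotent normalizer along these lines would cover the variants listed above. This is a sketch, not the code in `scripts/export_video_db_index.py`; the lowercasing step and the name `normalize_label` are assumptions.

```python
import unicodedata

# Variants of the control label seen in the xlsx, after accent stripping.
_CONTROL_VARIANTS = {"naive", "niave", "untrained"}

def normalize_label(raw):
    """Map any control-label variant to 'naive'; pass other labels through."""
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    # Drop combining marks so that 'naïve' becomes 'naive'.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "naive" if text in _CONTROL_VARIANTS else text
```

Applying it twice gives the same result as applying it once, so re-exporting an already clean TSV is harmless.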