- Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of
ownCloud to avoid sync conflicts and bandwidth churn). config.py
TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab
mounts it world-readable for JupyterHub users.
- New scripts/export_video_db_index.py joins all_video_info_merged.xlsx
with the video inventory and the on-disk DBs, producing a TSV that has
one row per fly/ROI plus training/testing video and DB paths. Handles
approximate xlsx times, cross-day training/testing, the 12 AM/PM
ambiguity, and date typos.
- scripts/load_roi_data.py rewritten as a TSV-driven loader returning a
single DataFrame with session and metadata columns. calculate_distances
and the two flies_analysis notebooks migrated to use it; downstream
trained/naive splits remain available via simple equality filters.
- Metadata vocabulary canonicalized: {naïve, niave, untrained, test} all
resolve to {trained, naive}. Normalization happens at the TSV-export
boundary (idempotent); the xlsx and the 2025-07-15 legacy CSV were
edited in place to remove the worst variants.
- scripts/monitor_tracking.py rate calculation fixed: with N parallel
workers, completions arrive in bursts; the old formula divided by burst
width and reported nonsense rates. Now uses a 6 h window denominator.
- scripts/track_videos.py: BGRMovieCamera now retries failed cv2 frame
reads on transient NFS hiccups, and a post-tracking completeness gate
(≥ 90 % of expected duration, via MAX(t) across all 6 ROIs) deletes
partial DBs that would otherwise go unnoticed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
140 lines
6.9 KiB
Markdown
# Task List

## Completed Work

- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff

## Priority: Bimodal Hypothesis Analysis

See `docs/bimodal_hypothesis.md` for detailed methodology.

### Phase 1: Per-ROI Feature Extraction
- [ ] Compute per-ROI summary statistics from aligned distance data
  - Mean distance post-opening (0-300s)
  - Median distance post-opening
  - Fraction of time at distance < 50px ("close proximity")
  - Mean max velocity post-opening
- [ ] Create a summary DataFrame with N=18 trained + N=18 untrained rows
- [ ] **Note**: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)

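The Phase 1 features can be sketched for a single ROI as follows. This is a minimal illustration, not the project's loader: it assumes the aligned data is available as `(t_seconds, distance_px)` pairs with t=0 at barrier opening, and the function name `summarize_roi` and the toy samples are hypothetical. The velocity feature is omitted because it needs per-fly positions.

```python
from statistics import mean, median


def summarize_roi(samples, window=(0.0, 300.0), close_px=50.0):
    """Summarize one ROI's aligned (t_seconds, distance_px) samples.

    Assumes roughly uniform sampling, so the fraction of samples below
    close_px approximates the fraction of time spent in close proximity.
    """
    start, end = window
    post = [d for t, d in samples if start <= t <= end]
    if not post:
        return None  # no post-opening data for this ROI
    return {
        "mean_distance": mean(post),
        "median_distance": median(post),
        "frac_close": sum(d < close_px for d in post) / len(post),
    }


# Toy input (invented values): two samples fall outside the 0-300 s window.
samples = [(-10.0, 200.0), (0.0, 100.0), (100.0, 40.0),
           (200.0, 20.0), (400.0, 300.0)]
features = summarize_roi(samples)
```

Calling this once per ROI and collecting the dictionaries, together with the group label, would give the rows of the summary DataFrame described above.
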
### Phase 2: Distribution Visualization
- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained

### Phase 3: Formal Bimodality Testing
- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components

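For the 1-vs-2 component comparison, it may help to write the criterion out. The following assumes a one-dimensional mixture fitted to a single per-ROI metric, with $K$ components, $n$ trained ROIs, and maximized likelihood $\hat{L}_K$; the parameter count $p_K$ covers $K$ means, $K$ variances, and $K-1$ free weights.

```latex
\mathrm{BIC}_K = p_K \ln n - 2 \ln \hat{L}_K, \qquad p_K = 3K - 1
% Two components are preferred when BIC_2 < BIC_1, i.e.
2\left(\ln \hat{L}_2 - \ln \hat{L}_1\right) > (p_2 - p_1)\ln n = 3 \ln n
```

If all 18 trained ROIs were available, the threshold $3 \ln n$ would be about 8.7; with fewer ROIs (for example, if Machine 139 contributes trained ROIs that are missing) it is slightly lower, and the comparison has correspondingly less power.
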
### Phase 4: Subgroup Identification
- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)

### Phase 5: Effect Size Re-estimation
- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as random effect

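The bootstrap item could look roughly like the sketch below: a percentile bootstrap for Cohen's d using only the standard library. The helper names (`cohens_d`, `bootstrap_ci`), the resample count, and the toy values are assumptions for illustration. It resamples ROIs independently within each group and does not model session as a random effect.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, stat=cohens_d, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(a, b), resampling each group."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    estimates = []
    for _ in range(n_boot):
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        try:
            estimates.append(stat(ra, rb))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi


# Toy per-ROI mean distances in pixels (invented values).
trained = [120.0, 95.0, 110.0, 60.0, 55.0, 130.0]
naive = [140.0, 150.0, 135.0, 160.0, 145.0, 155.0]
d = cohens_d(trained, naive)
ci_low, ci_high = bootstrap_ci(trained, naive)
```

For the rank-based test itself, `scipy.stats.mannwhitneyu` is one existing implementation that could be applied to the same two lists.
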
## Maintenance Items

- [ ] Investigate missing Machine 139 data (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating

## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)

### Recap

Tracked so far: 5 sessions, all from 2025-07-15 (machines 076/145/268). The DBs in
`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template
`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).

The metadata file `../all_video_info_merged.xlsx` indexes a different set of
experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines,
63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked
sessions are in this xlsx; these are fresh recordings to track.**

Inventory: see `data/metadata/video_inventory.csv` (built by
`scripts/build_video_inventory.py`).
- 1163 video sessions on disk under `/mnt/ethoscope_data/videos/`
- 63/63 xlsx (date, machine) sessions have video on disk
- 129 video instances need tracking (some (date, machine) have 2-4 recordings/day)

### Plan

The HD-mating-arena videos have no auto-detectable targets, so the user must
manually click 3 reference points (L-shape: top, corner, left) per video. Once
all targets are picked, tracking can run in the background.

- [x] **Step 1 (Inventory)**: `scripts/build_video_inventory.py` →
      `data/metadata/video_inventory.csv`. 63 (date,machine) sessions match
      the xlsx, all videos found, 129 video instances need tracking.
- [x] **Step 2 (Manual target picker)**: `scripts/pick_targets.py`. Loops over
      videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows
      a representative frame, captures 3 clicks (top, corner, left), saves
      `data/targets/<video_basename>.json`. Skips videos already done.
- [x] **Step 3 (Background tracker)**: `scripts/track_videos.py`. Reads target
      JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs
      `MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes
      `data/tracked/<basename>_tracking.db`. Idempotent. Smoke-tested
      end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band.
- [x] **Step 4 (Tracking deps)**: `requirements-tracking.txt`.

### Still TODO
- [ ] User to run `pick_targets.py` (interactive; needs DISPLAY) on the 129
      pending videos.
- [ ] Run `track_videos.py --jobs 4` against the resulting JSONs.
- [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that
      DO have visible targets (saves clicks). Confirmed not useful on the
      2025-07-15 batch (these arenas don't have black target dots), but worth
      trying on 2024 batches before falling back to manual.
- [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4
      recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4).
      One of them is at lower resolution (1280x960), likely an aborted take.

### Open questions / risks

- Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on
  2024-09-17). Need to figure out which is the real "test" video vs aborted
  takes, possibly using video duration or filename pattern.
- One mismatched-resolution file: `1280x960@25fps-20q` instead of
  `1920x1088@25fps-28q`; flag for inspection.
- The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo;
  `MultiFlyTracker` is its likely successor. Validate output schema matches
  what the existing analysis pipeline expects (`load_roi_data.py`, etc.).

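One way to flag the mismatched-resolution recording automatically is to parse the encoding token out of each filename. The sketch below assumes only that the filename contains a token of the form `<width>x<height>@<fps>fps-<quality>q`, as in the two examples above; the full filenames in the demo are hypothetical, as are the helper names.

```python
import re

# Matches tokens such as "1920x1088@25fps-28q".
ENCODING_RE = re.compile(r"(\d+)x(\d+)@(\d+)fps-(\d+)q")
EXPECTED = (1920, 1088, 25, 28)  # width, height, fps, quality


def parse_encoding(name):
    """Return (width, height, fps, quality), or None if no token is found."""
    m = ENCODING_RE.search(name)
    return tuple(int(g) for g in m.groups()) if m else None


def flag_mismatches(names, expected=EXPECTED):
    """List the names whose encoding token is missing or differs."""
    return [n for n in names if parse_encoding(n) != expected]


# Hypothetical filenames for illustration only.
videos = [
    "session_a_1920x1088@25fps-28q.mp4",
    "session_b_1280x960@25fps-20q.mp4",
]
suspect = flag_mismatches(videos)
```

Resolution alone will not separate the "test" video from aborted takes at full resolution; duration would still need to be checked for those.
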
## Discovered During Work

### Barrier-opening annotation for the 2024 batch (added 2026-04-30)
The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening
event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file
covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch
(`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation
yet, so all post-alignment cells silently exclude that data.

- [ ] Build a small picker that lets the user scrub through each tracking
      DB / video and mark the barrier-opening frame, writing a row to a new
      `data/metadata/barrier_opening_2024.csv` (or extend the existing
      file with a date column).
- [ ] Once the 2024 entries exist, update `align_to_opening_time` so it
      pulls from a unified `barrier_opening` table keyed by
      `(date, machine_name)` rather than `machine_name` alone.

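A unified lookup keyed by `(date, machine_name)` might be sketched as below. The CSV column names (`date`, `machine_name`, `opening_s`), the function names, and the demo rows are all assumptions, not the existing file's schema or the current `align_to_opening_time` signature. Returning `None` for a missing annotation lets the caller report the gap rather than dropping the session silently.

```python
import csv
import io


def load_barrier_openings(csv_text):
    """Build {(date, machine_name): opening time in seconds} from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["machine_name"])
        if key in table:
            raise ValueError(f"duplicate barrier-opening entry for {key}")
        table[key] = float(row["opening_s"])
    return table


def align_times(times, date, machine_name, openings):
    """Shift times so that t=0 is barrier opening; None if unannotated."""
    t0 = openings.get((date, machine_name))
    if t0 is None:
        return None
    return [t - t0 for t in times]


# Hypothetical rows: the same machine appears on two dates, which a table
# keyed by machine_name alone could not represent.
demo_csv = """date,machine_name,opening_s
2024-09-17,ETHOSCOPE_086,120.0
2024-10-21,ETHOSCOPE_086,95.5
"""
openings = load_barrier_openings(demo_csv)
aligned = align_times([100.0, 120.0, 150.0], "2024-09-17", "ETHOSCOPE_086", openings)
missing = align_times([0.0], "2024-09-24", "ETHOSCOPE_086", openings)
```
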
### Metadata vocabulary normalization (done 2026-04-30)
The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`,
`'untrained'` plus trailing whitespace). All sources now use a single
canonical `'naive'`. Normalization happens in
`scripts/export_video_db_index.py` so re-running it from the xlsx always
produces a clean TSV. The 2025-07-15 legacy CSV
(`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from
`'untrained'` → `'naive'`.
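
An idempotent normalizer along these lines would cover the variants listed above. This is a sketch, not the code in `scripts/export_video_db_index.py`: the alias table and the function name `normalize_label` are assumptions, and unknown labels raise an error so that new typos are surfaced rather than passed through.

```python
import unicodedata

# Hypothetical alias table built from the variants mentioned above.
ALIASES = {
    "naive": "naive",
    "niave": "naive",
    "untrained": "naive",
    "trained": "trained",
}


def normalize_label(raw):
    """Map a raw group label to the canonical 'trained' or 'naive'."""
    # Trim whitespace, lowercase, and decompose accented characters.
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    # Drop combining marks, so "naïve" becomes "naive".
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    try:
        return ALIASES[text]
    except KeyError:
        raise ValueError(f"unknown group label: {raw!r}") from None


labels = [normalize_label(s) for s in ["na\u00efve ", "niave", "Untrained", "trained"]]
```

Because canonical values map to themselves, applying the function to its own output changes nothing, which is what makes re-running the export safe.
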