cupido/tasks/todo.md

# Task List

## Completed Work

- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff

## Priority: Bimodal Hypothesis Analysis

See `docs/bimodal_hypothesis.md` for detailed methodology.

### Phase 1: Per-ROI Feature Extraction
- [ ] Compute per-ROI summary statistics from aligned distance data
  - Mean distance post-opening (0-300s)
  - Median distance post-opening
  - Fraction of time at distance < 50px ("close proximity")
  - Mean max velocity post-opening
- [ ] Create a summary DataFrame with one row per ROI (design: N=18 trained + N=18 untrained)
- [ ] **Note**: Only 30 of the 36 ROIs have data (Machine 139 missing = 6 ROIs lost), so expect 30 rows, not 36
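
A minimal sketch of the per-ROI summary step, for whoever picks this up. The column names (`roi_id`, `group`, `t`, `distance`, `max_velocity`) and the function name are assumptions, not the actual schema of the aligned data; the 50px threshold and the 0-300s window come from the list above. The demo rows are made-up placeholder values.

```python
import pandas as pd

CLOSE_PX = 50        # "close proximity" threshold from the task list
WINDOW_S = (0, 300)  # post-opening window in seconds


def summarize_rois(aligned: pd.DataFrame) -> pd.DataFrame:
    """Return one row per ROI with post-opening summary statistics.

    Assumes hypothetical columns: roi_id, group, t (seconds relative to
    barrier opening), distance (px), max_velocity.
    """
    in_window = (aligned["t"] >= WINDOW_S[0]) & (aligned["t"] <= WINDOW_S[1])
    post = aligned[in_window].assign(
        is_close=lambda df: df["distance"] < CLOSE_PX
    )
    return (
        post.groupby(["roi_id", "group"])
        .agg(
            mean_distance=("distance", "mean"),
            median_distance=("distance", "median"),
            frac_close=("is_close", "mean"),  # mean of booleans = fraction of time
            mean_max_velocity=("max_velocity", "mean"),
        )
        .reset_index()
    )


# Placeholder rows for illustration only (not experimental data).
demo = pd.DataFrame({
    "roi_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "group": ["trained"] * 4 + ["untrained"] * 4,
    "t": [-10, 0, 100, 200, -10, 0, 100, 200],
    "distance": [500, 40, 60, 20, 500, 100, 200, 300],
    "max_velocity": [9, 1, 2, 3, 9, 4, 5, 6],
})
summary = summarize_rois(demo)
print(summary)
```

ROIs with no tracking data simply produce no row, so the Machine 139 gap shows up as a shorter table rather than as NaN rows.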

### Phase 2: Distribution Visualization
- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained

### Phase 3: Formal Bimodality Testing
- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components
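
One possible way to run the GMM/BIC comparison with scikit-learn, shown on synthetic values so it runs standalone. The function name `compare_gmm_bic` and the synthetic numbers are assumptions for illustration; the real input would be one per-ROI metric for the trained group. The dip test is not covered here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def compare_gmm_bic(values, max_components=2, seed=0):
    """Fit 1..max_components Gaussian mixtures and return {k: BIC}."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)  # sklearn expects 2-D input
    bic = {}
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=seed)
        gmm.fit(x)
        bic[k] = gmm.bic(x)  # lower BIC = preferred model
    return bic


# Synthetic placeholder data: two well-separated clusters, 18 values total.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 9), rng.normal(200, 5, 9)])

bic = compare_gmm_bic(values)
best_k = min(bic, key=bic.get)
print(bic, "-> preferred number of components:", best_k)
```

With so few ROIs per group, BIC differences can be fragile, so it is probably worth reading this alongside the dip test rather than on its own.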

### Phase 4: Subgroup Identification
- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)
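
A sketch of the posterior-based labelling, again on synthetic values. Two things here are assumptions rather than decisions recorded in this list: that the "learner" component is the one with the smaller mean of the chosen metric, and that a 0.5 posterior cutoff is used. Both should be checked against `docs/bimodal_hypothesis.md` before use.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def label_learners(values, threshold=0.5, seed=0):
    """Return (posterior of 'learner' component, boolean learner labels).

    ASSUMPTION: the learner component is the one with the lower mean.
    Flip argmin to argmax if the chosen metric runs the other way.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5, random_state=seed).fit(x)
    learner_component = int(np.argmin(gmm.means_.ravel()))
    posterior = gmm.predict_proba(x)[:, learner_component]
    return posterior, posterior >= threshold


# Synthetic placeholder data: first 9 values low, last 9 values high.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 9), rng.normal(200, 5, 9)])

posterior, is_learner = label_learners(values)
print("learners:", int(is_learner.sum()), "of", len(values))
```

Keeping the posterior alongside the hard label makes it easy to flag ROIs that sit near the cutoff instead of forcing them into a subgroup.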

### Phase 5: Effect Size Re-estimation
- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as a random effect
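
A rough sketch of the first two items using SciPy and NumPy. It uses Cohen's d as the effect size (as in the completed t-test work) with a percentile bootstrap; the helper names, the number of resamples, and the synthetic inputs are assumptions. It does not model session at all, so the random-effect item still needs separate handling.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def bootstrap_ci(a, b, stat=cohens_d, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling each group with replacement."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    boots = np.empty(n_boot)
    for i in range(n_boot):
        boots[i] = stat(
            rng.choice(a, size=len(a), replace=True),
            rng.choice(b, size=len(b), replace=True),
        )
    lower, upper = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Synthetic placeholder groups (not experimental data).
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 15)
group_b = rng.normal(3.0, 1.0, 15)

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
d = cohens_d(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a, group_b)
print(f"U={u_stat:.1f}, p={p_value:.4g}, d={d:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```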

## Maintenance Items

- [ ] Investigate missing Machine 139 data (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs the pipeline starting from DB extraction; consider deprecating it

## Discovered During Work

(Add new items here as they come up during analysis)