Initial commit: organized project structure for student handoff

Reorganized flat 41-file directory into structured layout with:
- scripts/ for Python analysis code with shared config.py
- notebooks/ for Jupyter analysis notebooks
- data/ split into raw/, metadata/, processed/
- docs/ with analysis summary, experimental design, and bimodal hypothesis tutorial
- tasks/ with todo checklist and lessons learned
- Comprehensive README, PLANNING.md, and .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Giorgio Gilestro, 2026-03-05 16:08:36 +00:00
Commit: e7e4db264d
27 changed files with 3105 additions and 0 deletions

tasks/lessons.md (new file, 38 lines)

# Lessons Learned
## Pseudoreplication Pitfall
**The most important lesson in this project.**
The raw data has ~230K data points per group, but the true independent samples are ROIs (N=18 per group). Each ROI contributes thousands of correlated time points. Running t-tests on all data points inflates significance massively (p < 1e-200) while the actual effect size is negligible (Cohen's d = 0.09).
**Rule**: Always compute per-ROI summary statistics first, then compare groups at the ROI level.
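The pitfall can be demonstrated on simulated data (a minimal sketch with hypothetical numbers, not the project's real distributions): each ROI gets its own baseline, so its thousands of time points are correlated, and pooling them shrinks the p-value far below what the ROI-level comparison supports.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical illustration: 6 ROIs per group, 1000 correlated points per ROI.
# Each ROI has its own baseline, so points within an ROI are NOT independent.
def simulate_group(n_rois=6, n_points=1000, shift=0.0):
    return [rng.normal(shift, 1.0) + rng.normal(0.0, 0.1, n_points)
            for _ in range(n_rois)]

trained, untrained = simulate_group(shift=0.2), simulate_group()

# WRONG: pooling every time point inflates N and deflates the p-value
p_pooled = stats.ttest_ind(np.concatenate(trained),
                           np.concatenate(untrained)).pvalue

# RIGHT: one summary value per ROI, then compare at the ROI level (N=6)
p_per_roi = stats.ttest_ind([r.mean() for r in trained],
                            [r.mean() for r in untrained]).pvalue
```

The pooled p-value is always more extreme here because its standard error shrinks with the (inflated) number of time points, not the number of independent ROIs.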
## Significance vs Effect Size
A tiny p-value does NOT mean a meaningful difference. With N=230K, even a Cohen's d of 0.09 (96% overlap between distributions) gives p < 1e-200. Always report and interpret effect sizes alongside p-values.
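A small helper for the pooled-SD Cohen's d used throughout (standard textbook formula, not project-specific code):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1)
                         + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd
```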
## Data Type Mismatches
Machine names are stored as integers in metadata (76, 145, 268) but as strings in some contexts, and barrier_opening.csv uses a zero-padded "076" format. `.astype(str)` alone yields "76", which will not match "076" — convert both sides to zero-padded strings (e.g. `.astype(str).str.zfill(3)`) before matching.
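A sketch of the normalisation before merging (the column names here are assumptions for illustration, not the actual CSV schema):

```python
import pandas as pd

meta = pd.DataFrame({"machine": [76, 145, 268]})            # ints in metadata
openings = pd.DataFrame({"machine": ["076", "145", "268"],  # zero-padded strings
                         "opening_time": [12.5, 13.0, 11.8]})

# Normalise both sides to zero-padded strings before merging
meta["machine"] = meta["machine"].astype(str).str.zfill(3)
merged = meta.merge(openings, on="machine", how="left")
```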
## Time Unit Mismatches
- SQLite databases: time `t` is in **milliseconds**
- `2025_07_15_barrier_opening.csv`: `opening_time` is in **seconds**
- Must multiply barrier opening times by 1000 before aligning
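The conversion in code (values here are hypothetical placeholders):

```python
SECONDS_TO_MS = 1000  # barrier_opening.csv is in seconds; SQLite t is in ms

opening_time_s = 42.5                 # hypothetical value read from the CSV
opening_time_ms = opening_time_s * SECONDS_TO_MS

# Align a tracking timestamp (already in ms) so barrier opening is t=0
t_aligned_ms = 43_000 - opening_time_ms
```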
## Missing Data
Machine 139 has 6 ROIs in the metadata (3 trained, 3 untrained) but:
- No tracking database file exists
- No entry in barrier_opening.csv
- This reduces the effective N from 18 to 15 per group
## Single-Fly Detection Handling
When only one fly is detected (instead of two), the tracker reports a single bounding box. If the area of that box is large (>1.5x median two-fly area), it likely means the flies are overlapping (distance ~0). If the area is small, one fly is probably out of frame (distance = NaN, excluded from analysis).
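The rule above as a small helper (a hypothetical function name and signature, mirroring the logic described rather than the repo's actual code):

```python
import math

def single_box_distance(box_area, median_two_fly_area, ratio=1.5):
    """Interpret a single detected bounding box: a large box means the
    flies overlap (distance ~0); a small box means one fly is out of
    frame (NaN, excluded from analysis downstream)."""
    if box_area > ratio * median_two_fly_area:
        return 0.0
    return math.nan
```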
## Path Management
All scripts use `from config import DATA_PROCESSED, FIGURES, ...` for consistent paths. Notebooks use `Path("..")` relative to the `notebooks/` directory. Never use hardcoded absolute paths.
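A sketch of what such a `config.py` can look like; `DATA_PROCESSED` and `FIGURES` are the names the scripts import, but the exact directory layout here is an assumption:

```python
# config.py -- shared, repo-relative paths (no hardcoded absolute paths)
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]  # scripts/ -> project root
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"
```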

tasks/todo.md (new file, 56 lines)

# Task List
## Completed Work
- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff
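The identity-tracking step above (Hungarian algorithm) can be sketched with `scipy.optimize.linear_sum_assignment` on a pairwise-distance cost matrix; the coordinates below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical positions of two flies in consecutive frames; the detector
# may report them in a different order each frame
prev = np.array([[10.0, 10.0], [50.0, 50.0]])
curr = np.array([[52.0, 49.0], [11.0, 12.0]])   # detection order swapped

# Pairwise distance cost matrix; the Hungarian algorithm picks the
# assignment minimising total displacement, preserving fly identities
cost = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
```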
## Priority: Bimodal Hypothesis Analysis
See `docs/bimodal_hypothesis.md` for detailed methodology.
### Phase 1: Per-ROI Feature Extraction
- [ ] Compute per-ROI summary statistics from aligned distance data
- Mean distance post-opening (0-300s)
- Median distance post-opening
- Fraction of time at distance < 50px ("close proximity")
- Mean max velocity post-opening
- [ ] Create a summary DataFrame with N=18 trained + N=18 untrained rows
- [ ] **Note**: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)
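The Phase 1 extraction can be sketched with a pandas groupby; the table below is a tiny hypothetical stand-in for the aligned distance data, not the real schema:

```python
import pandas as pd

# Hypothetical aligned data: one row per (roi, t), t in seconds post-opening
df = pd.DataFrame({
    "roi":      [1, 1, 1, 2, 2, 2],
    "group":    ["trained"] * 3 + ["untrained"] * 3,
    "t":        [10, 20, 30] * 2,
    "distance": [40.0, 60.0, 45.0, 80.0, 90.0, 85.0],
})

# Restrict to the post-opening window, then one summary row per ROI
post = df[(df["t"] >= 0) & (df["t"] <= 300)]
summary = (post.groupby(["roi", "group"])["distance"]
               .agg(mean_dist="mean",
                    median_dist="median",
                    frac_close=lambda d: (d < 50).mean())
               .reset_index())
```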
### Phase 2: Distribution Visualization
- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained
### Phase 3: Formal Bimodality Testing
- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components
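The GMM/BIC comparison in Phase 3 can be sketched as below, using synthetic (clearly bimodal) per-ROI values rather than the real data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic bimodal sample standing in for trained per-ROI means
x = np.concatenate([rng.normal(30, 3, 60),
                    rng.normal(80, 3, 60)]).reshape(-1, 1)

# Fit 1- vs 2-component mixtures and compare BIC (lower is better)
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
       for k in (1, 2)}
best_k = min(bic, key=bic.get)
```

For Phase 4, `GaussianMixture.predict_proba` on the fitted 2-component model gives the posteriors used to split "learner" vs "non-learner" ROIs.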
### Phase 4: Subgroup Identification
- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)
### Phase 5: Effect Size Re-estimation
- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as random effect
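A sketch of the Phase 5 tests on hypothetical per-ROI values (N=15 per group after the Machine 139 loss); the numbers are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trained = rng.normal(60, 10, 15)     # hypothetical per-ROI mean distances
untrained = rng.normal(45, 10, 15)

# Rank-based test, appropriate for small N
res = stats.mannwhitneyu(trained, untrained, alternative="two-sided")

# Percentile bootstrap CI for the difference in group means
diffs = [rng.choice(trained, trained.size).mean()
         - rng.choice(untrained, untrained.size).mean()
         for _ in range(5000)]
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
```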
## Maintenance Items
- [ ] Investigate missing Machine 139 data (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs the full pipeline from DB extraction; consider deprecating it
## Discovered During Work
(Add new items here as they come up during analysis)