Initial commit: organized project structure for student handoff

Reorganized flat 41-file directory into structured layout with:
- scripts/ for Python analysis code with shared config.py
- notebooks/ for Jupyter analysis notebooks
- data/ split into raw/, metadata/, processed/
- docs/ with analysis summary, experimental design, and bimodal hypothesis tutorial
- tasks/ with todo checklist and lessons learned
- Comprehensive README, PLANNING.md, and .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 16:08:36 +00:00 · e7e4db264d
commit e7e4db264d
27 changed files with 3105 additions and 0 deletions
--- a/PLANNING.md
+++ b/PLANNING.md
@@ -0,0 +1,45 @@
+# Planning & Architecture
+
+## Project Overview
+
+Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.
+
+## Architecture
+
+**Pipeline-based**: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.
+
+**Stack**: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.
+
+## Code Conventions
+
+- **PEP8** formatting, Google-style docstrings
+- **Type hints** on function signatures
+- **Time units**: milliseconds in all data (DB stores ms, barrier CSV stores seconds but is converted to ms on load)
+- **Distance units**: pixels (no conversion to physical units)
+- **Path management**: All scripts import from `scripts/config.py` for consistent paths
+- **Notebooks**: Use `Path("..")` relative paths from `notebooks/` directory
+
+## Key Caveats
+
+- **Pseudoreplication**: True N = 18 ROIs per group (not 230K data points). Tests that treat individual data points as independent samples overstate significance.
+- **Tiny effect sizes**: Cohen's d ~ 0.09 for distance, ~0.14 for velocity. These differences reach statistical significance only because of the massive sample size.
+- **Missing data**: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
+- **Machine name type mismatch**: Metadata stores the machine name as an int (76), while the barrier CSV stores it zero-padded (076). Convert both to the same string form before matching.
+
+## Directory Structure
+
+```
+tracking/
+├── data/raw/          # SQLite DBs (gitignored)
+├── data/metadata/     # Small CSVs (tracked)
+├── data/processed/    # Large generated CSVs (gitignored)
+├── scripts/           # Python scripts with config.py imports
+├── notebooks/         # Jupyter analysis notebooks
+├── figures/           # Generated plots (gitignored)
+├── docs/              # Scientific documentation
+└── tasks/             # Task tracking
+```
+
+## Next Direction
+
+The primary next step is testing the **bimodal hypothesis** - see `docs/bimodal_hypothesis.md` for the full plan. The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, diluting the signal.
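
As an illustration of the conventions and caveats listed in PLANNING.md above (not code from this commit), here is a minimal standard-library sketch of three of them: normalizing machine names to a shared string key, converting barrier times from seconds to milliseconds, and collapsing data points to one value per ROI before computing an effect size. All function names are hypothetical, and the three-character zero padding is an assumption based on the `076` example; the actual scripts may handle these differently.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_machine_name(value) -> str:
    """Return a zero-padded machine key so 76 and "076" compare equal."""
    # Assumption: machine names are numeric and at most three digits wide.
    return str(int(str(value).strip())).zfill(3)


def seconds_to_ms(seconds: float) -> int:
    """Convert a barrier-opening time from seconds to milliseconds."""
    return int(round(seconds * 1000))


def per_roi_means(rows):
    """Collapse (roi_id, value) rows to a single mean per ROI.

    Working on these means keeps N equal to the number of ROIs,
    rather than the number of tracked data points.
    """
    by_roi = defaultdict(list)
    for roi_id, value in rows:
        by_roi[roi_id].append(value)
    return {roi_id: mean(values) for roi_id, values in by_roi.items()}


def cohens_d(a, b) -> float:
    """Cohen's d for two groups, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```

In use, the per-ROI means for the trained and untrained groups would be passed to `cohens_d` (or to a two-sample test), so the comparison rests on the 18 ROIs per group instead of the individual data points.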