# Planning & Architecture

## Project Overview

Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.

## Architecture

**Pipeline-based**: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.

**Stack**: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.

## Code Conventions

- **PEP8** formatting, Google-style docstrings
- **Type hints** on function signatures
- **Time units**: milliseconds in all data (the DB stores ms; the barrier CSV stores seconds and is converted to ms on load)
- **Distance units**: pixels (no conversion to physical units)
- **Path management**: All scripts import from `scripts/config.py` for consistent paths
- **Notebooks**: Use `Path("..")` relative paths from the `notebooks/` directory

## Key Caveats

- **Pseudoreplication**: True N = 18 ROIs per group (not 230K data points). Tests run on individual data points treat repeated samples from the same ROI as independent, so their significance is inflated.
- **Tiny effect sizes**: Cohen's d ~ 0.09 for distance, ~0.14 for velocity. Statistically significant only because of the massive sample size.
- **Missing data**: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
- **Machine name type mismatch**: Metadata stores the machine name as an int (76), while the barrier CSV stores it zero-padded (076). Convert both to a common string form before matching.

## Directory Structure

```
tracking/
├── data/raw/          # SQLite DBs (gitignored)
├── data/metadata/     # Small CSVs (tracked)
├── data/processed/    # Large generated CSVs (gitignored)
├── scripts/           # Python scripts with config.py imports
├── notebooks/         # Jupyter analysis notebooks
├── figures/           # Generated plots (gitignored)
├── docs/              # Scientific documentation
└── tasks/             # Task tracking
```

## Next Direction

The primary next step is testing the **bimodal hypothesis**; see `docs/bimodal_hypothesis.md` for the full plan.
The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, diluting the signal.
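
The time-unit and machine-name conventions listed under Code Conventions and Key Caveats can be sketched as a small normalization step. This is a minimal illustration, not the project's loader: the helper names `normalize_machine_name` and `seconds_to_ms` are hypothetical, and the three-character zero-padded width is an assumption based on the `076` example.

```python
def normalize_machine_name(value, width=3):
    """Return a zero-padded string key for a machine name.

    Assumption: names are numeric and padded to three characters, as in
    the "076" example; adjust `width` if the real data differs.
    """
    return str(int(value)).zfill(width)


def seconds_to_ms(seconds):
    """Convert a barrier opening time from seconds to integer milliseconds."""
    return int(round(float(seconds) * 1000))


# Hypothetical values: an int from the metadata, a string from the barrier CSV.
assert normalize_machine_name(76) == normalize_machine_name("076") == "076"
print(seconds_to_ms("12.5"))  # 12500
```

Normalizing both sources to the same string key before joining avoids silent mismatches between `76` and `076`.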
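
Both the pseudoreplication caveat and the bimodal hypothesis point to the ROI, not the individual data point, as the unit of analysis. A minimal sketch of collapsing tracking rows to one mean per ROI follows; the row layout `(machine, roi, distance)` and the function name `per_roi_means` are hypothetical, and the numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def per_roi_means(rows):
    """Collapse (machine, roi, distance) rows to one mean distance per ROI."""
    grouped = defaultdict(list)
    for machine, roi, distance in rows:
        grouped[(machine, roi)].append(distance)
    return {key: mean(values) for key, values in grouped.items()}


# Made-up rows: many samples per ROI, but only two independent units.
rows = [
    ("076", 1, 100.0),
    ("076", 1, 120.0),
    ("076", 2, 80.0),
    ("076", 2, 90.0),
    ("076", 2, 100.0),
]
roi_means = per_roi_means(rows)
print(roi_means)  # {('076', 1): 110.0, ('076', 2): 90.0}
```

Group comparisons, and any check for two subpopulations within the trained group, would then run on these per-ROI values (18 per group in this dataset) instead of the raw samples. With pandas, the same reduction is a `groupby` on the machine and ROI columns followed by `mean()`.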