cupido/PLANNING.md

# Planning & Architecture

## Project Overview

Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.

## Architecture

**Pipeline-based**: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.

**Stack**: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.
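
The distance-calculation and time-alignment stages can be illustrated with a minimal sketch. It assumes two tracked flies per ROI and a known barrier-opening time in ms; the function names and sample frames are hypothetical and are not taken from the project scripts.

```python
import math


def inter_fly_distance_px(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Euclidean distance in pixels between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def align_to_barrier(t_ms: int, barrier_open_ms: int) -> int:
    """Time relative to barrier opening in ms (negative = before opening)."""
    return t_ms - barrier_open_ms


# Hypothetical frames for one ROI: (t_ms, fly A position, fly B position)
frames = [
    (59_000, (10.0, 20.0), (13.0, 24.0)),
    (61_000, (10.0, 20.0), (10.0, 26.0)),
]
barrier_open_ms = 60_000  # assumed opening time for this machine

aligned = [
    (align_to_barrier(t, barrier_open_ms), inter_fly_distance_px(a, b))
    for t, a, b in frames
]
```

Each entry of `aligned` pairs a barrier-relative timestamp (ms) with an inter-fly distance (pixels), matching the unit conventions listed under Code Conventions.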

## Code Conventions

- **PEP8** formatting, Google-style docstrings
- **Type hints** on function signatures
- **Time units**: milliseconds in all data (DB stores ms, barrier CSV stores seconds but is converted to ms on load)
- **Distance units**: pixels (no conversion to physical units)
- **Path management**: All scripts import from `scripts/config.py` for consistent paths
- **Notebooks**: Use `Path("..")` relative paths from `notebooks/` directory
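
The seconds-to-milliseconds conversion on load could look roughly like the sketch below. The column names `machine_name` and `opening_time_s` and the helper names are assumptions, since the real barrier CSV schema is not shown here. The loader also normalizes machine names to a string key without leading zeros, which is one possible way to handle the machine-name caveat listed under Key Caveats.

```python
import csv
import io

MS_PER_S = 1000


def machine_key(value) -> str:
    """Normalize a machine name (76, "76", "076") to one string key.

    Assumption: machine names are purely numeric, so leading zeros can be dropped.
    """
    return str(int(str(value).strip()))


def load_barrier_times_ms(csv_file) -> dict[str, int]:
    """Map machine key -> barrier opening time, converted from s to ms."""
    times = {}
    for row in csv.DictReader(csv_file):
        # Column names are hypothetical; adjust to the real CSV header.
        seconds = float(row["opening_time_s"])
        times[machine_key(row["machine_name"])] = round(seconds * MS_PER_S)
    return times


# Small in-memory stand-in for the barrier CSV
sample = io.StringIO("machine_name,opening_time_s\n076,1800\n102,1795.5\n")
barrier_ms = load_barrier_times_ms(sample)
```

Converting once at load time keeps every downstream timestamp in milliseconds, so later stages never need to know the CSV's original unit.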

## Key Caveats

- **Pseudoreplication**: True N = 18 ROIs per group (not 230K data points). Tests that treat individual data points as independent samples overstate significance.
- **Tiny effect sizes**: Cohen's d ~0.09 for distance, ~0.14 for velocity. The differences are statistically significant only because of the massive sample size.
- **Missing data**: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
- **Machine name type mismatch**: Metadata stores the machine name as an int (76), while the barrier CSV writes it zero-padded (076). Convert both to a common string form before matching.
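
One way to address the pseudoreplication caveat is to collapse the data to one value per ROI before computing effect sizes or running tests. The sketch below uses only the standard library and made-up numbers. The row layout `(group, roi_id, value)` and both function names are hypothetical, and the pooled-standard-deviation form of Cohen's d is an assumption about which variant the analysis uses.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def roi_means(rows):
    """Collapse per-frame rows (group, roi_id, value) to one mean per ROI."""
    per_roi = defaultdict(list)
    for group, roi_id, value in rows:
        per_roi[(group, roi_id)].append(value)
    by_group = defaultdict(list)
    for (group, _roi_id), values in per_roi.items():
        by_group[group].append(mean(values))
    return dict(by_group)


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Made-up distances (pixels); roi_id combines machine and ROI number
rows = [
    ("trained", "076_1", 10), ("trained", "076_1", 12),
    ("trained", "076_2", 14), ("trained", "076_2", 16),
    ("untrained", "102_1", 8), ("untrained", "102_1", 10),
    ("untrained", "102_2", 12), ("untrained", "102_2", 14),
]
groups = roi_means(rows)
d = cohens_d(groups["trained"], groups["untrained"])
```

With the real data, each group would contribute 18 ROI means, and a two-sample test such as `scipy.stats.ttest_ind` or `scipy.stats.mannwhitneyu` would then run on N = 18 per group rather than on 230K frames.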

## Directory Structure

```
tracking/
├── data/metadata/     # Small hand-curated CSVs (tracked in git)
├── data/processed/    # Large generated CSVs (gitignored)
├── data/logs/         # Tracker logs (gitignored)
├── scripts/           # Python scripts with config.py imports
├── notebooks/         # Jupyter analysis notebooks
├── figures/           # Generated plots (gitignored)
├── docs/              # Scientific documentation
└── tasks/             # Task tracking

# All bulky data lives outside the repo at /mnt/data/projects/cupido/:
#   tracked/                       # SQLite tracking DBs
#   targets/                       # Target-point JSON sidecars
#   all_video_info_merged.{xlsx,tsv}  # Metadata spreadsheet
```
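
For orientation, a path module in the spirit of `scripts/config.py` might be sketched as follows. The actual contents of `config.py` are not shown in this document, so the function name `build_paths` and the dictionary keys are hypothetical. Only the directory names come from the layout above.

```python
from pathlib import Path

# External data root taken from the directory listing above
DEFAULT_DATA_ROOT = Path("/mnt/data/projects/cupido")


def build_paths(repo_root, data_root=DEFAULT_DATA_ROOT) -> dict[str, Path]:
    """Return the project's main directories as Path objects."""
    repo_root = Path(repo_root)
    data_root = Path(data_root)
    return {
        "metadata": repo_root / "data" / "metadata",
        "processed": repo_root / "data" / "processed",
        "logs": repo_root / "data" / "logs",
        "figures": repo_root / "figures",
        "tracked_dbs": data_root / "tracked",
        "targets": data_root / "targets",
    }


paths = build_paths("tracking")
```

Keeping every location in one place means a script or notebook only needs to change the root arguments when the data moves.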

## Next Direction

The primary next step is testing the **bimodal hypothesis**; see `docs/bimodal_hypothesis.md` for the full plan. The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, which dilutes the group-level signal.