cupido/PLANNING.md
Giorgio e7e4db264d Initial commit: organized project structure for student handoff
Reorganized flat 41-file directory into structured layout with:
- scripts/ for Python analysis code with shared config.py
- notebooks/ for Jupyter analysis notebooks
- data/ split into raw/, metadata/, processed/
- docs/ with analysis summary, experimental design, and bimodal hypothesis tutorial
- tasks/ with todo checklist and lessons learned
- Comprehensive README, PLANNING.md, and .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 16:08:36 +00:00

Planning & Architecture

Project Overview

Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.

Architecture

Pipeline-based: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.

Stack: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.
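
The distance-calculation and time-alignment stages above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the `Sample` layout, the assumption of two tracked flies per ROI, and the function names are all hypothetical. The real scripts presumably work on pandas DataFrames loaded from the SQLite DBs.

```python
import math
from typing import NamedTuple


class Sample(NamedTuple):
    """One tracking sample for a pair of flies in one ROI (hypothetical layout)."""

    t_ms: int   # time since recording start, milliseconds
    x1: float   # fly 1 position, pixels
    y1: float
    x2: float   # fly 2 position, pixels
    y2: float


def inter_fly_distance(s: Sample) -> float:
    """Return the Euclidean distance between the two flies, in pixels."""
    return math.hypot(s.x1 - s.x2, s.y1 - s.y2)


def align_to_barrier(samples: list[Sample], barrier_open_ms: int) -> list[tuple[int, float]]:
    """Return (time relative to barrier opening in ms, distance in px) pairs."""
    return [(s.t_ms - barrier_open_ms, inter_fly_distance(s)) for s in samples]


# Made-up example values, not project data.
samples = [
    Sample(t_ms=59_000, x1=0.0, y1=0.0, x2=3.0, y2=4.0),
    Sample(t_ms=61_000, x1=10.0, y1=10.0, x2=10.0, y2=22.0),
]
aligned = align_to_barrier(samples, barrier_open_ms=60_000)
print(aligned)  # [(-1000, 5.0), (1000, 12.0)]
```

Negative times fall before the barrier opens and positive times after, so ROIs from machines with different opening times can be compared on a common axis.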

Code Conventions

  • PEP8 formatting, Google-style docstrings
  • Type hints on function signatures
  • Time units: milliseconds in all data (DB stores ms, barrier CSV stores seconds but is converted to ms on load)
  • Distance units: pixels (no conversion to physical units)
  • Path management: All scripts import from scripts/config.py for consistent paths
  • Notebooks: Use Path("..") relative paths from notebooks/ directory
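
As an illustration of these conventions, the sketch below shows a type-hinted helper with a Google-style docstring for the seconds-to-milliseconds conversion, plus a small path builder in the spirit of scripts/config.py. The function and constant names are assumptions made for this example; the real config.py may be organized differently.

```python
from pathlib import Path

MS_PER_SECOND = 1000


def data_paths(project_root: Path) -> dict[str, Path]:
    """Build the data directories described in this document.

    Args:
        project_root: Repository root (from a notebook, ``Path("..")``).

    Returns:
        Mapping from a short name to the corresponding directory.
    """
    data = project_root / "data"
    return {
        "raw": data / "raw",
        "metadata": data / "metadata",
        "processed": data / "processed",
    }


def seconds_to_ms(seconds: float) -> int:
    """Convert a barrier opening time from seconds to milliseconds.

    Args:
        seconds: Opening time as stored in the barrier CSV.

    Returns:
        The same time in milliseconds, rounded to the nearest integer.
    """
    return round(seconds * MS_PER_SECOND)


paths = data_paths(Path(".."))
print(paths["metadata"].as_posix())  # ../data/metadata
print(seconds_to_ms(62.5))           # 62500
```

Converting once at load time keeps every downstream comparison in milliseconds, which avoids silently mixing units when barrier times are subtracted from tracking timestamps.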

Key Caveats

  • Pseudoreplication: The true N is 18 ROIs per group, not the ~230K individual data points. Tests that treat data points from the same ROI as independent overstate significance.
  • Tiny effect sizes: Cohen's d ~ 0.09 for distance, ~0.14 for velocity. These differences reach statistical significance only because of the massive point-level sample size.
  • Missing data: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
  • Machine name type mismatch: Metadata stores the machine name as an int (76), while the barrier CSV writes it zero-padded (076). Convert both to the same string form before matching.
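
Two of the caveats above lend themselves to a short sketch: collapsing data points to one value per ROI before computing an effect size, and normalizing machine names before a join. This is a hedged illustration in plain Python with made-up numbers. The helper names and the three-character zero-padded key are assumptions, and the pooled-standard-deviation form of Cohen's d may differ from the variant used in the project's analysis.

```python
from statistics import mean, stdev
from typing import Union


def normalize_machine(name: Union[int, str], width: int = 3) -> str:
    """Return a zero-padded string key, so 76, "76" and "076" all match."""
    return str(name).strip().zfill(width)


def roi_means(points: list[tuple[str, float]]) -> list[float]:
    """Collapse (roi_id, value) data points to one mean per ROI."""
    by_roi: dict[str, list[float]] = {}
    for roi_id, value in points:
        by_roi.setdefault(roi_id, []).append(value)
    return [mean(values) for values in by_roi.values()]


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Made-up distances (pixels); each ROI contributes several data points.
trained = [("r1", 10.0), ("r1", 12.0), ("r2", 14.0), ("r2", 16.0)]
untrained = [("r3", 9.0), ("r3", 11.0), ("r4", 13.0), ("r4", 15.0)]

trained_roi = roi_means(trained)      # [11.0, 15.0]
untrained_roi = roi_means(untrained)  # [10.0, 14.0]
d = cohens_d(trained_roi, untrained_roi)

print(normalize_machine(76) == normalize_machine("076"))  # True
```

Any significance test would then be run on the per-ROI values (N = number of ROIs), not on the raw data points.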

Directory Structure

tracking/
├── data/raw/          # SQLite DBs (gitignored)
├── data/metadata/     # Small CSVs (tracked)
├── data/processed/    # Large generated CSVs (gitignored)
├── scripts/           # Python scripts with config.py imports
├── notebooks/         # Jupyter analysis notebooks
├── figures/           # Generated plots (gitignored)
├── docs/              # Scientific documentation
└── tasks/             # Task tracking

Next Direction

The primary next step is testing the bimodal hypothesis; see docs/bimodal_hypothesis.md for the full plan. The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, and averaging over both dilutes the signal.