Deleted the 5 stale pre-pipeline tracking DBs and the data/raw/ directory. Dropped DATA_RAW from config.py; build_video_inventory now scans TRACKING_OUTPUT_DIR for already-tracked sessions. Notebooks no longer import DATA_RAW. README, PLANNING and todo updated to reflect that the repo holds only code + small curated metadata, never bulky DBs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Planning & Architecture
Project Overview
Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.
Architecture
Pipeline-based: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.
Stack: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.
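As a rough illustration of the distance-calculation and time-alignment stages, the sketch below computes an inter-fly distance in pixels and re-expresses timestamps relative to barrier opening. The function names, the sample tuples, and the barrier time are hypothetical, and the code uses only the standard library for brevity; the project's actual scripts may structure this differently (for example with pandas).

```python
import math


def inter_fly_distance(p1, p2):
    """Euclidean distance in pixels between two (x, y) positions."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])


def align_to_barrier(t_ms, barrier_open_ms):
    """Shift a timestamp (ms) so that 0 marks the barrier opening."""
    return t_ms - barrier_open_ms


# Hypothetical samples for one ROI: (t_ms, fly A position, fly B position).
samples = [
    (59_000, (10.0, 20.0), (13.0, 24.0)),
    (60_000, (10.0, 20.0), (10.0, 20.0)),
    (61_000, (0.0, 0.0), (6.0, 8.0)),
]
barrier_open_ms = 60_000  # assumed value, already in milliseconds

# One (aligned time in ms, distance in pixels) pair per sample.
aligned = [
    (align_to_barrier(t, barrier_open_ms), inter_fly_distance(a, b))
    for t, a, b in samples
]
```

Negative aligned times are samples recorded before the barrier opened, which makes sessions with different opening times directly comparable.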
Code Conventions
- PEP8 formatting, Google-style docstrings
- Type hints on function signatures
- Time units: milliseconds in all data (DB stores ms, barrier CSV stores seconds but is converted to ms on load)
- Distance units: pixels (no conversion to physical units)
- Path management: All scripts import from scripts/config.py for consistent paths
- Notebooks: Use Path("..") relative paths from the notebooks/ directory
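The time-unit convention can be sketched as a small loader that converts barrier opening times from seconds to milliseconds at load time, so everything downstream works in ms. The column names (machine_name, opening_time_s) and the example rows are assumptions made for illustration, not the real barrier CSV schema.

```python
import csv
import io

MS_PER_SECOND = 1000


def load_barrier_times(csv_file):
    """Return {machine_name: opening_time_ms} from a barrier CSV.

    The column names used here are hypothetical.
    """
    reader = csv.DictReader(csv_file)
    return {
        # csv yields strings, so the machine name keeps any zero padding.
        row["machine_name"]: round(float(row["opening_time_s"]) * MS_PER_SECOND)
        for row in reader
    }


# Hypothetical file content standing in for the real barrier CSV.
example = io.StringIO("machine_name,opening_time_s\n076,312.5\n145,298\n")
barrier_ms = load_barrier_times(example)
```

Converting once at the loading boundary means no later step has to remember which source used which unit.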
Key Caveats
- Pseudoreplication: True N = 18 ROIs per group (not 230K data points). Statistical tests run on individual data points treat correlated samples from the same ROI as independent, so their significance is inflated.
- Tiny effect sizes: Cohen's d ~ 0.09 for distance, ~0.14 for velocity. Statistically significant only due to massive sample size.
- Missing data: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
- Machine name type mismatch: Metadata stores the machine name as an int (76), while the barrier CSV writes it zero-padded (076). Convert both to strings in the same form (e.g. zero-padded) before matching.
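The caveats above suggest three small helpers, sketched here under stated assumptions: one that normalizes machine names to a zero-padded string (the width of 3 is inferred from the 076 example), one that collapses per-frame samples to a single mean per ROI so that N reflects ROIs rather than data points, and one that computes Cohen's d with a pooled standard deviation. All names and the example values are hypothetical, and this is not necessarily how the project's scripts compute these quantities.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def normalize_machine(name, width=3):
    """Render a machine name as a zero-padded string, e.g. 76 -> '076'."""
    return str(int(name)).zfill(width)


def per_roi_means(rows):
    """Collapse (roi_id, value) samples to one mean value per ROI."""
    by_roi = defaultdict(list)
    for roi_id, value in rows:
        by_roi[roi_id].append(value)
    return {roi: mean(values) for roi, values in by_roi.items()}


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-frame distances (pixels) keyed by (machine, ROI).
rows = [
    ((normalize_machine(76), 1), 1.0),
    ((normalize_machine("076"), 1), 3.0),
    ((normalize_machine(76), 2), 4.0),
    ((normalize_machine(76), 2), 6.0),
]
roi_means = per_roi_means(rows)

# Effect size computed on per-ROI values (made-up numbers), not raw frames.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Running group comparisons on the per-ROI means keeps the sample size at the number of ROIs, which is the unit of replication named in the pseudoreplication caveat.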
Directory Structure
tracking/
├── data/metadata/ # Small hand-curated CSVs (tracked in git)
├── data/processed/ # Large generated CSVs (gitignored)
├── data/logs/ # Tracker logs (gitignored)
├── scripts/ # Python scripts with config.py imports
├── notebooks/ # Jupyter analysis notebooks
├── figures/ # Generated plots (gitignored)
├── docs/ # Scientific documentation
└── tasks/ # Task tracking
# All bulky data lives outside the repo at /mnt/data/projects/cupido/:
# tracked/ # SQLite tracking DBs
# targets/ # Target-point JSON sidecars
# all_video_info_merged.{xlsx,tsv} # Metadata spreadsheet
Next Direction
The primary next step is testing the bimodal hypothesis - see docs/bimodal_hypothesis.md for the full plan. The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, diluting the signal.
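The dilution argument can be made concrete with a toy calculation. The numbers below are invented purely for illustration (they are not project results): if only a fraction of the trained ROIs are learners, the group-level shift shrinks to that fraction of the learners' own shift.

```python
from statistics import mean

# Hypothetical per-ROI mean distances in pixels (made-up values).
untrained_baseline = 100.0
learner_value = 80.0   # assumed: learners differ from baseline by -20 px
n_learners = 6         # assumed split of the 18 trained ROIs
n_non_learners = 12

# Non-learners are assumed to behave like the untrained baseline.
trained = [learner_value] * n_learners + [untrained_baseline] * n_non_learners

# Shift seen when the trained group is analysed as a whole.
observed_shift = mean(trained) - untrained_baseline

# Shift among learners only, and the fraction of ROIs that are learners.
learner_shift = learner_value - untrained_baseline
learner_fraction = n_learners / len(trained)

# The aggregate shift equals learner_fraction * learner_shift.
expected_shift = learner_fraction * learner_shift
```

Under these assumed numbers a 20-pixel effect in learners appears as a shift of roughly 6.7 pixels in the aggregate, which is the kind of attenuation the bimodal hypothesis proposes.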