Initial commit: organized project structure for student handoff

Reorganized flat 41-file directory into structured layout with:
- scripts/ for Python analysis code with shared config.py
- notebooks/ for Jupyter analysis notebooks
- data/ split into raw/, metadata/, processed/
- docs/ with analysis summary, experimental design, and bimodal hypothesis tutorial
- tasks/ with todo checklist and lessons learned
- Comprehensive README, PLANNING.md, and .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Giorgio Gilestro 2026-03-05 16:08:36 +00:00
commit e7e4db264d
27 changed files with 3105 additions and 0 deletions

.gitignore Normal file

@@ -0,0 +1,32 @@
# Large data files (reproducible from raw DBs)
data/raw/*.db
data/processed/*.csv
# Generated figures (reproducible from scripts)
figures/*.png
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.venv/
venv/
env/
*.egg-info/
dist/
build/
# Jupyter
.ipynb_checkpoints/
# IDE
.vscode/
.idea/
# OS
.DS_Store
Thumbs.db
# Claude Code
.claude/

PLANNING.md Normal file

@@ -0,0 +1,45 @@
# Planning & Architecture
## Project Overview
Drosophila behavioral tracking analysis for the Cupido project. Compares social interaction patterns (inter-fly distance, velocity) between trained and untrained flies using a barrier-opening assay recorded on ethoscope platforms.
## Architecture
**Pipeline-based**: Raw SQLite DBs -> ROI extraction -> distance calculation -> time alignment -> statistical analysis / visualization.
**Stack**: Python 3.10+, pandas, scipy, scikit-learn, matplotlib/seaborn, Jupyter.
## Code Conventions
- **PEP8** formatting, Google-style docstrings
- **Type hints** on function signatures
- **Time units**: milliseconds in all data (DB stores ms, barrier CSV stores seconds but is converted to ms on load)
- **Distance units**: pixels (no conversion to physical units)
- **Path management**: All scripts import from `scripts/config.py` for consistent paths
- **Notebooks**: Use `Path("..")` relative paths from `notebooks/` directory
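A minimal sketch of what the shared `scripts/config.py` could look like under these conventions (constant names here are illustrative; the real file may differ):

```python
# scripts/config.py -- shared path constants (illustrative sketch;
# actual names in the repo may differ)
from pathlib import Path

# Project root is the parent of scripts/
ROOT = Path(__file__).resolve().parents[1]

DATA_RAW = ROOT / "data" / "raw"              # SQLite tracking DBs (gitignored)
DATA_METADATA = ROOT / "data" / "metadata"    # small tracked CSVs
DATA_PROCESSED = ROOT / "data" / "processed"  # large generated CSVs (gitignored)
FIGURES = ROOT / "figures"                    # generated plots (gitignored)

MS_PER_SECOND = 1000  # all in-memory times are milliseconds
```

Scripts then write `from config import DATA_PROCESSED` instead of hard-coding relative paths, which is why the notebooks (which do not import config.py) fall back on `Path("..")`.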
## Key Caveats
- **Pseudoreplication**: True N = 18 ROIs per group, not the ~230K individual data points. Tests run on raw frames treat highly correlated measurements as independent and grossly inflate significance.
- **Tiny effect sizes**: Cohen's d ~ 0.09 for distance, ~0.14 for velocity. Statistically significant only due to massive sample size.
- **Missing data**: Machine 139 (6 ROIs) has metadata but no tracking DB or barrier opening time.
- **Machine name type mismatch**: Metadata stores the machine name as an int (76); the barrier CSV stores it as a zero-padded string (076). Convert both to zero-padded strings before matching.
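The last two caveats bite at merge time. A hedged sketch of a join that handles both the zero-padding mismatch and the seconds-to-milliseconds conversion (column names follow the metadata and barrier CSVs in `data/metadata/`; the inline rows here are illustrative):

```python
# Sketch: merge metadata with barrier opening times, normalizing the
# machine-name representation (int 76 vs zero-padded "076") and
# converting opening_time from seconds to milliseconds.
import io
import pandas as pd

metadata = pd.read_csv(io.StringIO(
    "machine_name,ROI,group\n"
    "76,1,trained\n"
    "145,4,untrained\n"
))
barrier = pd.read_csv(
    io.StringIO(
        "machine,date,opening_time\n"
        "076,16-03-10,52\n"
        "145,16-03-27,42\n"
    ),
    dtype={"machine": str},  # keep the leading zero
)

# Normalize both sides to a zero-padded 3-char string before joining.
# (The real data also needs the session time in the key, since
# machines 076 and 145 each have two recordings.)
metadata["machine"] = metadata["machine_name"].astype(str).str.zfill(3)
merged = metadata.merge(barrier, on="machine", how="left")

# Barrier CSV stores seconds; everything downstream uses milliseconds.
merged["opening_time_ms"] = merged["opening_time"] * 1000
```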
## Directory Structure
```
tracking/
├── data/raw/ # SQLite DBs (gitignored)
├── data/metadata/ # Small CSVs (tracked)
├── data/processed/ # Large generated CSVs (gitignored)
├── scripts/ # Python scripts with config.py imports
├── notebooks/ # Jupyter analysis notebooks
├── figures/ # Generated plots (gitignored)
├── docs/ # Scientific documentation
└── tasks/ # Task tracking
```
## Next Direction
The primary next step is testing the **bimodal hypothesis** - see `docs/bimodal_hypothesis.md` for the full plan. The core idea: aggregate analysis fails because the trained group likely contains both true learners and non-learners, diluting the signal.

README.md Normal file

@@ -0,0 +1,111 @@
# Cupido: Drosophila Social Interaction Tracking
Behavioral analysis of trained vs untrained *Drosophila melanogaster* in a barrier-opening social interaction assay. Part of the Cupido project studying learned social behaviors.
## Quick Start
```bash
# Clone the repository
git clone ssh://git@git.lab.gilest.ro:222/lab/cupido.git
cd cupido
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Get the data files (not in git - ask lab for copies)
# Place .db files in data/raw/
# Place large .csv files in data/processed/
# Run the main analysis notebook
jupyter notebook notebooks/flies_analysis_simple.ipynb
```
## Project Overview
### The Experiment
Pairs of flies are placed in chambers (ROIs) separated by a physical barrier. After a configurable delay, the barrier is removed, allowing flies to interact. We track the distance between flies over time to compare social approach behavior between trained (socially experienced) and untrained (naive) groups.
- **3 ethoscope machines**, 5 recording sessions, 6 ROIs each = 30 ROIs with data
- **18 trained ROIs, 18 untrained ROIs** (6 from Machine 139 have no tracking data)
- See `docs/experimental_design.md` for full details
### Current Findings
Aggregate analysis shows statistically significant but **tiny** differences:
- Post-opening distance: Cohen's d = 0.09 (96% distribution overlap)
- Max velocity (50-200s): Cohen's d = 0.14
These effect sizes are inflated by pseudoreplication (230K data points from 18 independent ROIs per group).
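One standard remedy, sketched here on synthetic data: collapse each ROI to a single summary value, then test on the 18-vs-18 per-ROI means rather than on raw frames (column names are illustrative):

```python
# Sketch: avoid pseudoreplication by testing on per-ROI means
# (N = 18 per group), not on ~230K individual frames.
# All data below is synthetic.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
frames = []
for group, offset in [("trained", 0.0), ("untrained", 0.5)]:
    for roi in range(18):
        # ~1000 frames per ROI; ROI-level variation dominates
        roi_mean = 10 + offset + rng.normal(0, 1)
        dist = rng.normal(roi_mean, 0.5, size=1000)
        frames.append(pd.DataFrame(
            {"group": group, "roi_id": f"{group}_{roi}", "distance": dist}
        ))
df = pd.concat(frames, ignore_index=True)

# Collapse to one value per independent unit (ROI), then compare groups.
per_roi = df.groupby(["group", "roi_id"])["distance"].mean().reset_index()
trained = per_roi.loc[per_roi["group"] == "trained", "distance"]
untrained = per_roi.loc[per_roi["group"] == "untrained", "distance"]
t, p = stats.ttest_ind(trained, untrained)
print(f"t = {t:.2f}, p = {p:.3f}  (df from 18+18 ROIs, not 230K frames)")
```

The same comparison run on `df["distance"]` directly would report a far smaller p-value for the same underlying effect, which is exactly the inflation described above.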
### Next Direction: Bimodal Hypothesis
The key insight: not all "trained" flies may have actually learned. The trained group likely contains **true learners** (showing distinct behavior) and **non-learners** (indistinguishable from untrained). Testing this requires per-ROI analysis and bimodality testing.
**Read `docs/bimodal_hypothesis.md` for the detailed analysis plan and code sketches.**
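A minimal sketch of one such bimodality test: fit 1- and 2-component Gaussian mixtures to per-ROI summary scores and compare BIC. The scores below are synthetic stand-ins; the actual analysis plan lives in `docs/bimodal_hypothesis.md`.

```python
# Sketch: check whether trained-group per-ROI scores look bimodal by
# comparing BIC for 1- vs 2-component Gaussian mixtures.
# Scores are synthetic stand-ins for per-ROI summary values.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Hypothetical trained group: half "learners", half "non-learners"
scores = np.concatenate([
    rng.normal(5.0, 0.5, size=9),  # learners: shorter inter-fly distance
    rng.normal(9.0, 0.5, size=9),  # non-learners: like untrained
]).reshape(-1, 1)

bic = {}
for k in (1, 2):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(scores)
    bic[k] = gmm.bic(scores)

# Lower BIC wins; a clearly lower 2-component BIC supports bimodality.
print({k: round(v, 1) for k, v in bic.items()})
```

With only 18 ROIs per group, any mixture fit will be noisy, so treat this as exploratory evidence rather than a definitive test.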
## Folder Structure
```
tracking/
├── README.md # This file
├── PLANNING.md # Architecture & conventions
├── requirements.txt # Python dependencies
├── data/
│ ├── raw/ # SQLite tracking databases (gitignored)
│ ├── metadata/ # Experiment metadata CSVs
│ └── processed/ # Generated analysis CSVs (gitignored)
├── scripts/ # Python analysis scripts
│ ├── config.py # Shared path constants
│ ├── load_roi_data.py # Extract data from DBs
│ ├── calculate_distances.py
│ ├── analyze_distances.py
│ ├── statistical_tests.py
│ ├── ml_classification.py
│ └── plot_*.py # Plotting scripts
├── notebooks/ # Jupyter notebooks
│ ├── flies_analysis_simple.ipynb # Main analysis (use this one)
│ └── flies_analysis.ipynb # Full pipeline from DB extraction
├── figures/ # Generated plots (gitignored)
├── docs/ # Scientific documentation
│ ├── analysis_summary.md
│ ├── bimodal_hypothesis.md
│ └── experimental_design.md
└── tasks/
├── todo.md # Task checklist
└── lessons.md # Pitfalls & patterns
```
## Data Pipeline
```
SQLite DBs (data/raw/)
▼ load_roi_data.py / notebook step 1
ROI CSVs (data/processed/*_roi_data.csv)
▼ notebook steps 2-4
Aligned Distance CSVs (data/processed/*_distances_aligned.csv)
├──▶ Plots (figures/)
├──▶ Statistical tests
└──▶ Identity tracking → Velocity analysis
```
## Key Files
| File | Purpose |
|------|---------|
| `notebooks/flies_analysis_simple.ipynb` | **Start here** - main analysis notebook |
| `docs/bimodal_hypothesis.md` | **Read next** - the new analysis direction |
| `data/metadata/2025_07_15_metadata_fixed.csv` | ROI-to-group mapping |
| `data/metadata/2025_07_15_barrier_opening.csv` | Barrier opening times per machine |
| `scripts/config.py` | Shared path constants for all scripts |
## Requirements
- Python 3.10+
- See `requirements.txt` for packages (numpy, pandas, matplotlib, seaborn, scipy, scikit-learn, jupyter)
- Large data files (~370MB CSVs + ~33MB DBs) must be obtained separately

@@ -0,0 +1,7 @@
machine,date,opening_time
076,16-03-10,52
076,16-31-34,25
145,16-03-27,42
145,16-31-41,20
268,16-32-05,75

@@ -0,0 +1,37 @@
date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
15/07/2025,16-03-10,76,1,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,2,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,5,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,6,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-31-34,76,1,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,6,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-03-27,145,1,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,3,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,4,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,6,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-31-41,145,1,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,3,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,4,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,6,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-52,139,1,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,2,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,3,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,4,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,5,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,6,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-32-05,268,1,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,2,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,3,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,4,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,5,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,6,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72

@@ -0,0 +1,37 @@
date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
15/07/2025,16-03-10,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-31-34,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,4,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,3,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-03-27,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-31-41,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-52,139,6,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,4,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,2,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,5,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,3,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,1,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-32-05,268,6,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,4,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,2,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,5,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,3,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,1,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72

39
data/processed/README.md Normal file
View file

@@ -0,0 +1,39 @@
# Processed Data
Large CSV files generated from the analysis pipeline. All files are gitignored (~370MB total) and can be regenerated.
## Files and Regeneration
| File | Description | Generated By |
|------|-------------|--------------|
| `trained_roi_data.csv` | Raw tracking data for trained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
| `untrained_roi_data.csv` | Raw tracking data for untrained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
| `trained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
| `untrained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
| `trained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
| `untrained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
| `trained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
| `untrained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
| `trained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
| `untrained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
## To Regenerate All Data
Run the full notebook `notebooks/flies_analysis_simple.ipynb` with:
```python
recalculate_distances = True
recalculate_tracking = True
```
**Warning**: Identity tracking and velocity calculations take significant time (~30+ minutes).
## Column Reference
### Distance CSVs (`*_distances_aligned.csv`)
- `machine_name`: Ethoscope machine ID (string)
- `ROI`: ROI number (1-6)
- `aligned_time`: Time in ms relative to barrier opening (0 = opening)
- `distance`: Euclidean distance between flies in pixels
- `n_flies`: Number of flies detected at this time point
- `area_fly1`, `area_fly2`: Bounding box areas (w*h) in pixels^2
- `group`: "trained" or "untrained"
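As a quick sanity check after loading one of these CSVs, you can validate the documented schema and split the data around the barrier opening. A minimal sketch (the toy rows below are illustrative stand-ins, not real data):

```python
import pandas as pd

EXPECTED_COLS = [
    "machine_name", "ROI", "aligned_time", "distance",
    "n_flies", "area_fly1", "area_fly2", "group",
]

# Toy rows standing in for trained_distances_aligned.csv
df = pd.DataFrame({
    "machine_name": ["ETHOSCOPE_076"] * 4,
    "ROI": [1, 1, 1, 1],
    "aligned_time": [-80, -40, 40, 80],      # milliseconds; 0 = barrier opening
    "distance": [160.0, 155.0, 90.0, 70.0],  # pixels
    "n_flies": [2, 2, 2, 2],
    "area_fly1": [120.0, 118.0, 121.0, 119.0],
    "area_fly2": [115.0, 117.0, 116.0, 114.0],
    "group": ["trained"] * 4,
})

missing = set(EXPECTED_COLS) - set(df.columns)
assert not missing, f"missing columns: {missing}"

pre = df[df["aligned_time"] < 0]    # barrier still closed
post = df[df["aligned_time"] > 0]   # flies free to interact
print(len(pre), len(post))  # 2 2
```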

37
data/raw/README.md Normal file
View file

@@ -0,0 +1,37 @@
# Raw Data
SQLite databases containing fly tracking data from ethoscope recordings.
## Files
| File | Machine | Session | Size |
|------|---------|---------|------|
| `2025-07-15_16-03-10_076e...tracking.db` | ETHOSCOPE_076 | 16:03:10 | ~6.5MB |
| `2025-07-15_16-03-27_145b...tracking.db` | ETHOSCOPE_145 | 16:03:27 | ~6.1MB |
| `2025-07-15_16-31-34_076e...tracking.db` | ETHOSCOPE_076 | 16:31:34 | ~6.6MB |
| `2025-07-15_16-31-41_145b...tracking.db` | ETHOSCOPE_145 | 16:31:41 | ~6.6MB |
| `2025-07-15_16-32-05_268...tracking.db` | ETHOSCOPE_268 | 16:32:05 | ~7.0MB |
**Note**: Machine 139 has metadata but no tracking database. See `docs/experimental_design.md`.
## Schema
Each database contains tables `ROI_1` through `ROI_6`:
| Column | Type | Description |
|--------|------|-------------|
| `id` | int | Detection ID within frame |
| `t` | int | Time in **milliseconds** from recording start |
| `x` | float | X position in pixels |
| `y` | float | Y position in pixels |
| `w` | float | Bounding box width in pixels |
| `h` | float | Bounding box height in pixels |
| `phi` | float | Orientation angle |
| `is_inferred` | int | Whether position was inferred (0/1) |
| `has_interacted` | int | Whether interaction detected (0/1) |
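The per-ROI tables can be read with Python's built-in `sqlite3` module plus pandas. A self-contained sketch against an in-memory database with the same schema (the real files live in `data/raw/`, with tables `ROI_1` through `ROI_6`; the inserted rows are made up):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for one of the tracking DBs in data/raw/
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ROI_1 (
        id INTEGER, t INTEGER, x REAL, y REAL,
        w REAL, h REAL, phi REAL,
        is_inferred INTEGER, has_interacted INTEGER
    )
""")
conn.executemany(
    "INSERT INTO ROI_1 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 0,  100.0, 200.0, 12.0, 10.0, 0.5, 0, 0),
        (1, 40, 101.5, 199.0, 12.0, 10.0, 0.6, 0, 0),  # t in ms (~25 fps => ~40 ms/frame)
    ],
)
conn.commit()

roi1 = pd.read_sql("SELECT * FROM ROI_1 ORDER BY t", conn)
conn.close()
print(roi1[["t", "x", "y"]])
```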
## Provenance
Data recorded on 2025-07-15 using ethoscope platform.
Resolution: 1920x1088 @ 25fps, H.264 28q quality.
These files are gitignored (binary, ~33MB total).

68
docs/analysis_summary.md Normal file
View file

@@ -0,0 +1,68 @@
# Analysis Summary: Distance Between Flies
## Key Findings
### Overall Statistics
All distances are in pixels (see conventions in PLANNING.md).
- Trained flies (aligned to barrier opening):
- Data points: 46,613
- Mean distance: 92.72
- Std distance: 111.86
- Untrained flies (aligned to barrier opening):
- Data points: 46,866
- Mean distance: 78.67
- Std distance: 102.86
### Pre-Opening Period (t < 0)
During this period, flies are physically separated by a barrier:
- Trained mean distance: 155.27
- Untrained mean distance: 147.19
### Post-Opening Period (t > 0)
After the barrier is removed and flies can interact:
- Trained mean distance: 72.52
- Untrained mean distance: 63.86
## Statistical Tests
### Pre-opening period comparison:
- Trained mean: 155.27, Untrained mean: 147.19
- T-statistic: 4.4024, P-value: 1.08e-05
- Cohen's d: 0.0654
### Post-opening period comparison:
- Trained mean: 72.52, Untrained mean: 63.86
- T-statistic: 30.4693, P-value: 1.08e-203
- Cohen's d: 0.0914
### Within-group changes:
- Trained flies - Pre vs Post:
- Mean change: -82.75
- T-statistic: 77.2805, P-value: 0.00e+00
- Cohen's d: -0.8313
- Untrained flies - Pre vs Post:
- Mean change: -83.33
- T-statistic: 84.4719, P-value: 0.00e+00
- Cohen's d: -0.9022
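The summary does not show how these effect sizes were computed; the conventional form for two independent groups is Cohen's d with a pooled standard deviation, which can be sketched as follows (the sample values are made up for illustration, not the real distance data):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Illustrative samples only
x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]
print(round(cohens_d(y, x), 4))  # 1.2649
```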
## Interpretation
1. **Pre-opening**: Both groups show high distances as expected since flies are separated by a barrier. There is a statistically significant difference between groups (p < 0.001), but the effect size is very small (Cohen's d = 0.0654).
2. **Post-opening**: Both groups show reduced distances after the barrier is removed, indicating interaction between flies. There is a statistically significant difference between groups (p < 0.001) with a small effect size (Cohen's d = 0.0914).
3. **Within-group changes**: Both trained and untrained flies show significant decreases in distance after the barrier opens (p < 0.001). The effect sizes are large for both groups (Cohen's d = -0.8313 for trained, -0.9022 for untrained).
4. **Group Differences**:
- Trained flies maintain slightly higher distances than untrained flies both before and after barrier opening.
- This could indicate different behavioral patterns between trained and untrained flies, though the effect sizes are small.
## Visualization
The plot "avg_distance_aligned_to_opening.png" shows:
- Blue line: Average distance for trained flies (smoothed)
- Red line: Average distance for untrained flies (smoothed)
- Black dashed line: Time of barrier opening (t=0)
The visualization clearly shows the transition from high distances (pre-opening) to lower distances (post-opening) for both groups, with trained flies consistently maintaining slightly higher distances than untrained flies.

961
docs/bimodal_hypothesis.md Normal file
View file

@@ -0,0 +1,961 @@
# The Bimodal Hypothesis: A Guide to the Next Phase of Analysis
**Project:** Cupido -- Drosophila Social Interaction Tracking
**Date:** 2026-03-05
**Purpose:** Tutorial-level guide for a student taking over the analysis pipeline
---
## Table of Contents
1. [Background: The Assay](#1-background-the-assay)
2. [Why the Current Aggregate Analysis Fails](#2-why-the-current-aggregate-analysis-fails)
3. [The Bimodal Hypothesis](#3-the-bimodal-hypothesis)
4. [The Methodology Shift: Per-ROI Summary Statistics](#4-the-methodology-shift-per-roi-summary-statistics)
5. [Step-by-Step Analysis Code](#5-step-by-step-analysis-code)
6. [Expected Results and Interpretation](#6-expected-results-and-interpretation)
7. [Statistical Considerations](#7-statistical-considerations)
8. [References](#8-references)
---
## 1. Background: The Assay
This project uses ethoscope-based tracking to study social behavior in *Drosophila melanogaster*. The experimental setup is a barrier-opening assay:
- Each recording chamber (ROI) contains two flies separated by a physical barrier.
- At a defined time point, the barrier is removed, allowing the flies to interact freely.
- We measure the **distance between the two flies** over time, along with **velocity** and other behavioral features.
The core experimental question: **Do trained flies (those that have undergone a conditioning protocol) behave differently from untrained flies after the barrier is removed?**
### Data Overview
- **Machines:** 3 ethoscopes (076, 145, 268), each running multiple recording sessions
- **Recording sessions:** 5 total (076 ran twice, 145 ran twice, 268 ran once)
- **ROIs per session:** 6 (containing pairs of flies)
- **Total ROIs:** The metadata lists 36 ROIs across 6 sessions, but **Machine 139 has metadata entries for 6 ROIs with no corresponding tracking data** (no tracking DB, no barrier_opening entry), leaving **30 usable ROIs** -- 15 trained and 15 untrained, split across machines and sessions. Each session holds 3 of each group, but which ROI numbers are trained varies by session (check the metadata).
- **Note on true N:** The metadata file (`data/metadata/2025_07_15_metadata_fixed.csv`) lists all ROIs and their group assignments. The barrier opening file (`data/metadata/2025_07_15_barrier_opening.csv`) lists 5 sessions across 3 machines. When you load the aligned data, count unique (machine_name, ROI) combinations per group to determine your actual N. Based on the existing analysis, the project documentation refers to N=18 per group, but you should verify this directly from your data.
The primary data files are:
| File | Description |
|------|-------------|
| `data/processed/trained_distances_aligned.csv` | Distance data for trained ROIs, aligned to barrier opening |
| `data/processed/untrained_distances_aligned.csv` | Distance data for untrained ROIs, aligned to barrier opening |
| `data/processed/trained_max_velocity.csv` | Max velocity data for trained ROIs |
| `data/processed/untrained_max_velocity.csv` | Max velocity data for untrained ROIs |
Each CSV has columns: `machine_name`, `ROI`, `aligned_time`, `distance` (or `max_velocity`), `n_flies`, `area_fly1`, `area_fly2`, `group`.
The `aligned_time` column is in **milliseconds**, with the barrier opening at `t = 0`. Negative values are pre-opening; positive values are post-opening.
---
## 2. Why the Current Aggregate Analysis Fails
### 2.1 The Numbers
The current analysis (see `docs/analysis_summary.md` and `scripts/statistical_tests.py`) pools all time-point observations from all ROIs into one giant distribution per group and runs a t-test:
| Comparison | Cohen's d | p-value | What it means |
|---|---|---|---|
| Post-opening: trained vs untrained | 0.09 | ~1e-203 | Tiny effect, astronomically "significant" |
| Pre-opening: trained vs untrained | 0.07 | ~1e-5 | Tiny effect before barrier even opens |
| Velocity differences | ~0.14 | very small | Still small |
This is a textbook case of **statistical significance without practical significance**.
### 2.2 What Cohen's d = 0.09 Actually Means
Cohen's d measures the standardized difference between two group means. To put d = 0.09 in perspective:
- **d = 0.2** is conventionally considered a "small" effect.
- **d = 0.09** is less than half of "small".
- At d = 0.09, the two distributions overlap by approximately **96%**. If you randomly picked one fly from the trained group and one from the untrained group, you would correctly guess which was which only about **52.5%** of the time (barely better than a coin flip).
- The reason the p-value is so extreme (p < 1e-200) is purely a function of sample size: with ~230,000 data points per group, even a trivially small mean difference becomes "statistically significant."
Imagine two bell curves that are nearly perfectly superimposed, with one shifted a hair to the right. That is what d = 0.09 looks like. No biologist would claim these groups are meaningfully different based on that overlap.
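Both of those figures follow directly from the normal model: the overlap coefficient of two unit-variance normals separated by d is 2*Phi(-|d|/2), and the chance of correctly ranking a random trained/untrained pair (the "common language" effect size) is Phi(d/sqrt(2)):

```python
from math import sqrt
from scipy.stats import norm

d = 0.09
overlap = 2 * norm.cdf(-abs(d) / 2)       # overlap coefficient of two N(mu, 1) curves
prob_superiority = norm.cdf(d / sqrt(2))  # chance a random trained draw exceeds a random untrained draw

print(f"overlap: {overlap:.1%}")                              # ~96.4%
print(f"probability of superiority: {prob_superiority:.1%}")  # ~52.5%
```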
### 2.3 The Pseudoreplication Problem
This is the fatal flaw in the current analysis. Here is the logic:
1. We have **N = 18 trained ROIs** and **N = 18 untrained ROIs** (the true biological replicates).
2. Each ROI contributes thousands of time-point measurements (one distance value every ~40ms).
3. The current code concatenates all time points from all ROIs into a single array per group, yielding ~230,000 "observations" per group.
4. A t-test is run on these 230,000 vs 230,000 values.
The problem: **time points within a single ROI are not independent observations.** The distance at time t is highly correlated with the distance at time t+40ms (the flies do not teleport between frames). Furthermore, all time points from one ROI reflect the behavior of the *same pair of flies*.
This is called **pseudoreplication** -- treating non-independent measurements as independent samples. It inflates your degrees of freedom from ~34 (the true df for 18 vs 18) to ~460,000, making your standard error artificially tiny and your p-value absurdly small.
**The correct N for any between-group test is the number of ROIs (biological replicates), not the number of time points.**
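The inflation is easy to demonstrate: take 18 per-ROI means per group, then pretend each one was observed 1,000 times (the limiting case of perfectly autocorrelated time points). The group means and SDs barely change, but the t statistic scales with sqrt(n), so the p-value collapses. The numbers below are synthetic stand-ins, not the real per-ROI means:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic per-ROI mean distances (18 ROIs per group, not real data)
trained = np.linspace(60, 100, 18)    # mean 80
untrained = np.linspace(55, 95, 18)   # mean 75

# Correct test: one value per biological replicate (df = 34)
t_roi, p_roi = ttest_ind(trained, untrained)

# Pseudoreplicated test: every ROI mean repeated 1,000 times (df = 35,998)
t_pool, p_pool = ttest_ind(np.repeat(trained, 1000), np.repeat(untrained, 1000))

print(f"per-ROI:          t = {t_roi:.2f}, p = {p_roi:.3f}")   # not significant
print(f"pseudoreplicated: t = {t_pool:.2f}, p = {p_pool:.2e}") # astronomically small p
```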
### 2.4 The Signal Dilution Problem
Even if we fix the pseudoreplication (by computing per-ROI summary statistics), the aggregate approach still has a deeper issue. It assumes that the training effect is **uniform** -- that every trained fly learns equally. But what if only a fraction of trained flies actually acquire the conditioned response?
If 6 out of 18 trained ROIs contain flies that genuinely learned (showing reduced distance or faster approach), while the other 12 trained ROIs contain flies that failed to learn (behaving identically to untrained flies), then:
- The mean of the trained group is pulled toward the untrained mean by the 12 non-learners.
- The effect size is diluted by a factor proportional to the non-learner fraction.
- No amount of data accumulation will fix this -- you are averaging signal with noise.
**This is why we need the bimodal hypothesis.**
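The dilution can be quantified with a simple mixture calculation. Assuming unit-variance normal subpopulations (illustrative numbers, not fitted values): if learners shift by d_true = 0.8 SD but only a third of trained flies learn, the observed group-level d drops to roughly a third of that:

```python
import numpy as np

d_true = 0.8   # true learner shift, in units of the untrained SD
f = 1 / 3      # fraction of trained flies that actually learned

# Trained group = mixture of N(0, 1) non-learners and shifted N(d_true, 1) learners
mix_mean_shift = f * d_true                # mixture mean shift: 0.267
mix_var = 1 + f * (1 - f) * d_true**2      # mixture variance: 1.142 (law of total variance)
pooled_sd = np.sqrt((1 + mix_var) / 2)     # pooled against the unit-variance untrained group
d_observed = mix_mean_shift / pooled_sd

print(f"observed d: {d_observed:.3f}")  # ~0.258, despite d_true = 0.8 in the learners
```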
---
## 3. The Bimodal Hypothesis
### 3.1 The Core Prediction
The hypothesis is simple:
> **The trained group is not one population -- it is a mixture of two subpopulations: true learners and non-learners.**
Concretely, when you compute a single summary statistic for each ROI (e.g., the mean distance in the first 300 seconds after barrier opening), you should see:
- **Untrained group:** A single cluster (unimodal distribution). All untrained flies behave similarly because none of them received conditioning.
- **Trained group:** Two clusters (bimodal distribution).
- **Cluster 1 (non-learners):** These flies did not acquire the conditioned response. Their summary statistics overlap with the untrained distribution.
- **Cluster 2 (true learners):** These flies *did* learn. Their summary statistics are shifted -- e.g., lower mean distance (they approach the other fly), higher fraction of time at close proximity, or different velocity profiles.
### 3.2 Why This Matters
If the bimodal hypothesis is correct, the appropriate analysis is not "trained vs untrained" as a whole, but rather:
1. **Identify** which trained ROIs are learners vs non-learners.
2. **Compare** learners vs untrained (this should show a large effect size).
3. **Report** the learning rate (what fraction of trained flies actually learned).
This transforms a "disappointing d = 0.09" result into a potentially strong finding: "X% of trained flies show a robust behavioral shift (d = ?) while the remainder do not learn."
### 3.3 Visual Intuition
Imagine plotting a histogram of "mean post-opening distance" for each ROI:
```
Untrained (N=18): Trained (N=18):
Count Count
| |
| **** | *** **
| ****** | ***** ****
| ******** | ******* ******
|********** |********* ********
+-----------> distance +-----------> distance
(unimodal) (bimodal)
```
The untrained histogram has one peak. The trained histogram has two peaks: one overlapping with the untrained distribution (non-learners) and one shifted to the left (learners who maintain closer proximity).
---
## 4. The Methodology Shift: Per-ROI Summary Statistics
### 4.1 The Key Idea
Instead of feeding raw time-series data into group comparisons, **collapse each ROI into a single summary value** (or a small set of features). This:
1. Eliminates pseudoreplication (each ROI contributes exactly one data point).
2. Gives you the true sample size (N per group = number of ROIs).
3. Enables distribution-level analyses (bimodality testing, mixture modeling).
### 4.2 Which Summary Statistics to Compute
For each unique (machine_name, ROI) combination, compute the following over the **post-opening window** (aligned_time > 0, typically 0 to 300,000 ms, i.e., 0 to 300 seconds):
| Feature | Description | Rationale |
|---------|-------------|-----------|
| `mean_distance` | Mean distance across all post-opening time points | Overall proximity measure |
| `median_distance` | Median distance | Robust to outliers (e.g., tracking artifacts) |
| `frac_close` | Fraction of time points where distance < 50 px | Captures how often flies are in close contact |
| `frac_very_close` | Fraction of time points where distance < 25 px | Stricter proximity threshold |
| `distance_slope` | Slope of a linear fit to distance vs time (post-opening) | Captures whether flies approach over time |
| `mean_velocity` | Mean of max_velocity in the 50-200s window | Activity / approach speed (from velocity data) |
| `distance_cv` | Coefficient of variation of distance | Behavioral variability within a trial |
You may also want a **pre-opening baseline** for each ROI (e.g., mean distance for aligned_time < 0) to use as a covariate or to compute a **change score** (post - pre) that controls for baseline chamber differences.
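A change score can be computed the same way as the post-opening features: aggregate per ROI once on the pre-opening window, once on the post-opening window, join on (machine_name, ROI), and subtract. A minimal sketch on toy data (column names follow the aligned CSVs; the values are made up):

```python
import pandas as pd

# Toy aligned data: one ROI, a few pre- and post-opening time points (ms)
df = pd.DataFrame({
    "machine_name": ["ETHOSCOPE_076"] * 6,
    "ROI": [1] * 6,
    "aligned_time": [-120, -80, -40, 40, 80, 120],
    "distance": [160.0, 150.0, 155.0, 90.0, 80.0, 70.0],
})

keys = ["machine_name", "ROI"]
pre = (df[df["aligned_time"] < 0]
       .groupby(keys)["distance"].mean().rename("pre_mean"))
post = (df[df["aligned_time"] > 0]
        .groupby(keys)["distance"].mean().rename("post_mean"))

change = pd.concat([pre, post], axis=1).reset_index()
change["delta"] = change["post_mean"] - change["pre_mean"]  # negative = flies moved closer
print(change)
```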
### 4.3 Why 0-300 Seconds?
The first 300 seconds (5 minutes) after barrier opening is the critical window where social approach behavior is most apparent. Beyond this, flies may habituate or begin other behaviors. The existing plots (`avg_distance_300_seconds.png`) focus on this window for the same reason.
You can and should also try other windows (e.g., 0-120s for early approach, 60-300s to skip the initial startle) to check robustness.
---
## 5. Step-by-Step Analysis Code
All code below uses Python with pandas, numpy, scipy, matplotlib, and scikit-learn. Run everything within the project virtual environment.
### Phase 1: Per-ROI Feature Extraction
```python
"""
Phase 1: Compute per-ROI summary statistics from the aligned distance data.
This is the foundation of the bimodal analysis. Each ROI becomes a single
data point instead of thousands of time-series observations.
"""
import pandas as pd
import numpy as np
from scipy import stats as sp_stats
from pathlib import Path
# ------------------------------------------------------------------
# Load data
# ------------------------------------------------------------------
DATA_PROCESSED = Path("data/processed")
trained = pd.read_csv(DATA_PROCESSED / "trained_distances_aligned.csv")
untrained = pd.read_csv(DATA_PROCESSED / "untrained_distances_aligned.csv")
# Combine into one DataFrame for uniform processing
df = pd.concat([trained, untrained], ignore_index=True)
# Verify the true sample sizes
roi_counts = (
    df.groupby("group")[["machine_name", "ROI"]]
    .apply(lambda g: g.drop_duplicates().shape[0])
)
print("ROIs per group:")
print(roi_counts)
# Expected: ~15 per group; verify against the metadata (machine 139 has no tracking data)
# ------------------------------------------------------------------
# Define the post-opening analysis window
# aligned_time is in milliseconds; 300 seconds = 300,000 ms
# ------------------------------------------------------------------
POST_START_MS = 0
POST_END_MS = 300_000
post = df[(df["aligned_time"] > POST_START_MS) & (df["aligned_time"] <= POST_END_MS)].copy()
# Drop rows where distance is NaN (frames with only one small fly detected)
post = post.dropna(subset=["distance"])
# ------------------------------------------------------------------
# Compute per-ROI features using groupby
# ------------------------------------------------------------------
def compute_roi_features(group_df):
    """Compute summary features for a single ROI's post-opening data.

    Args:
        group_df (pd.DataFrame): All post-opening rows for one ROI.

    Returns:
        pd.Series: Summary features.
    """
    d = group_df["distance"]
    return pd.Series({
        "mean_distance": d.mean(),
        "median_distance": d.median(),
        "std_distance": d.std(),
        "frac_close_50": (d < 50).mean(),  # fraction of time at < 50 px
        "frac_close_25": (d < 25).mean(),  # fraction of time at < 25 px
        "distance_cv": d.std() / d.mean() if d.mean() > 0 else np.nan,
        "n_timepoints": len(d),
        "group": group_df["group"].iloc[0],
    })

roi_features = (
    post.groupby(["machine_name", "ROI"])
    .apply(compute_roi_features)
    .reset_index()
)
print(f"\nPer-ROI features computed: {len(roi_features)} ROIs")
print(roi_features.head(10))
# ------------------------------------------------------------------
# Split by group for downstream analysis
# ------------------------------------------------------------------
trained_features = roi_features[roi_features["group"] == "trained"].copy()
untrained_features = roi_features[roi_features["group"] == "untrained"].copy()
print(f"\nTrained ROIs: {len(trained_features)}")
print(f"Untrained ROIs: {len(untrained_features)}")
# Save for later use
roi_features.to_csv(DATA_PROCESSED / "roi_summary_features.csv", index=False)
print("Saved roi_summary_features.csv")
```
**What this gives you:** A DataFrame with one row per ROI and columns for each summary statistic. This is your true dataset for all subsequent analyses.
### Phase 2: Visualization -- Histograms and KDE Plots
```python
"""
Phase 2: Visualize the per-ROI feature distributions.
The key question: does the trained group look bimodal while the
untrained group looks unimodal?
"""
import matplotlib.pyplot as plt
import seaborn as sns
# ------------------------------------------------------------------
# Histogram + KDE for each feature, side by side
# ------------------------------------------------------------------
features_to_plot = ["mean_distance", "median_distance", "frac_close_50"]
fig, axes = plt.subplots(len(features_to_plot), 2, figsize=(12, 4 * len(features_to_plot)))
for i, feature in enumerate(features_to_plot):
    # Left panel: Untrained
    ax_u = axes[i, 0]
    vals_u = untrained_features[feature].dropna().values
    ax_u.hist(vals_u, bins=8, density=True, alpha=0.6, color="steelblue", edgecolor="black")
    if len(vals_u) > 2:
        sns.kdeplot(vals_u, ax=ax_u, color="navy", linewidth=2)
    ax_u.set_title(f"Untrained -- {feature}")
    ax_u.set_xlabel(feature)
    ax_u.set_ylabel("Density")

    # Right panel: Trained
    ax_t = axes[i, 1]
    vals_t = trained_features[feature].dropna().values
    ax_t.hist(vals_t, bins=8, density=True, alpha=0.6, color="salmon", edgecolor="black")
    if len(vals_t) > 2:
        sns.kdeplot(vals_t, ax=ax_t, color="darkred", linewidth=2)
    ax_t.set_title(f"Trained -- {feature}")
    ax_t.set_xlabel(feature)
    ax_t.set_ylabel("Density")
plt.tight_layout()
plt.savefig("figures/bimodal_roi_distributions.png", dpi=200, bbox_inches="tight")
plt.show()
# ------------------------------------------------------------------
# Strip/swarm plot overlay: both groups on one axis
# ------------------------------------------------------------------
fig, axes = plt.subplots(1, len(features_to_plot), figsize=(5 * len(features_to_plot), 5))
for i, feature in enumerate(features_to_plot):
    ax = axes[i]
    sns.stripplot(
        data=roi_features, x="group", y=feature, hue="group", ax=ax,
        jitter=True, alpha=0.7, size=8, legend=False,
        palette={"trained": "salmon", "untrained": "steelblue"},
    )
    ax.set_title(feature)
plt.tight_layout()
plt.savefig("figures/bimodal_stripplots.png", dpi=200, bbox_inches="tight")
plt.show()
```
**What to look for:**
- The untrained histograms should show a single cluster of points.
- The trained histograms should show two clusters -- some ROIs overlapping with the untrained values (non-learners) and some ROIs clearly separated (learners).
- In the strip plots, look for a gap or split among the trained dots that does not appear in the untrained dots.
### Phase 3: Hartigan's Dip Test for Bimodality
The dip test is a formal statistical test of unimodality. A significant p-value means the distribution is unlikely to be unimodal.
```python
"""
Phase 3: Hartigan's dip test for bimodality.
Install the package first:
pip install diptest
"""
import diptest
# ------------------------------------------------------------------
# Run dip test on each group for each feature
# ------------------------------------------------------------------
features_to_test = ["mean_distance", "median_distance", "frac_close_50"]
print("=== HARTIGAN'S DIP TEST ===\n")
for feature in features_to_test:
    print(f"Feature: {feature}")
    for group_name, group_df in [("trained", trained_features), ("untrained", untrained_features)]:
        values = group_df[feature].dropna().values
        if len(values) < 4:
            print(f"  {group_name}: Not enough data points (N={len(values)})")
            continue
        dip_stat, p_value = diptest.diptest(values)
        print(f"  {group_name}: dip statistic = {dip_stat:.4f}, p-value = {p_value:.4f}")
        if p_value < 0.05:
            print("  --> SIGNIFICANT: Evidence against unimodality (p < 0.05)")
        else:
            print("  --> Not significant: Consistent with unimodality")
    print()
```
**Interpreting the dip test:**
- **Null hypothesis:** The distribution is unimodal.
- **Alternative:** The distribution has more than one mode (bimodal, multimodal).
- If the trained group yields p < 0.05 and the untrained group does not, this is direct evidence for the bimodal hypothesis.
- **Caution:** With N=18, the dip test has limited power. A non-significant result does not prove unimodality -- it may simply mean the sample is too small to detect bimodality. This is why we also use Gaussian Mixture Models (next phase).
### Phase 4: Gaussian Mixture Model (GMM) Fitting
A more informative approach than the dip test: fit 1-component and 2-component Gaussian mixture models and compare them using the Bayesian Information Criterion (BIC).
```python
"""
Phase 4: Fit Gaussian Mixture Models to the per-ROI feature distributions.
Compare 1-component (unimodal) vs 2-component (bimodal) models using BIC.
Lower BIC = better model fit, penalizing for complexity.
"""
from sklearn.mixture import GaussianMixture
import numpy as np
# ------------------------------------------------------------------
# Fit GMMs and compare BIC
# ------------------------------------------------------------------
features_to_fit = ["mean_distance", "median_distance", "frac_close_50"]
print("=== GAUSSIAN MIXTURE MODEL COMPARISON ===\n")
gmm_results = {}
for feature in features_to_fit:
print(f"Feature: {feature}")
for group_name, group_df in [("trained", trained_features), ("untrained", untrained_features)]:
values = group_df[feature].dropna().values.reshape(-1, 1)
if len(values) < 4:
print(f" {group_name}: Not enough data points")
continue
# Fit 1-component GMM (unimodal)
gmm1 = GaussianMixture(n_components=1, random_state=42)
gmm1.fit(values)
bic1 = gmm1.bic(values)
# Fit 2-component GMM (bimodal)
gmm2 = GaussianMixture(n_components=2, random_state=42)
gmm2.fit(values)
bic2 = gmm2.bic(values)
delta_bic = bic1 - bic2 # positive = 2-component model is better
print(f" {group_name}:")
print(f" BIC (1 component): {bic1:.2f}")
print(f" BIC (2 components): {bic2:.2f}")
print(f" Delta BIC (1 - 2): {delta_bic:.2f}")
if delta_bic > 10:
print(f" --> STRONG evidence for 2 components (Delta BIC > 10)")
elif delta_bic > 2:
print(f" --> Moderate evidence for 2 components")
elif delta_bic > 0:
print(f" --> Weak evidence for 2 components")
else:
print(f" --> 1 component (unimodal) preferred")
if group_name == "trained":
gmm_results[feature] = {
"gmm1": gmm1,
"gmm2": gmm2,
"bic1": bic1,
"bic2": bic2,
"delta_bic": delta_bic,
}
print()
```
**Interpreting Delta BIC:**
| Delta BIC (BIC_1comp - BIC_2comp) | Interpretation |
|---|---|
| > 10 | Strong evidence that 2 components fits better |
| 2 -- 10 | Moderate evidence for 2 components |
| 0 -- 2 | Weak / negligible evidence |
| < 0 | 1-component model preferred (unimodal) |
**Expected result if hypothesis is correct:**
- Trained group: Delta BIC >> 0 (2 components preferred)
- Untrained group: Delta BIC <= 0 (1 component preferred)
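The Delta BIC thresholds in the table have a concrete likelihood interpretation. Since BIC = k ln(n) - 2 ln(L), moving from a 1-component to a 2-component 1-D GMM adds 3 free parameters (an extra mean, an extra variance, and a mixing weight). A minimal sketch of the arithmetic at this study's sample size:

```python
import numpy as np

n = 18          # ROIs per group
k1, k2 = 2, 5   # free parameters: (mean, var) vs (2 means, 2 variances, 1 weight)

penalty_gap = (k2 - k1) * np.log(n)   # extra BIC penalty paid by the 2-component model
min_gain = penalty_gap / 2            # log-likelihood gain needed just to break even
print(round(penalty_gap, 2), round(min_gain, 2))  # 8.67 4.34
```

In other words, with N=18 the 2-component model must improve the log-likelihood by about 4.3 nats before its BIC even matches the 1-component model, which is why a positive Delta BIC at this sample size is meaningful.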
### Phase 5: Subgroup Classification
If the 2-component GMM is preferred for the trained group, use it to classify each trained ROI as a "learner" or "non-learner."
```python
"""
Phase 5: Classify trained ROIs into learners vs non-learners
using GMM posterior probabilities.
"""
import matplotlib.pyplot as plt
import numpy as np
# ------------------------------------------------------------------
# Choose the feature with the best bimodal evidence
# ------------------------------------------------------------------
# (Adjust this based on Phase 4 results -- use the feature with
# the highest Delta BIC for the trained group)
BEST_FEATURE = "mean_distance"
values_trained = trained_features[BEST_FEATURE].dropna().values.reshape(-1, 1)
# Fit 2-component GMM
gmm2 = GaussianMixture(n_components=2, random_state=42)
gmm2.fit(values_trained)
# Get posterior probabilities: P(component | data point)
probs = gmm2.predict_proba(values_trained) # shape: (N, 2)
labels = gmm2.predict(values_trained) # hard assignment
# Identify which component is the "learner" component
# Learners should have LOWER mean distance (approach behavior)
component_means = gmm2.means_.flatten()
learner_component = np.argmin(component_means) # lower distance = learner
non_learner_component = 1 - learner_component
print(f"Component means: {component_means}")
print(f"Learner component: {learner_component} (mean = {component_means[learner_component]:.2f})")
print(f"Non-learner component: {non_learner_component} (mean = {component_means[non_learner_component]:.2f})")
# ------------------------------------------------------------------
# Assign labels to trained ROIs
# ------------------------------------------------------------------
trained_features_classified = trained_features.dropna(subset=[BEST_FEATURE]).copy()
trained_features_classified["learner_prob"] = probs[:, learner_component]
trained_features_classified["is_learner"] = labels == learner_component
n_learners = trained_features_classified["is_learner"].sum()
n_total = len(trained_features_classified)
print(f"\nClassification results:")
print(f" Learners: {n_learners} / {n_total} ({100 * n_learners / n_total:.1f}%)")
print(f" Non-learners: {n_total - n_learners} / {n_total} ({100 * (n_total - n_learners) / n_total:.1f}%)")
# ------------------------------------------------------------------
# Visualization: GMM fit with classification
# ------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8, 5))
x_range = np.linspace(values_trained.min() - 10, values_trained.max() + 10, 300).reshape(-1, 1)
# Plot overall GMM density
log_dens = gmm2.score_samples(x_range)
ax.plot(x_range, np.exp(log_dens), color="black", linewidth=2, label="GMM fit (2 components)")
# Plot individual component densities
weights = gmm2.weights_
means = gmm2.means_.flatten()
covariances = gmm2.covariances_.flatten()
for k in range(2):
component_dens = (
weights[k]
* (1 / np.sqrt(2 * np.pi * covariances[k]))
* np.exp(-0.5 * (x_range.flatten() - means[k]) ** 2 / covariances[k])
)
lbl = "Learners" if k == learner_component else "Non-learners"
clr = "green" if k == learner_component else "gray"
ax.plot(x_range, component_dens, color=clr, linewidth=1.5, linestyle="--", label=lbl)
# Scatter the actual ROI values
for _, row in trained_features_classified.iterrows():
color = "green" if row["is_learner"] else "gray"
ax.axvline(row[BEST_FEATURE], color=color, alpha=0.4, linewidth=1)
ax.set_xlabel(BEST_FEATURE)
ax.set_ylabel("Density")
ax.set_title("GMM Classification of Trained ROIs")
ax.legend()
plt.tight_layout()
plt.savefig("figures/gmm_classification.png", dpi=200, bbox_inches="tight")
plt.show()
# ------------------------------------------------------------------
# Print per-ROI classification table
# ------------------------------------------------------------------
print("\nPer-ROI Classification:")
print(trained_features_classified[
["machine_name", "ROI", BEST_FEATURE, "learner_prob", "is_learner"]
].to_string(index=False))
```
**Important note about the classification threshold:** The GMM `predict()` method assigns each point to the component with the highest posterior probability. For borderline cases (learner_prob near 0.5), you may want to be conservative and only label ROIs as learners if `learner_prob > 0.8`. Report your threshold and sensitivity analysis.
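One way to implement that conservative rule is a three-way labeling that leaves borderline ROIs as "uncertain" rather than forcing a hard call. A sketch with hypothetical posterior probabilities (the learner-component column of `predict_proba`):

```python
import numpy as np

# Hypothetical posterior probabilities for the learner component
learner_prob = np.array([0.98, 0.91, 0.55, 0.47, 0.12, 0.03])

THRESHOLD = 0.8  # conservative cutoff; report sensitivity to this choice
labels = np.where(learner_prob > THRESHOLD, "learner",
                  np.where(learner_prob < 1 - THRESHOLD, "non-learner", "uncertain"))
print(labels)  # ['learner' 'learner' 'uncertain' 'uncertain' 'non-learner' 'non-learner']
```

Re-running the Phase 6 effect sizes with and without the "uncertain" ROIs is a simple sensitivity analysis to report.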
### Phase 6: Effect Size Re-estimation
Now compare the *learner* subgroup against untrained flies, using appropriate small-sample statistics.
```python
"""
Phase 6: Re-estimate effect sizes after subgroup identification.
With small N, use Mann-Whitney U (non-parametric) and bootstrap confidence intervals.
"""
from scipy.stats import mannwhitneyu
import numpy as np
# ------------------------------------------------------------------
# Setup: get the feature values for each group
# ------------------------------------------------------------------
FEATURE = "mean_distance"
learner_vals = trained_features_classified.loc[
trained_features_classified["is_learner"], FEATURE
].values
non_learner_vals = trained_features_classified.loc[
~trained_features_classified["is_learner"], FEATURE
].values
untrained_vals = untrained_features[FEATURE].dropna().values
print(f"Group sizes:")
print(f" Learners: N = {len(learner_vals)}")
print(f" Non-learners: N = {len(non_learner_vals)}")
print(f" Untrained: N = {len(untrained_vals)}")
# ------------------------------------------------------------------
# Mann-Whitney U test: Learners vs Untrained
# ------------------------------------------------------------------
if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
u_stat, p_value = mannwhitneyu(learner_vals, untrained_vals, alternative="two-sided")
print(f"\nMann-Whitney U test (Learners vs Untrained):")
print(f" U = {u_stat:.1f}, p = {p_value:.4f}")
# Rank-biserial correlation as non-parametric effect size
n1, n2 = len(learner_vals), len(untrained_vals)
rank_biserial = 1 - (2 * u_stat) / (n1 * n2)
print(f" Rank-biserial correlation (effect size): r = {rank_biserial:.3f}")
else:
print("\nNot enough data points for Mann-Whitney U test.")
# ------------------------------------------------------------------
# Mann-Whitney U test: Non-learners vs Untrained
# (These should NOT differ if classification is correct)
# ------------------------------------------------------------------
if len(non_learner_vals) >= 3 and len(untrained_vals) >= 3:
u_stat_nl, p_value_nl = mannwhitneyu(non_learner_vals, untrained_vals, alternative="two-sided")
print(f"\nMann-Whitney U test (Non-learners vs Untrained):")
print(f" U = {u_stat_nl:.1f}, p = {p_value_nl:.4f}")
print(f" (Expected: NOT significant -- non-learners should resemble untrained)")
# ------------------------------------------------------------------
# Bootstrap confidence interval for the mean difference
# ------------------------------------------------------------------
def bootstrap_mean_diff(group_a, group_b, n_bootstrap=10000, ci=95, seed=42):
"""Compute bootstrap CI for difference in means (A - B).
Args:
group_a (np.ndarray): Values from group A.
group_b (np.ndarray): Values from group B.
n_bootstrap (int): Number of bootstrap iterations.
ci (float): Confidence interval percentage.
seed (int): Random seed for reproducibility.
Returns:
tuple: (observed_diff, ci_lower, ci_upper).
"""
rng = np.random.default_rng(seed)
observed_diff = np.mean(group_a) - np.mean(group_b)
boot_diffs = np.empty(n_bootstrap)
for i in range(n_bootstrap):
boot_a = rng.choice(group_a, size=len(group_a), replace=True)
boot_b = rng.choice(group_b, size=len(group_b), replace=True)
boot_diffs[i] = np.mean(boot_a) - np.mean(boot_b)
alpha = (100 - ci) / 2
ci_lower = np.percentile(boot_diffs, alpha)
ci_upper = np.percentile(boot_diffs, 100 - alpha)
return observed_diff, ci_lower, ci_upper
if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
diff, ci_lo, ci_hi = bootstrap_mean_diff(learner_vals, untrained_vals)
print(f"\nBootstrap 95% CI for mean difference (Learners - Untrained):")
print(f" Observed difference: {diff:.2f}")
print(f" 95% CI: [{ci_lo:.2f}, {ci_hi:.2f}]")
if ci_lo > 0 or ci_hi < 0:
print(f" --> CI does not include zero: significant difference")
else:
print(f" --> CI includes zero: difference not significant at 95% level")
# ------------------------------------------------------------------
# Cohen's d with Hedges' g correction (for small N)
# ------------------------------------------------------------------
def hedges_g(group_a, group_b):
"""Compute Hedges' g (bias-corrected Cohen's d for small samples).
Args:
group_a (np.ndarray): Values from group A.
group_b (np.ndarray): Values from group B.
Returns:
float: Hedges' g effect size.
"""
n1, n2 = len(group_a), len(group_b)
mean_diff = np.mean(group_a) - np.mean(group_b)
pooled_std = np.sqrt(
((n1 - 1) * np.var(group_a, ddof=1) + (n2 - 1) * np.var(group_b, ddof=1))
/ (n1 + n2 - 2)
)
d = mean_diff / pooled_std if pooled_std > 0 else np.nan
# Hedges' correction factor for small-sample bias
correction = 1 - (3 / (4 * (n1 + n2) - 9))
return d * correction
if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
g = hedges_g(learner_vals, untrained_vals)
print(f"\nHedges' g (Learners vs Untrained): {g:.3f}")
print(f" (Compare to the original aggregate Cohen's d = 0.09)")
```
**What to expect:** If the bimodal hypothesis is correct, the Hedges' g for learners vs untrained should be substantially larger than the original d = 0.09, potentially in the "medium" (0.5) or "large" (0.8) range. The non-learners vs untrained comparison should be non-significant with a near-zero effect size.
---
## 6. Expected Results and Interpretation
### 6.1 If the Bimodal Hypothesis Is Correct
You should see:
1. **Histograms/KDE:** The trained group shows two visible clusters in at least one feature (e.g., mean_distance). The untrained group has a single peak.
2. **Dip test:** The trained group yields a significant (or trending) dip test result; the untrained group does not. (But remember: with N=18, power is limited.)
3. **GMM comparison:** The 2-component model has a substantially lower BIC than the 1-component model for the trained group. For the untrained group, the 1-component model is preferred (or the difference is negligible).
4. **Classification:** The GMM identifies a clear split -- some trained ROIs cluster with untrained values, others are distinctly separated.
5. **Effect sizes:** Learners vs untrained shows a large effect size (Hedges' g > 0.5). Non-learners vs untrained shows g near zero.
6. **Bootstrap CIs:** The confidence interval for the learner-vs-untrained mean difference does not include zero.
**This would be the ideal outcome.** It means the training works, but only on a subset of animals -- a biologically plausible result, since learning rates in Drosophila conditioning experiments are typically well below 100%.
### 6.2 If the Bimodal Hypothesis Is NOT Supported
If the distributions look unimodal for both groups (dip test non-significant, 1-component GMM preferred), there are several possible conclusions:
1. **The training genuinely has no effect** (or an effect too small to detect with this sample size). The aggregate d = 0.09 is the real signal -- negligible.
2. **The training has a uniform but tiny effect on all flies**, rather than a large effect on a subset. Both distributions shift slightly. This is hard to distinguish from "no effect" with N=18.
3. **The bimodal split exists but on a different feature** than the ones you tested. Try additional features: time-to-first-contact, latency to approach within 25px, peak velocity in the first 60 seconds, etc.
4. **The sample size is too small to detect bimodality.** With N=18, the dip test and GMM have limited power, especially if the learner fraction is small (e.g., 3 out of 18 learners). Consider whether additional data can be collected.
5. **The tracking data quality masks the signal.** Review individual ROI time series plots to check for artifacts (lost tracking, stuck detections, distance = 0 for extended periods). If some ROIs have unreliable data, they should be flagged or excluded.
**Be honest about null results.** A clear negative finding ("training does not produce bimodal behavior in this assay") is still a valuable scientific result. Do not p-hack or cherry-pick features until something is significant.
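Point 3 above is straightforward to act on. A minimal sketch of one such alternative feature, latency to approach within 25 px, on toy data (column names follow the pipeline; the real `post` DataFrame would be used in practice):

```python
import pandas as pd

# Toy aligned data (aligned_time in ms, distance in px)
df = pd.DataFrame({
    "machine_name": ["076"] * 6,
    "ROI": [1, 1, 1, 2, 2, 2],
    "aligned_time": [0, 1000, 2000, 0, 1000, 2000],
    "distance": [80.0, 30.0, 20.0, 90.0, 85.0, 88.0],
})

# Latency = earliest post-opening time at which the pair comes within 25 px
latency = (
    df[df["distance"] < 25]
    .groupby(["machine_name", "ROI"])["aligned_time"]
    .min()
)
print(latency)  # ROI 1 -> 2000 ms; ROI 2 never approaches, so it is absent (treat as censored)
```

Note that pairs which never approach are censored rather than missing; how you handle them (cap at session end, rank-based tests) should be stated explicitly.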
### 6.3 Sanity Check: Individual ROI Time Series
Before trusting the aggregate statistics, always look at the raw data. Plot the distance time series for each ROI individually:
```python
"""
Sanity check: Plot individual ROI distance traces.
"""
import matplotlib.pyplot as plt
import numpy as np
# One panel per ROI; size the grid to the data rather than hard-coding 6x3
roi_groups = list(post.groupby(["machine_name", "ROI"]))
n_cols = 3
n_rows = int(np.ceil(len(roi_groups) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows), sharex=True, sharey=True)
axes = np.atleast_1d(axes).flatten()
# Plot every ROI (both groups), colored by group
for idx, ((machine, roi), roi_data) in enumerate(roi_groups):
    ax = axes[idx]
    group = roi_data["group"].iloc[0]
    color = "salmon" if group == "trained" else "steelblue"
    ax.plot(
        roi_data["aligned_time"] / 1000,  # convert ms to seconds
        roi_data["distance"],
        color=color, alpha=0.5, linewidth=0.5,
    )
    ax.set_title(f"M{machine} ROI{roi} ({group})", fontsize=9)
    ax.set_ylim(0, 300)
for ax in axes[:len(roi_groups)]:
    ax.set_xlabel("Time post-opening (s)")
    ax.set_ylabel("Distance (px)")
for ax in axes[len(roi_groups):]:
    ax.set_visible(False)
plt.tight_layout()
plt.savefig("figures/individual_roi_traces.png", dpi=150, bbox_inches="tight")
plt.show()
```
This is invaluable for spotting tracking artifacts and getting intuition for whether there really is a subgroup of trained ROIs that behave differently.
---
## 7. Statistical Considerations
### 7.1 Small Sample Size (N=18 per Group)
With only 18 ROIs per group, you are working with a very small sample. This has several implications:
- **Parametric tests (t-tests) are unreliable** unless the data are approximately normally distributed. Always check with a Shapiro-Wilk test (`scipy.stats.shapiro(values)`) and use non-parametric alternatives (Mann-Whitney U) when normality is violated.
- **The dip test has low power** at N=18. A non-significant result is ambiguous -- it could mean the data are truly unimodal, or it could mean N is too small to detect bimodality. The GMM + BIC approach is somewhat more sensitive but still limited.
- **Confidence intervals are wide.** This is expected and honest. Report CIs alongside point estimates so readers understand the uncertainty.
- **Do not use the t-test on per-ROI summary statistics and call it a "t-test on 230K observations."** The N is 18, not 230K. This is the whole point of the methodology shift.
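The normality-then-choose-test workflow from the first bullet can be sketched as follows (toy data, not the project's values):

```python
import numpy as np
from scipy.stats import shapiro, mannwhitneyu, ttest_ind

rng = np.random.default_rng(0)
# Toy per-ROI values: a skewed group and a roughly normal group, N=18 each
a = rng.lognormal(mean=3.0, sigma=0.6, size=18)
b = rng.normal(loc=25.0, scale=5.0, size=18)

# Check normality first; fall back to Mann-Whitney U when it is violated
if min(shapiro(a).pvalue, shapiro(b).pvalue) < 0.05:
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    test_used = "Mann-Whitney U"
else:
    stat, p = ttest_ind(a, b)
    test_used = "t-test"
print(test_used, f"p = {p:.4f}")
```

Keep in mind that Shapiro-Wilk itself has low power at N=18, so when in doubt, default to the non-parametric test.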
### 7.2 Why Bootstrap Confidence Intervals
With N=18, the Central Limit Theorem provides only a rough approximation. Bootstrap CIs:
- Make no distributional assumptions.
- Work well even for non-normal, skewed, or heavy-tailed distributions.
- Give a direct estimate of the sampling variability of your effect size.
Use at least 10,000 bootstrap iterations. For publication, consider 100,000.
### 7.3 Recording Session as a Random Effect
ROIs within the same recording session share the same machine, lighting, temperature, and barrier-opening event. This creates a nested structure:
```
Group (trained/untrained)
-> Session (machine x date: 076-session1, 076-session2, 145-session1, ...)
-> ROI (1-6 within each session)
```
Ideally, you would account for this using a **mixed-effects model** with session as a random intercept:
```python
"""
Mixed-effects model (optional, for advanced analysis).
Install: pip install statsmodels
"""
import statsmodels.formula.api as smf
# Add a session identifier.
# NOTE: this placeholder gives every ROI within a machine its own "session",
# making the random effect degenerate. Build a real session ID from the
# metadata (machine name + date/HHMMSS recorded per session) before fitting.
roi_features["session"] = (
    roi_features["machine_name"].astype(str) + "_" +
    roi_features.groupby("machine_name").cumcount().astype(str)
)
# Fit mixed model: feature ~ group, random intercept for session
model = smf.mixedlm(
"mean_distance ~ group",
data=roi_features,
groups=roi_features["session"],
)
result = model.fit()
print(result.summary())
```
However, with only 5 sessions total (and some sessions contributing to only one group), a mixed model may be overparameterized. In practice, for this dataset:
- **At minimum**, verify that results hold when you average across sessions first (i.e., compute session-level means, then compare groups with N=5 rather than N=18).
- If session-level analysis gives the same qualitative result as ROI-level analysis, the session clustering is not driving your findings.
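The session-level averaging check can be sketched as follows (toy table; assumes a valid `session` column has already been constructed from the metadata):

```python
import pandas as pd

# Toy per-ROI feature table (hypothetical values)
roi_features = pd.DataFrame({
    "session": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "group": ["trained", "untrained"] * 3,
    "mean_distance": [40.0, 80.0, 55.0, 90.0, 45.0, 85.0],
})

# Collapse ROIs to one value per session x group; the comparison N is now sessions
session_means = (
    roi_features.groupby(["session", "group"])["mean_distance"]
    .mean()
    .unstack("group")
)
print(session_means)
```

With only a handful of sessions this comparison will rarely reach significance on its own; the point is to confirm the direction and rough magnitude of the ROI-level effect survive the aggregation.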
### 7.4 Multiple Comparisons
If you test bimodality across multiple features (mean_distance, median_distance, frac_close_50, velocity, etc.), you are performing multiple comparisons. Apply a correction:
- **Bonferroni:** Divide your alpha (0.05) by the number of tests. Conservative but simple.
- **Benjamini-Hochberg (FDR):** Controls the false discovery rate. Less conservative, appropriate for exploratory analysis.
```python
from scipy.stats import false_discovery_control  # requires SciPy >= 1.11
# Collect all p-values from dip tests
p_values = [0.03, 0.12, 0.45, 0.01] # example values
adjusted = false_discovery_control(p_values, method="bh")
print("BH-adjusted p-values:", adjusted)
```
Be transparent about how many features you tested and which correction you applied.
### 7.5 Circular Analysis Warning
There is a subtle danger if you use the same data to both (a) identify learners and (b) estimate the learner-vs-untrained effect size. This is called **circular analysis** or **double dipping** -- the effect size will be inflated because the GMM "optimizes" the split to maximize the difference.
To mitigate this:
- **Use one feature for classification and a different feature for effect size estimation.** For example, classify based on mean_distance but estimate the effect size on frac_close_50 or velocity.
- **Or use cross-validation:** Split the trained ROIs into two halves, fit the GMM on one half, classify the other half, and estimate the effect size on the classified half only.
- **Report the classification feature and the evaluation features separately** so reviewers can assess potential circularity.
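The cross-validation option can be sketched in a few lines. Here a simple midpoint threshold stands in for the GMM fit (toy data; in practice, fit `GaussianMixture` on one half and call `predict` on the other):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy trained-group feature values: a bimodal mixture (hypothetical)
values = np.concatenate([rng.normal(30, 5, 9), rng.normal(80, 5, 9)])

# Split-half: derive the classification rule on one half, apply it to the other
idx = rng.permutation(len(values))
fit_half, eval_half = values[idx[:9]], values[idx[9:]]

# Crude stand-in for the GMM: a midpoint threshold on the fitting half
threshold = (fit_half.min() + fit_half.max()) / 2
is_learner = eval_half < threshold  # lower distance = putative learner
print(f"threshold = {threshold:.1f}, learners in held-out half: {int(is_learner.sum())}")
```

Because only the held-out half is used for effect-size estimation, the "optimized split" bias is removed, at the cost of halving an already small N; averaging over repeated random splits partially compensates.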
---
## 8. References
### Software and Packages
- **diptest** -- Hartigan's dip test for unimodality
- Install: `pip install diptest`
- Usage: `diptest.diptest(data)` returns `(dip_statistic, p_value)`
- GitHub: https://github.com/RUrlus/diptest
- **scikit-learn GaussianMixture** -- Gaussian Mixture Model fitting
- Docs: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
- Key methods: `.fit()`, `.bic()`, `.predict()`, `.predict_proba()`, `.score_samples()`
- BIC comparison guide: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html
- **scipy.stats.mannwhitneyu** -- Non-parametric two-sample test
- Docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
- **statsmodels mixedlm** -- Linear mixed-effects models
- Docs: https://www.statsmodels.org/stable/mixed_linear.html
### Statistical Concepts
- **Pseudoreplication:** Hurlbert, S.H. (1984). "Pseudoreplication and the design of ecological field experiments." *Ecological Monographs*, 54(2), 187-211. The original paper defining the concept. Every experimental biologist should read this.
- **Hartigan's Dip Test:** Hartigan, J.A. & Hartigan, P.M. (1985). "The Dip Test of Unimodality." *Annals of Statistics*, 13(1), 70-84.
- **BIC for Model Selection:** Schwarz, G. (1978). "Estimating the Dimension of a Model." *Annals of Statistics*, 6(2), 461-464. Lower BIC = better model, with a penalty for additional parameters. A Delta BIC > 10 is conventionally considered "very strong" evidence.
- **Cohen's d interpretation:** Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). The standard reference for effect size conventions (small = 0.2, medium = 0.5, large = 0.8).
- **Hedges' g:** Hedges, L.V. (1981). "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." *Journal of Educational Statistics*, 6(2), 107-128. Bias-corrected version of Cohen's d for small samples.
- **Bootstrap methods:** Efron, B. & Tibshirani, R.J. (1993). *An Introduction to the Bootstrap.* The definitive reference for bootstrap confidence intervals.
---
## Summary: The Analysis Roadmap
```
Current state Goal
-------------------------------------------------------------------
230K observations per group ---> 18 per-ROI summary values per group
Cohen's d = 0.09 (96% overlap) ---> Separate learners (strong effect)
from non-learners (no effect)
One giant t-test ---> Bimodality test + GMM + subgroup comparison
Pseudoreplication ---> Correct N, honest uncertainty
```
**Steps in order:**
1. Compute per-ROI features (Phase 1) -- this is the prerequisite for everything else.
2. Visualize (Phase 2) -- look before you test.
3. Test bimodality formally (Phases 3 and 4).
4. If bimodal: classify and re-estimate (Phases 5 and 6).
5. If not bimodal: report honestly, consider alternative features or additional data.
6. Regardless of outcome: check individual ROI traces, account for session effects, and apply multiple comparison corrections.
Good luck with the analysis. The bimodal hypothesis is worth testing rigorously -- either it reveals a meaningful subgroup structure in the trained flies, or it confirms that the training effect in this assay is genuinely negligible. Both are informative results.

---
# Experimental Design: Barrier-Opening Social Interaction Assay
## Overview
This document describes the experimental design of a Drosophila behavioral tracking experiment conducted as part of the Cupido project. The experiment uses a barrier-opening assay to measure social interaction patterns between trained and untrained flies.
**Date**: July 15, 2025
**Species**: *Drosophila melanogaster*, Canton-S (CS) wild-type strain
## Assay Description
The barrier-opening assay places two flies in each Region of Interest (ROI), separated by a physical barrier. After a configurable delay from the start of recording, the barrier is manually opened, allowing the flies to interact socially. The primary behavioral metric is the **distance between the two flies over time** following barrier opening, which serves as a proxy for social engagement.
## Experimental Groups
- **Trained**: Flies that received prior social experience before the assay.
- **Untrained**: Socially naive flies with no prior social experience.
> **Note**: The exact training protocol (duration, conditions, group sizes) should be documented separately. "Trained" refers to flies with prior social experience; "untrained" refers to socially naive individuals.
## Equipment and Recording Parameters
| Parameter | Value |
|-----------|-------|
| Resolution | 1920 x 1088 pixels |
| Frame rate | 25 fps |
| Video codec | H.264 |
| Quality | 28q |
| ROIs per session | 6 (each containing a pair of flies) |
| Tracking output | SQLite databases (one per session) |
## Machines and Recording Sessions
Three ethoscope machines were used for tracking, with a fourth (Machine 139) having metadata but no tracking data.
| Machine | Session Start Time | Barrier Opening (s) | Status |
|---------|--------------------|----------------------|--------|
| ETHOSCOPE_076 | 16:03:10 | 52 | OK |
| ETHOSCOPE_076 | 16:31:34 | 25 | OK |
| ETHOSCOPE_145 | 16:03:27 | 42 | OK |
| ETHOSCOPE_145 | 16:31:41 | 20 | OK |
| ETHOSCOPE_268 | 16:32:05 | 75 | OK |
| ETHOSCOPE_139 | 16:31:52 | Not recorded | **DATA MISSING** |
**Total sessions**: 6 (5 with tracking data, 1 missing)
## ROI-to-Group Mapping
Each session contains 6 ROIs. The assignment of trained/untrained groups to ROIs varies across sessions.
| Machine | Session | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 5 | ROI 6 |
|---------|---------|-------|-------|-------|-------|-------|-------|
| 076 | 16:03:10 | Trained | Untrained | Trained | Untrained | Trained | Untrained |
| 076 | 16:31:34 | Trained | Trained | Trained | Untrained | Untrained | Untrained |
| 145 | 16:03:27 | Trained | Trained | Trained | Untrained | Untrained | Untrained |
| 145 | 16:31:41 | Trained | Trained | Trained | Untrained | Untrained | Untrained |
| 268 | 16:32:05 | Untrained | Untrained | Untrained | Trained | Trained | Trained |
| 139 | 16:31:52 | Trained | Trained | Trained | Untrained | Untrained | Untrained |
## Sample Sizes
| Group | ROIs (total) | ROIs (with data) | ROIs (missing) |
|-------|-------------|-------------------|----------------|
| Trained | 18 | 15 | 3 (Machine 139) |
| Untrained | 18 | 15 | 3 (Machine 139) |
| **Total** | **36** | **30** | **6** |
## Tracking Database Schema
Each recording session produces a SQLite database file containing tables `ROI_1` through `ROI_6`. Each table has the following columns:
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Row identifier |
| `t` | INTEGER | Timestamp in **milliseconds** from start of recording |
| `x` | REAL | Horizontal position in pixels |
| `y` | REAL | Vertical position in pixels |
| `w` | REAL | Width of detected object (pixels) |
| `h` | REAL | Height of detected object (pixels) |
| `phi` | REAL | Angle/orientation of detected object |
| `is_inferred` | INTEGER | Whether the position was inferred (not directly detected) |
| `has_interacted` | INTEGER | Whether an interaction was detected |
## Known Issues and Data Caveats
1. **Machine 139 missing data**: Metadata entries exist for ETHOSCOPE_139 (session 16:31:52) in the metadata CSV, but no corresponding tracking database file is present and no barrier opening time was recorded. This accounts for 6 missing ROIs (3 trained, 3 untrained). The cause needs investigation.
2. **Time unit mismatch between files**: The tracking databases store time (`t`) in **milliseconds**, while `2025_07_15_barrier_opening.csv` stores barrier opening times in **seconds**. The analysis pipeline converts barrier opening times to milliseconds for alignment.
3. **Machine name formatting inconsistency**: The metadata CSV stores machine identifiers as plain integers (e.g., `76`, `145`, `268`), while `2025_07_15_barrier_opening.csv` stores them zero-padded (e.g., `076`, `145`, `268`). Convert to zero-padded strings when matching between files and when constructing tracking database filenames (e.g., `ETHOSCOPE_076`).
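A one-line conversion handles issue 3 consistently in both directions (the exact column dtypes should still be checked on load):

```python
# Zero-pad machine identifiers so metadata, barrier CSV, and DB filenames agree
machine_raw = 76  # integer as read from the metadata CSV

machine_id = f"{int(machine_raw):03d}"   # "076"
db_prefix = f"ETHOSCOPE_{machine_id}"    # "ETHOSCOPE_076"
print(machine_id, db_prefix)  # 076 ETHOSCOPE_076
```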
## Source Files
| File | Description |
|------|-------------|
| `2025_07_15_metadata_fixed.csv` | ROI-to-group mapping (trained/untrained) |
| `2025_07_15_barrier_opening.csv` | Barrier opening times per machine/session |
| `*_tracking.db` | SQLite tracking databases (one per session) |

---
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Flies Behavior Analysis Pipeline\n",
"\n",
"This notebook implements the complete analysis pipeline for discriminating between trained and untrained flies based on their distance behavior."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import pandas as pd\nimport numpy as np\nimport sqlite3\nimport glob\nimport re\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\nfrom pathlib import Path\nimport sys\n\n# Set up paths relative to notebook location\nPROJECT_ROOT = Path(\"..\").resolve()\nDATA_RAW = PROJECT_ROOT / \"data\" / \"raw\"\nDATA_METADATA = PROJECT_ROOT / \"data\" / \"metadata\"\nDATA_PROCESSED = PROJECT_ROOT / \"data\" / \"processed\"\nFIGURES = PROJECT_ROOT / \"figures\"\n\nsys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n\n# Set plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load data from DB and save as CSV grouped by trained/untrained"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
    "source": "def load_roi_data():\n    \"\"\"Load ROI data from SQLite databases and group by trained/untrained\"\"\"\n    metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')\n    metadata['machine_name'] = metadata['machine_name'].astype(str)\n    \n    trained_rois = metadata[metadata['group'] == 'trained']\n    untrained_rois = metadata[metadata['group'] == 'untrained']\n    \n    db_files = list(DATA_RAW.glob('*_tracking.db'))\n    \n    trained_df = pd.DataFrame()\n    untrained_df = pd.DataFrame()\n    \n    for db_file in db_files:\n        print(f\"Processing {db_file.name}\")\n        \n        pattern = r'_([0-9a-f]{32})__'\n        match = re.search(pattern, db_file.name)\n        \n        if not match:\n            print(f\"Could not extract UUID from {db_file.name}\")\n            continue\n        \n        uuid = match.group(1)\n        metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]\n        \n        if metadata_matches.empty:\n            print(f\"No metadata matches found for UUID {uuid}\")\n            continue\n        \n        machine_id = metadata_matches.iloc[0]['machine_name']\n        print(f\"Matched to machine ID: {machine_id}\")\n        \n        conn = sqlite3.connect(str(db_file))\n        \n        machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]\n        machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]\n        \n        for _, row in machine_trained.iterrows():\n            roi = row['ROI']\n            try:\n                roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n                roi_data['machine_name'] = machine_id\n                roi_data['ROI'] = roi\n                roi_data['group'] = 'trained'\n                trained_df = pd.concat([trained_df, roi_data], ignore_index=True)\n            except Exception as e:\n                print(f\"Error loading ROI_{roi}: {e}\")\n        \n        for _, row in machine_untrained.iterrows():\n            roi = row['ROI']\n            try:\n                roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n                roi_data['machine_name'] = machine_id\n                roi_data['ROI'] = roi\n                roi_data['group'] = 'untrained'\n                untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)\n            except Exception as e:\n                print(f\"Error loading ROI_{roi}: {e}\")\n        \n        conn.close()\n    \n    return trained_df, untrained_df\n\ntrained_data, untrained_data = load_roi_data()\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\n\ntrained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)\nuntrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)\nprint(\"Data saved to CSV files\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Align data using barrier opening time as time 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "barrier_data = pd.read_csv(DATA_METADATA / '2025_07_15_barrier_opening.csv')\nbarrier_data['opening_time_ms'] = barrier_data['opening_time'] * 1000\nopening_times = dict(zip(barrier_data['machine'], barrier_data['opening_time_ms']))\nprint(\"Barrier opening times:\")\nprint(barrier_data)"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def align_to_opening_time(df, opening_times):\n",
" \"\"\"Align data to barrier opening time\"\"\"\n",
" # Add aligned time column\n",
" df_aligned = df.copy()\n",
" df_aligned['aligned_time'] = np.nan\n",
" \n",
" # Align each machine's data\n",
" for machine in df['machine_name'].unique():\n",
" if machine in opening_times:\n",
" opening_time = opening_times[machine]\n",
" mask = df['machine_name'] == machine\n",
" df_aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_time\n",
" \n",
" # Remove rows where aligned_time is NaN\n",
" df_aligned = df_aligned.dropna(subset=['aligned_time'])\n",
" \n",
" return df_aligned\n",
"\n",
"# Align the data\n",
"trained_aligned = align_to_opening_time(trained_data, opening_times)\n",
"untrained_aligned = align_to_opening_time(untrained_data, opening_times)\n",
"\n",
"print(f\"Trained aligned data shape: {trained_aligned.shape}\")\n",
"print(f\"Untrained aligned data shape: {untrained_aligned.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Calculate median area size in rows where two flies are being tracked"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_areas_with_two_flies(df):\n",
" \"\"\"Calculate median area size for time points with two flies\"\"\"\n",
" # Calculate area for each row\n",
" df['area'] = df['w'] * df['h']\n",
" \n",
" # Group by machine_name, ROI, and time to count flies per time point\n",
" fly_counts = df.groupby(['machine_name', 'ROI', 't']).size().reset_index(name='fly_count')\n",
" \n",
" # Filter for time points with exactly 2 flies\n",
" two_fly_times = fly_counts[fly_counts['fly_count'] == 2]\n",
" \n",
" # Merge back with original data to get areas for these time points\n",
" two_fly_data = pd.merge(df, two_fly_times[['machine_name', 'ROI', 't']], \n",
" on=['machine_name', 'ROI', 't'])\n",
" \n",
" # Calculate median area\n",
" median_area = two_fly_data['area'].median()\n",
" \n",
" return median_area, two_fly_data\n",
"\n",
"# Combine trained and untrained data for area calculation\n",
"combined_data = pd.concat([trained_aligned, untrained_aligned], ignore_index=True)\n",
"\n",
"# Calculate median area for time points with two flies\n",
"median_area, two_fly_data = calculate_areas_with_two_flies(combined_data)\n",
"print(f\"Median area size for time points with two flies: {median_area:.2f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Calculate distances taking into account area size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "def calculate_distances_with_area(df, median_area_threshold):\n \"\"\"Calculate distances between flies, setting to 0 for large single-fly detections\"\"\"\n df['area'] = df['w'] * df['h']\n results = []\n \n for (machine_name, roi, t), group in df.groupby(['machine_name', 'ROI', 'aligned_time']):\n group = group.sort_values('id').reset_index(drop=True)\n \n if len(group) >= 2:\n fly1 = group.iloc[0]\n fly2 = group.iloc[1]\n distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])\n results.append({\n 'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n 'distance': distance, 'n_flies': len(group),\n 'area_fly1': fly1['area'], 'area_fly2': fly2['area'],\n 'group': fly1['group']\n })\n elif len(group) == 1:\n fly = group.iloc[0]\n area = fly['area']\n distance = 0.0 if area > 1.5 * median_area_threshold else np.nan\n results.append({\n 'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n 'distance': distance, 'n_flies': 1,\n 'area_fly1': area, 'area_fly2': np.nan,\n 'group': fly['group']\n })\n \n return pd.DataFrame(results)\n\ntrained_distances = calculate_distances_with_area(trained_aligned, median_area)\nuntrained_distances = calculate_distances_with_area(untrained_aligned, median_area)\n\nprint(f\"Trained distances shape: {trained_distances.shape}\")\nprint(f\"Untrained distances shape: {untrained_distances.shape}\")\n\ntrained_distances.to_csv(DATA_PROCESSED / 'trained_distances_aligned.csv', index=False)\nuntrained_distances.to_csv(DATA_PROCESSED / 'untrained_distances_aligned.csv', index=False)\nprint(\"Distance data saved to CSV files\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Plot averaged lines of trained vs untrained for the entire experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "trained_clean = trained_distances.dropna(subset=['distance'])\nuntrained_clean = untrained_distances.dropna(subset=['distance'])\n\ntrained_avg = trained_clean.groupby('aligned_time')['distance'].mean()\nuntrained_avg = untrained_clean.groupby('aligned_time')['distance'].mean()\n\nwindow_size = 50\ntrained_smooth = trained_avg.rolling(window=window_size, center=True).mean()\nuntrained_smooth = untrained_avg.rolling(window=window_size, center=True).mean()\n\nplt.figure(figsize=(15, 8))\nplt.plot(trained_smooth.index/1000, trained_smooth.values, label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth.index/1000, untrained_smooth.values, label='Untrained (smoothed)', color='red', linewidth=2)\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies Over Entire Experiment')\nplt.legend()\nplt.grid(True, alpha=0.3)\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_entire_experiment.png', dpi=300, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Same plot but ending at time +300 seconds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "trained_filtered = trained_clean[trained_clean['aligned_time'] <= 300000]\nuntrained_filtered = untrained_clean[untrained_clean['aligned_time'] <= 300000]\n\ntrained_avg_300 = trained_filtered.groupby('aligned_time')['distance'].mean()\nuntrained_avg_300 = untrained_filtered.groupby('aligned_time')['distance'].mean()\n\ntrained_smooth_300 = trained_avg_300.rolling(window=window_size, center=True).mean()\nuntrained_smooth_300 = untrained_avg_300.rolling(window=window_size, center=True).mean()\n\nplt.figure(figsize=(15, 8))\nplt.plot(trained_smooth_300.index/1000, trained_smooth_300.values, label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth_300.index/1000, untrained_smooth_300.values, label='Untrained (smoothed)', color='red', linewidth=2)\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies (First 300 Seconds Post-Opening)')\nplt.legend()\nplt.grid(True, alpha=0.3)\nplt.xlim(-150, 300)\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_300_seconds.png', dpi=300, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary Statistics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"=== SUMMARY STATISTICS ===\")\n",
"print(f\"Median area size for two-fly detections: {median_area:.2f}\")\n",
"\n",
"print(\"\\nPre-opening period (t < 0):\")\n",
"trained_pre = trained_clean[trained_clean['aligned_time'] < 0]['distance']\n",
"untrained_pre = untrained_clean[untrained_clean['aligned_time'] < 0]['distance']\n",
"print(f\" Trained mean distance: {trained_pre.mean():.2f}\")\n",
"print(f\" Untrained mean distance: {untrained_pre.mean():.2f}\")\n",
"\n",
"print(\"\\nPost-opening period (t > 0):\")\n",
"trained_post = trained_clean[trained_clean['aligned_time'] > 0]['distance']\n",
"untrained_post = untrained_clean[untrained_clean['aligned_time'] > 0]['distance']\n",
"print(f\" Trained mean distance: {trained_post.mean():.2f}\")\n",
"print(f\" Untrained mean distance: {untrained_post.mean():.2f}\")\n",
"\n",
"# Statistical test\n",
"t_stat, p_val = stats.ttest_ind(trained_post, untrained_post)\n",
"cohens_d = (trained_post.mean() - untrained_post.mean()) / np.sqrt(((len(trained_post)-1)*trained_post.var() + (len(untrained_post)-1)*untrained_post.var()) / (len(trained_post) + len(untrained_post) - 2))\n",
"\n",
"print(f\"\\nPost-opening comparison (trained vs untrained):\")\n",
"print(f\" T-statistic: {t_stat:.4f}\")\n",
"print(f\" P-value: {p_val:.2e}\")\n",
"print(f\" Cohen's d: {cohens_d:.4f}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,421 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Flies Behavior Analysis Pipeline\n",
"\n",
"This notebook analyzes the behavior of trained vs untrained flies based on their distance patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\nfrom pathlib import Path\nimport sys\nimport os\n\n# Set up paths relative to notebook location\nPROJECT_ROOT = Path(\"..\").resolve()\nDATA_RAW = PROJECT_ROOT / \"data\" / \"raw\"\nDATA_METADATA = PROJECT_ROOT / \"data\" / \"metadata\"\nDATA_PROCESSED = PROJECT_ROOT / \"data\" / \"processed\"\nFIGURES = PROJECT_ROOT / \"figures\"\n\n# Add scripts to path for imports\nsys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n\n# Set plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Project root: {PROJECT_ROOT}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version: {np.__version__}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load existing CSV data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Load the pre-processed data\ntrained_data = pd.read_csv(DATA_PROCESSED / 'trained_roi_data.csv')\nuntrained_data = pd.read_csv(DATA_PROCESSED / 'untrained_roi_data.csv')\n\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\nprint(f\"Trained data columns: {list(trained_data.columns)}\")\nprint(f\"Untrained data columns: {list(untrained_data.columns)}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Align data using barrier opening time as time 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Load barrier opening data\nbarrier_data = pd.read_csv(DATA_METADATA / '2025_07_15_barrier_opening.csv')\nbarrier_data['opening_time_ms'] = barrier_data['opening_time'] * 1000\n\n# Create a dictionary mapping machine_name to opening time\nopening_times = dict(zip(barrier_data['machine'], barrier_data['opening_time_ms']))\nprint(\"Barrier opening times:\")\nprint(barrier_data)"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trained aligned data shape: (1166318, 13)\n",
"Untrained aligned data shape: (1130333, 13)\n"
]
}
],
"source": [
"def align_to_opening_time(df, opening_times):\n",
" \"\"\"Align data to barrier opening time\"\"\"\n",
" # Add aligned time column\n",
" df_aligned = df.copy()\n",
" df_aligned['aligned_time'] = np.nan\n",
" \n",
" # Align each machine's data\n",
" for machine in df['machine_name'].unique():\n",
" if machine in opening_times:\n",
" opening_time = opening_times[machine]\n",
" mask = df['machine_name'] == machine\n",
" df_aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_time\n",
" \n",
" # Remove rows where aligned_time is NaN\n",
" df_aligned = df_aligned.dropna(subset=['aligned_time'])\n",
" \n",
" return df_aligned\n",
"\n",
"# Align the data\n",
"trained_aligned = align_to_opening_time(trained_data, opening_times)\n",
"untrained_aligned = align_to_opening_time(untrained_data, opening_times)\n",
"\n",
"print(f\"Trained aligned data shape: {trained_aligned.shape}\")\n",
"print(f\"Untrained aligned data shape: {untrained_aligned.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Calculate median area size in rows where two flies are being tracked"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Median area size for time points with two flies: 1749.00\n"
]
}
],
"source": [
"def calculate_areas_with_two_flies(df):\n",
" \"\"\"Calculate median area size for time points with two flies\"\"\"\n",
" # Calculate area for each row\n",
" df['area'] = df['w'] * df['h']\n",
" \n",
" # Group by machine_name, ROI, and time to count flies per time point\n",
" fly_counts = df.groupby(['machine_name', 'ROI', 't']).size().reset_index(name='fly_count')\n",
" \n",
" # Filter for time points with exactly 2 flies\n",
" two_fly_times = fly_counts[fly_counts['fly_count'] == 2]\n",
" \n",
" # Merge back with original data to get areas for these time points\n",
" two_fly_data = pd.merge(df, two_fly_times[['machine_name', 'ROI', 't']], \n",
" on=['machine_name', 'ROI', 't'])\n",
" \n",
" # Calculate median area\n",
" median_area = two_fly_data['area'].median()\n",
" \n",
" return median_area, two_fly_data\n",
"\n",
"# Combine trained and untrained data for area calculation\n",
"combined_data = pd.concat([trained_aligned, untrained_aligned], ignore_index=True)\n",
"\n",
"# Calculate median area for time points with two flies\n",
"median_area, two_fly_data = calculate_areas_with_two_flies(combined_data)\n",
"print(f\"Median area size for time points with two flies: {median_area:.2f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Calculate distances taking into account area size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "recalculate_distances = False # Set to True if you want to recalculate\n\ntrained_dist_file = DATA_PROCESSED / 'trained_distances_aligned.csv'\nuntrained_dist_file = DATA_PROCESSED / 'untrained_distances_aligned.csv'\n\nif not recalculate_distances and trained_dist_file.exists() and untrained_dist_file.exists():\n print(\"Loading pre-calculated distance data from CSV files...\")\n trained_distances = pd.read_csv(trained_dist_file)\n untrained_distances = pd.read_csv(untrained_dist_file)\n print(f\"Trained distances shape: {trained_distances.shape}\")\n print(f\"Untrained distances shape: {untrained_distances.shape}\")\nelse:\n print(\"Calculating distances from scratch...\")\n def calculate_distances_with_area(df, median_area_threshold):\n \"\"\"Calculate distances between flies, setting to 0 for large single-fly detections\"\"\"\n df['area'] = df['w'] * df['h']\n results = []\n \n for (machine_name, roi, t), group in df.groupby(['machine_name', 'ROI', 'aligned_time']):\n group = group.sort_values('id').reset_index(drop=True)\n \n if len(group) >= 2:\n fly1 = group.iloc[0]\n fly2 = group.iloc[1]\n distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])\n \n results.append({\n 'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n 'distance': distance, 'n_flies': len(group),\n 'area_fly1': fly1['area'], 'area_fly2': fly2['area'],\n 'group': fly1['group']\n })\n elif len(group) == 1:\n fly = group.iloc[0]\n area = fly['area']\n distance = 0.0 if area > 1.5 * median_area_threshold else np.nan\n \n results.append({\n 'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n 'distance': distance, 'n_flies': 1,\n 'area_fly1': area, 'area_fly2': np.nan,\n 'group': fly['group']\n })\n \n return pd.DataFrame(results)\n \n trained_distances = calculate_distances_with_area(trained_aligned, median_area)\n untrained_distances = calculate_distances_with_area(untrained_aligned, median_area)\n \n print(f\"Trained distances shape: 
{trained_distances.shape}\")\n print(f\"Untrained distances shape: {untrained_distances.shape}\")\n \n trained_distances.to_csv(trained_dist_file, index=False)\n untrained_distances.to_csv(untrained_dist_file, index=False)\n print(\"Distance data saved to CSV files\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Plot averaged lines of trained vs untrained for the entire experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Remove NaN distances for plotting\ntrained_clean = trained_distances.dropna(subset=['distance'])\nuntrained_clean = untrained_distances.dropna(subset=['distance'])\n\n# Calculate average distance over time for each group\ntrained_avg = trained_clean.groupby('aligned_time')['distance'].mean()\nuntrained_avg = untrained_clean.groupby('aligned_time')['distance'].mean()\n\n# Apply smoothing using a rolling average\nwindow_size = 50\ntrained_smooth = trained_avg.rolling(window=window_size, center=True).mean()\nuntrained_smooth = untrained_avg.rolling(window=window_size, center=True).mean()\n\n# Create the plot\nplt.figure(figsize=(15, 8))\n\nplt.plot(trained_smooth.index/1000, trained_smooth.values, \n label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth.index/1000, untrained_smooth.values, \n label='Untrained (smoothed)', color='red', linewidth=2)\n\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies Over Entire Experiment')\nplt.legend()\nplt.grid(True, alpha=0.3)\n\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_entire_experiment.png', dpi=300, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Same plot but ending at time +300 seconds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Filter data to +300 seconds\ntrained_filtered = trained_clean[trained_clean['aligned_time'] <= 300000]\nuntrained_filtered = untrained_clean[untrained_clean['aligned_time'] <= 300000]\n\n# Calculate average distance over time for each group\ntrained_avg_300 = trained_filtered.groupby('aligned_time')['distance'].mean()\nuntrained_avg_300 = untrained_filtered.groupby('aligned_time')['distance'].mean()\n\n# Apply smoothing using a rolling average\ntrained_smooth_300 = trained_avg_300.rolling(window=window_size, center=True).mean()\nuntrained_smooth_300 = untrained_avg_300.rolling(window=window_size, center=True).mean()\n\n# Create the plot\nplt.figure(figsize=(15, 8))\n\nplt.plot(trained_smooth_300.index/1000, trained_smooth_300.values, \n label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth_300.index/1000, untrained_smooth_300.values, \n label='Untrained (smoothed)', color='red', linewidth=2)\n\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies (First 300 Seconds Post-Opening)')\nplt.legend()\nplt.grid(True, alpha=0.3)\nplt.xlim(-150, 300)\n\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_300_seconds.png', dpi=300, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Track fly identities and calculate meaningful velocity"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "def track_fly_identities_vectorized(df):\n \"\"\"Efficiently track fly identities across frames using vectorized operations\"\"\"\n from scipy.optimize import linear_sum_assignment\n \n df = df.sort_values(['machine_name', 'ROI', 'aligned_time']).reset_index(drop=True)\n tracked_flies = []\n \n for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']):\n group = group.sort_values('aligned_time').reset_index(drop=True)\n time_points = sorted(group['aligned_time'].unique())\n \n if len(time_points) < 2:\n continue\n \n fly_id_counter = 0\n fly_identities = {}\n \n first_time_flies = group[group['aligned_time'] == time_points[0]]\n for i in range(len(first_time_flies)):\n fly_identities[(time_points[0], i)] = fly_id_counter\n fly_id_counter += 1\n \n for t_idx in range(len(time_points) - 1):\n t1, t2 = time_points[t_idx], time_points[t_idx + 1]\n flies_t1 = group[group['aligned_time'] == t1].reset_index(drop=True)\n flies_t2 = group[group['aligned_time'] == t2].reset_index(drop=True)\n \n if len(flies_t1) == 0 or len(flies_t2) == 0:\n continue\n \n if len(flies_t1) == len(flies_t2):\n pos_t1 = flies_t1[['x', 'y']].values\n pos_t2 = flies_t2[['x', 'y']].values\n distances = np.sqrt(np.sum((pos_t1[:, np.newaxis] - pos_t2[np.newaxis, :])**2, axis=2))\n \n row_ind, col_ind = linear_sum_assignment(distances)\n \n for i, j in zip(row_ind, col_ind):\n prev_id = fly_identities[(t1, i)]\n fly_identities[(t2, j)] = prev_id\n else:\n min_flies = min(len(flies_t1), len(flies_t2))\n max_flies = max(len(flies_t1), len(flies_t2))\n \n for i in range(min_flies):\n prev_id = fly_identities[(t1, i)]\n fly_identities[(t2, i)] = prev_id\n \n for i in range(min_flies, max_flies):\n fly_identities[(t2, i)] = fly_id_counter\n fly_id_counter += 1\n \n for t in time_points:\n flies_at_t = group[group['aligned_time'] == t].reset_index(drop=True)\n for i, (idx, fly) in enumerate(flies_at_t.iterrows()):\n if (t, i) in fly_identities:\n fly_copy = fly.copy()\n 
fly_copy['fly_id'] = fly_identities[(t, i)]\n tracked_flies.append(fly_copy)\n \n return pd.DataFrame(tracked_flies)\n\n# Check if pre-calculated tracked identity files exist\nrecalculate_tracking = False # Set to True if you want to recalculate\n\ntrained_tracked_file = DATA_PROCESSED / 'trained_tracked.csv'\nuntrained_tracked_file = DATA_PROCESSED / 'untrained_tracked.csv'\n\nif not recalculate_tracking and trained_tracked_file.exists() and untrained_tracked_file.exists():\n print(\"Loading pre-calculated tracked identity data from CSV files...\")\n trained_tracked = pd.read_csv(trained_tracked_file)\n untrained_tracked = pd.read_csv(untrained_tracked_file)\n print(f\"Trained tracked data shape: {trained_tracked.shape}\")\n print(f\"Untrained tracked data shape: {untrained_tracked.shape}\")\nelse:\n print(\"Tracking fly identities (this may take a while...)\")\n trained_tracked = track_fly_identities_vectorized(trained_aligned)\n untrained_tracked = track_fly_identities_vectorized(untrained_aligned)\n \n trained_tracked.to_csv(trained_tracked_file, index=False)\n untrained_tracked.to_csv(untrained_tracked_file, index=False)\n print(\"Tracked identity data saved to CSV files\")\n \n print(f\"Trained tracked data shape: {trained_tracked.shape}\")\n print(f\"Untrained tracked data shape: {untrained_tracked.shape}\")"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculating velocities based on tracked identities...\n",
"Trained velocity data shape: (1160573, 15)\n",
"Untrained velocity data shape: (1125288, 15)\n"
]
}
],
"source": [
"def calculate_velocity_with_identities_vectorized(df):\n",
" \"\"\"Efficiently calculate velocity for each fly based on tracked identities\"\"\"\n",
" if df.empty:\n",
" return pd.DataFrame()\n",
" \n",
" # Sort by fly_id and time\n",
" df = df.sort_values(['machine_name', 'ROI', 'fly_id', 'aligned_time']).reset_index(drop=True)\n",
" \n",
" # Calculate velocities using groupby and vectorized operations\n",
" def calculate_group_velocity(group):\n",
" if len(group) < 2:\n",
" return pd.DataFrame()\n",
" \n",
" # Sort by time\n",
" group = group.sort_values('aligned_time').reset_index(drop=True)\n",
" \n",
" # Calculate differences using vectorized operations\n",
" time_diff = group['aligned_time'].diff() / 1000.0 # Convert ms to seconds\n",
" x_diff = group['x'].diff()\n",
" y_diff = group['y'].diff()\n",
" \n",
" # Calculate Euclidean distance (in pixels)\n",
" distance = np.sqrt(x_diff**2 + y_diff**2)\n",
" \n",
" # Calculate velocity (pixels per second)\n",
" velocity = distance / time_diff\n",
" \n",
" # Add to results (skip first row which has NaN velocity)\n",
" result = group.iloc[1:].copy()\n",
" result['velocity'] = velocity.iloc[1:].values\n",
" \n",
" return result\n",
" \n",
" # Apply the function to each group\n",
" velocity_groups = []\n",
" for (machine_name, roi, fly_id), group in df.groupby(['machine_name', 'ROI', 'fly_id']):\n",
" velocity_group = calculate_group_velocity(group)\n",
" if not velocity_group.empty:\n",
" velocity_groups.append(velocity_group)\n",
" \n",
" if velocity_groups:\n",
" return pd.concat(velocity_groups, ignore_index=True)\n",
" else:\n",
" return pd.DataFrame()\n",
"\n",
"# Calculate velocities for both groups\n",
"print(\"Calculating velocities based on tracked identities...\")\n",
"trained_velocity = calculate_velocity_with_identities_vectorized(trained_tracked)\n",
"untrained_velocity = calculate_velocity_with_identities_vectorized(untrained_tracked)\n",
"\n",
"print(f\"Trained velocity data shape: {trained_velocity.shape}\")\n",
"print(f\"Untrained velocity data shape: {untrained_velocity.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Analyze maximal velocity over a moving window of 10 seconds\ndef calculate_max_velocity_window_vectorized(df, window_size_seconds=10):\n \"\"\"Efficiently calculate maximal velocity over a moving window\"\"\"\n df_clean = df.dropna(subset=['velocity'])\n \n if df_clean.empty:\n return pd.DataFrame()\n \n window_size_ms = window_size_seconds * 1000\n results = []\n \n for (machine_name, roi), group in df_clean.groupby(['machine_name', 'ROI']):\n group = group.sort_values('aligned_time').reset_index(drop=True)\n \n if len(group) == 0:\n continue\n \n times = group['aligned_time'].values\n velocities = group['velocity'].values\n \n for i, current_time in enumerate(times):\n window_end = current_time + window_size_ms\n window_mask = (times >= current_time) & (times < window_end)\n window_velocities = velocities[window_mask]\n \n if len(window_velocities) > 0:\n max_velocity = np.max(window_velocities)\n results.append({\n 'machine_name': machine_name,\n 'ROI': roi,\n 'aligned_time': current_time,\n 'max_velocity': max_velocity,\n 'group': group.iloc[i]['group']\n })\n \n return pd.DataFrame(results)\n\n# Calculate max velocity over 10-second windows\ntrained_max_velocity = calculate_max_velocity_window_vectorized(trained_velocity, 10)\nuntrained_max_velocity = calculate_max_velocity_window_vectorized(untrained_velocity, 10)\n\nprint(f\"Trained max velocity data shape: {trained_max_velocity.shape}\")\nprint(f\"Untrained max velocity data shape: {untrained_max_velocity.shape}\")\n\n# Save velocity data to CSV\ntrained_max_velocity.to_csv(DATA_PROCESSED / 'trained_max_velocity.csv', index=False)\nuntrained_max_velocity.to_csv(DATA_PROCESSED / 'untrained_max_velocity.csv', index=False)\nprint(\"Max velocity data saved to CSV files\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Plot averaged max velocity over time for each group\n# Focus on the time period between 50 and 200 seconds where differences are largest\ntrained_velocity_filtered = trained_max_velocity[\n (trained_max_velocity['aligned_time'] >= 50000) & \n (trained_max_velocity['aligned_time'] <= 200000)\n]\nuntrained_velocity_filtered = untrained_max_velocity[\n (untrained_max_velocity['aligned_time'] >= 50000) & \n (untrained_max_velocity['aligned_time'] <= 200000)\n]\n\nif not trained_velocity_filtered.empty and not untrained_velocity_filtered.empty:\n trained_velocity_avg = trained_velocity_filtered.groupby('aligned_time')['max_velocity'].mean()\n untrained_velocity_avg = untrained_velocity_filtered.groupby('aligned_time')['max_velocity'].mean()\n\n velocity_window_size = 30\n trained_velocity_smooth = trained_velocity_avg.rolling(window=velocity_window_size, center=True).mean()\n untrained_velocity_smooth = untrained_velocity_avg.rolling(window=velocity_window_size, center=True).mean()\n\n plt.figure(figsize=(15, 8))\n\n plt.plot(trained_velocity_smooth.index/1000, trained_velocity_smooth.values, \n label='Trained (smoothed)', color='blue', linewidth=2)\n plt.plot(untrained_velocity_smooth.index/1000, untrained_velocity_smooth.values, \n label='Untrained (smoothed)', color='red', linewidth=2)\n\n plt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\n plt.xlabel('Time (seconds relative to barrier opening)')\n plt.ylabel('Average Max Velocity (pixels/second)')\n plt.title('Average Max Velocity Over 10-Second Windows (50-200 Seconds)')\n plt.legend()\n plt.grid(True, alpha=0.3)\n plt.xlim(50, 200)\n\n plt.tight_layout()\n plt.savefig(FIGURES / 'avg_max_velocity_50_200_seconds.png', dpi=300, bbox_inches='tight')\n plt.show()\nelse:\n print(\"Not enough data to plot velocity differences\")"
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== MAX VELOCITY STATISTICS (50-200 seconds) ===\n",
"Trained flies:\n",
" Mean max velocity: 6120.19 pixels/second\n",
" Std max velocity: 2913.85 pixels/second\n",
" Median max velocity: 7426.51 pixels/second\n",
"\n",
"Untrained flies:\n",
" Mean max velocity: 5710.33 pixels/second\n",
" Std max velocity: 2784.93 pixels/second\n",
" Median max velocity: 6876.77 pixels/second\n",
"\n",
"Statistical comparison (trained vs untrained):\n",
" T-statistic: 43.8194\n",
" P-value: 0.00e+00\n",
" Cohen's d: 0.1438\n"
]
}
],
"source": [
"# Summary statistics for max velocity in the 50-200 second window\n",
"if not trained_velocity_filtered.empty and not untrained_velocity_filtered.empty:\n",
" print(\"=== MAX VELOCITY STATISTICS (50-200 seconds) ===\")\n",
" trained_velocity_50_200 = trained_velocity_filtered['max_velocity']\n",
" untrained_velocity_50_200 = untrained_velocity_filtered['max_velocity']\n",
"\n",
" print(f\"Trained flies:\")\n",
" print(f\" Mean max velocity: {trained_velocity_50_200.mean():.2f} pixels/second\")\n",
" print(f\" Std max velocity: {trained_velocity_50_200.std():.2f} pixels/second\")\n",
" print(f\" Median max velocity: {trained_velocity_50_200.median():.2f} pixels/second\")\n",
"\n",
" print(f\"\\nUntrained flies:\")\n",
" print(f\" Mean max velocity: {untrained_velocity_50_200.mean():.2f} pixels/second\")\n",
" print(f\" Std max velocity: {untrained_velocity_50_200.std():.2f} pixels/second\")\n",
" print(f\" Median max velocity: {untrained_velocity_50_200.median():.2f} pixels/second\")\n",
"\n",
" # Statistical test\n",
" if len(trained_velocity_50_200) > 1 and len(untrained_velocity_50_200) > 1:\n",
" t_stat_vel, p_val_vel = stats.ttest_ind(trained_velocity_50_200, untrained_velocity_50_200)\n",
" cohens_d_vel = (trained_velocity_50_200.mean() - untrained_velocity_50_200.mean()) / \\\n",
" np.sqrt(((len(trained_velocity_50_200)-1)*trained_velocity_50_200.var() + \\\n",
" (len(untrained_velocity_50_200)-1)*untrained_velocity_50_200.var()) / \\\n",
" (len(trained_velocity_50_200) + len(untrained_velocity_50_200) - 2))\n",
"\n",
" print(f\"\\nStatistical comparison (trained vs untrained):\")\n",
" print(f\" T-statistic: {t_stat_vel:.4f}\")\n",
" print(f\" P-value: {p_val_vel:.2e}\")\n",
" print(f\" Cohen's d: {cohens_d_vel:.4f}\")\n",
" else:\n",
" print(\"\\nNot enough data for statistical test\")\n",
"else:\n",
" print(\"Not enough data for velocity statistics\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary Statistics"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== SUMMARY STATISTICS ===\n",
"Median area size for two-fly detections: 1749.00\n",
"\n",
"Pre-opening period (t < 0):\n",
" Trained mean distance: 156.05\n",
" Untrained mean distance: 147.69\n",
"\n",
"Post-opening period (t > 0):\n",
" Trained mean distance: 72.60\n",
" Untrained mean distance: 64.00\n",
"\n",
"Post-opening comparison (trained vs untrained):\n",
" T-statistic: 30.2455\n",
" P-value: 9.57e-201\n",
" Cohen's d: 0.0908\n"
]
}
],
"source": [
"print(\"=== SUMMARY STATISTICS ===\")\n",
"print(f\"Median area size for two-fly detections: {median_area:.2f}\")\n",
"\n",
"print(\"\\nPre-opening period (t < 0):\")\n",
"trained_pre = trained_clean[trained_clean['aligned_time'] < 0]['distance']\n",
"untrained_pre = untrained_clean[untrained_clean['aligned_time'] < 0]['distance']\n",
"print(f\" Trained mean distance: {trained_pre.mean():.2f}\")\n",
"print(f\" Untrained mean distance: {untrained_pre.mean():.2f}\")\n",
"\n",
"print(\"\\nPost-opening period (t > 0):\")\n",
"trained_post = trained_clean[trained_clean['aligned_time'] > 0]['distance']\n",
"untrained_post = untrained_clean[untrained_clean['aligned_time'] > 0]['distance']\n",
"print(f\" Trained mean distance: {trained_post.mean():.2f}\")\n",
"print(f\" Untrained mean distance: {untrained_post.mean():.2f}\")\n",
"\n",
"# Statistical test\n",
"t_stat, p_val = stats.ttest_ind(trained_post, untrained_post)\n",
"cohens_d = (trained_post.mean() - untrained_post.mean()) / np.sqrt(((len(trained_post)-1)*trained_post.var() + (len(untrained_post)-1)*untrained_post.var()) / (len(trained_post) + len(untrained_post) - 2))\n",
"\n",
"print(f\"\\nPost-opening comparison (trained vs untrained):\")\n",
"print(f\" T-statistic: {t_stat:.4f}\")\n",
"print(f\" P-value: {p_val:.2e}\")\n",
"print(f\" Cohen's d: {cohens_d:.4f}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

7
requirements.txt Normal file

@@ -0,0 +1,7 @@
numpy>=1.24
pandas>=2.0
matplotlib>=3.7
seaborn>=0.12
scipy>=1.10
scikit-learn>=1.3
jupyter>=1.0

0
scripts/__init__.py Normal file


@@ -0,0 +1,240 @@
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings

warnings.filterwarnings('ignore')

from config import DATA_PROCESSED, FIGURES


def load_and_combine_data():
    """Load and combine trained and untrained distance data.

    Returns:
        pd.DataFrame: Combined distance data with group labels.
    """
    trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
    untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
    trained_distances['group'] = 'trained'
    untrained_distances['group'] = 'untrained'
    combined_data = pd.concat([trained_distances, untrained_distances], ignore_index=True)
    combined_data = combined_data.dropna(subset=['distance'])
    print(f"Combined data shape: {combined_data.shape}")
    print(f"Trained samples: {len(combined_data[combined_data['group'] == 'trained'])}")
    print(f"Untrained samples: {len(combined_data[combined_data['group'] == 'untrained'])}")
    return combined_data


def basic_statistics(combined_data):
    """Perform basic statistical analysis.

    Args:
        combined_data (pd.DataFrame): Combined distance data.
    """
    print("\n=== BASIC STATISTICS ===")
    for group in ['trained', 'untrained']:
        group_data = combined_data[combined_data['group'] == group]['distance']
        print(f"\n{group.capitalize()} flies:")
        print(f"  Count: {len(group_data)}")
        print(f"  Mean distance: {group_data.mean():.2f}")
        print(f"  Std distance: {group_data.std():.2f}")
        print(f"  Median distance: {group_data.median():.2f}")
        print(f"  Min distance: {group_data.min():.2f}")
        print(f"  Max distance: {group_data.max():.2f}")

    trained_dist = combined_data[combined_data['group'] == 'trained']['distance']
    untrained_dist = combined_data[combined_data['group'] == 'untrained']['distance']
    t_stat, p_value = stats.ttest_ind(trained_dist, untrained_dist)
    print("\nT-test between groups:")
    print(f"  T-statistic: {t_stat:.4f}")
    print(f"  P-value: {p_value:.2e}")
    pooled_std = np.sqrt(((len(trained_dist) - 1) * trained_dist.std() ** 2 +
                          (len(untrained_dist) - 1) * untrained_dist.std() ** 2) /
                         (len(trained_dist) + len(untrained_dist) - 2))
    cohens_d = (trained_dist.mean() - untrained_dist.mean()) / pooled_std
    print(f"  Cohen's d (effect size): {cohens_d:.4f}")


def distance_distribution_analysis(combined_data):
    """Analyze distance distributions and create plots.

    Args:
        combined_data (pd.DataFrame): Combined distance data.
    """
    print("\n=== DISTANCE DISTRIBUTION ANALYSIS ===")
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Distance Distribution Analysis', fontsize=16)

    axes[0, 0].hist(combined_data[combined_data['group'] == 'trained']['distance'],
                    alpha=0.7, label='Trained', bins=50, density=True)
    axes[0, 0].hist(combined_data[combined_data['group'] == 'untrained']['distance'],
                    alpha=0.7, label='Untrained', bins=50, density=True)
    axes[0, 0].set_xlabel('Distance')
    axes[0, 0].set_ylabel('Density')
    axes[0, 0].set_title('Distance Distribution by Group')
    axes[0, 0].legend()

    combined_data.boxplot(column='distance', by='group', ax=axes[0, 1])
    axes[0, 1].set_title('Distance Box Plot by Group')
    axes[0, 1].set_xlabel('Group')
    axes[0, 1].set_ylabel('Distance')

    trained_dist = combined_data[combined_data['group'] == 'trained']['distance']
    untrained_dist = combined_data[combined_data['group'] == 'untrained']['distance']
    trained_sorted = np.sort(trained_dist)
    untrained_sorted = np.sort(untrained_dist)
    trained_cumulative = np.arange(1, len(trained_sorted) + 1) / len(trained_sorted)
    untrained_cumulative = np.arange(1, len(untrained_sorted) + 1) / len(untrained_sorted)
    axes[1, 0].plot(trained_sorted, trained_cumulative, label='Trained', alpha=0.7)
    axes[1, 0].plot(untrained_sorted, untrained_cumulative, label='Untrained', alpha=0.7)
    axes[1, 0].set_xlabel('Distance')
    axes[1, 0].set_ylabel('Cumulative Probability')
    axes[1, 0].set_title('Cumulative Distribution of Distances')
    axes[1, 0].legend()

    sns.violinplot(data=combined_data, x='group', y='distance', ax=axes[1, 1])
    axes[1, 1].set_title('Distance Violin Plot by Group')
    axes[1, 1].set_xlabel('Group')
    axes[1, 1].set_ylabel('Distance')

    plt.tight_layout()
    plt.savefig(FIGURES / 'distance_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("Distance distribution plots saved")


def clustering_analysis(combined_data):
    """Perform clustering analysis on distance data.

    Args:
        combined_data (pd.DataFrame): Combined distance data.

    Returns:
        tuple: (clustered_data, kmeans_model, scaler).
    """
    print("\n=== CLUSTERING ANALYSIS ===")
    features = ['distance', 'n_flies', 'area_fly1', 'area_fly2']
    X = combined_data[features].dropna()
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    k_range = range(2, 6)
    inertias = []
    sil_scores = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        inertias.append(kmeans.inertia_)
        sil_scores.append(silhouette_score(X_scaled, kmeans.labels_))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    ax1.plot(k_range, inertias, 'bo-')
    ax1.set_xlabel('Number of Clusters (k)')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method for Optimal k')
    ax2.plot(k_range, sil_scores, 'ro-')
    ax2.set_xlabel('Number of Clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Score for Different k')
    plt.tight_layout()
    plt.savefig(FIGURES / 'clustering_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()

    optimal_k = 2
    kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    X_clustered = X.copy()
    X_clustered['cluster'] = cluster_labels
    X_clustered['actual_group'] = combined_data.loc[X_clustered.index, 'group'].values

    confusion = pd.crosstab(X_clustered['cluster'], X_clustered['actual_group'])
    print(f"Clustering results (k={optimal_k}):")
    print(confusion)
    c0t = len(X_clustered[(X_clustered['cluster'] == 0) & (X_clustered['actual_group'] == 'trained')])
    c0u = len(X_clustered[(X_clustered['cluster'] == 0) & (X_clustered['actual_group'] == 'untrained')])
    c1t = len(X_clustered[(X_clustered['cluster'] == 1) & (X_clustered['actual_group'] == 'trained')])
    c1u = len(X_clustered[(X_clustered['cluster'] == 1) & (X_clustered['actual_group'] == 'untrained')])
    accuracy = max((c0t + c1u) / len(X_clustered), (c0u + c1t) / len(X_clustered))
    print(f"\nClustering accuracy: {accuracy:.4f}")

    print("\nCluster characteristics:")
    for i in range(optimal_k):
        cluster_data = X_clustered[X_clustered['cluster'] == i]
        print(f"\nCluster {i}:")
        print(f"  Size: {len(cluster_data)}")
        print(f"  Distance - Mean: {cluster_data['distance'].mean():.2f}, Std: {cluster_data['distance'].std():.2f}")
        print(f"  N_flies - Mean: {cluster_data['n_flies'].mean():.2f}")
        print(f"  Area_fly1 - Mean: {cluster_data['area_fly1'].mean():.2f}")
    return X_clustered, kmeans, scaler


def simple_classification_rule(combined_data):
    """Create a simple rule-based classifier.

    Args:
        combined_data (pd.DataFrame): Combined distance data.
    """
    print("\n=== SIMPLE RULE-BASED CLASSIFICATION ===")
    clean_data = combined_data.dropna(subset=['distance'])
    thresholds = np.percentile(clean_data['distance'], [25, 50, 75])
    print(f"Distance percentiles: 25%={thresholds[0]:.2f}, 50%={thresholds[1]:.2f}, 75%={thresholds[2]:.2f}")
    for threshold in thresholds:
        predictions = ['trained' if d > threshold else 'untrained'
                       for d in clean_data['distance']]
        actual = clean_data['group']
        accuracy = np.mean([p == a for p, a in zip(predictions, actual)])
        tp = sum([p == 'trained' and a == 'trained' for p, a in zip(predictions, actual)])
        tn = sum([p == 'untrained' and a == 'untrained' for p, a in zip(predictions, actual)])
        fp = sum([p == 'trained' and a == 'untrained' for p, a in zip(predictions, actual)])
        fn = sum([p == 'untrained' and a == 'trained' for p, a in zip(predictions, actual)])
        sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        print(f"\nThreshold = {threshold:.2f}:")
        print(f"  Accuracy: {accuracy:.4f}")
        print(f"  Sensitivity: {sensitivity:.4f}, Specificity: {specificity:.4f}")


def main():
    """Run the full distance analysis pipeline."""
    combined_data = load_and_combine_data()
    basic_statistics(combined_data)
    distance_distribution_analysis(combined_data)
    clustered_data, kmeans_model, scaler = clustering_analysis(combined_data)
    simple_classification_rule(combined_data)
    clustered_data.to_csv(DATA_PROCESSED / 'clustered_distance_data.csv', index=False)
    print("\n=== ANALYSIS COMPLETE ===")


if __name__ == "__main__":
    main()
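The pooled-standard-deviation Cohen's d above is computed inline, and the same multi-line expression recurs in the notebook and in the statistical-tests script. A small helper could consolidate it; this is a sketch, not part of the repo, and `cohens_d` is a hypothetical name. It uses `ddof=1` so the sample variance matches pandas' `Series.var()` default used in the scripts.

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-standard-deviation Cohen's d, matching the inline formula above."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sample variances (ddof=1), pooled with (n - 1) weights as in the scripts.
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(round(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]), 3))  # -> 0.5
```

Swapping the argument order flips the sign, which is why the scripts consistently pass trained first.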


@@ -0,0 +1,118 @@
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean

from config import DATA_PROCESSED


def calculate_fly_distances(trained_file=None, untrained_file=None):
    """Calculate distances between flies at each time point.

    For each time point:
    - If two flies are detected: calculate the Euclidean distance between them.
    - If one fly is detected: set distance to 0 if area > average area, otherwise NaN.

    Args:
        trained_file (Path): Path to trained ROI data CSV.
        untrained_file (Path): Path to untrained ROI data CSV.

    Returns:
        tuple: (trained_distances, untrained_distances) DataFrames.
    """
    if trained_file is None:
        trained_file = DATA_PROCESSED / 'trained_roi_data.csv'
    if untrained_file is None:
        untrained_file = DATA_PROCESSED / 'untrained_roi_data.csv'
    trained_df = pd.read_csv(trained_file)
    untrained_df = pd.read_csv(untrained_file)
    trained_df['area'] = trained_df['w'] * trained_df['h']
    untrained_df['area'] = untrained_df['w'] * untrained_df['h']
    avg_area = np.mean([trained_df['area'].mean(), untrained_df['area'].mean()])
    print(f"Average area across all data: {avg_area:.2f}")
    trained_distances = process_distance_data(trained_df, avg_area)
    untrained_distances = process_distance_data(untrained_df, avg_area)
    return trained_distances, untrained_distances


def process_distance_data(df, avg_area):
    """Process a DataFrame to calculate distances between flies at each time point.

    Args:
        df (pd.DataFrame): Input tracking data.
        avg_area (float): Average area threshold for single-fly detection.

    Returns:
        pd.DataFrame: Distance data with columns for machine, ROI, time, distance.
    """
    results = []
    for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']):
        for t, time_group in group.groupby('t'):
            time_group = time_group.sort_values('id').reset_index(drop=True)
            if len(time_group) >= 2:
                fly1 = time_group.iloc[0]
                fly2 = time_group.iloc[1]
                distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])
                results.append({
                    'machine_name': machine_name,
                    'ROI': roi,
                    't': t,
                    'distance': distance,
                    'n_flies': len(time_group),
                    'area_fly1': fly1['area'],
                    'area_fly2': fly2['area']
                })
            elif len(time_group) == 1:
                fly = time_group.iloc[0]
                area = fly['area']
                if area > avg_area:
                    distance = 0.0
                else:
                    distance = np.nan
                results.append({
                    'machine_name': machine_name,
                    'ROI': roi,
                    't': t,
                    'distance': distance,
                    'n_flies': 1,
                    'area_fly1': area,
                    'area_fly2': np.nan
                })
    return pd.DataFrame(results)


def main():
    """Run distance calculations and save results."""
    trained_distances, untrained_distances = calculate_fly_distances()
    print("Trained data distance summary:")
    print(f"  Shape: {trained_distances.shape}")
    print("  Distance stats:")
    print(f"    Count: {trained_distances['distance'].count()}")
    print(f"    Mean: {trained_distances['distance'].mean():.2f}")
    print(f"    Std: {trained_distances['distance'].std():.2f}")
    print("\nUntrained data distance summary:")
    print(f"  Shape: {untrained_distances.shape}")
    print("  Distance stats:")
    print(f"    Count: {untrained_distances['distance'].count()}")
    print(f"    Mean: {untrained_distances['distance'].mean():.2f}")
    print(f"    Std: {untrained_distances['distance'].std():.2f}")
    trained_distances.to_csv(DATA_PROCESSED / 'trained_distances.csv', index=False)
    untrained_distances.to_csv(DATA_PROCESSED / 'untrained_distances.csv', index=False)
    print("\nDistance data saved")


if __name__ == "__main__":
    main()
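The two-detections / one-detection rules in `process_distance_data` can be checked on a tiny synthetic frame. The sketch below mirrors the script's logic (with `np.hypot` standing in for `scipy.spatial.distance.euclidean`); the column names follow the script, while the machine name, coordinates, and area threshold are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two detections in ROI 1 at t=0 (a 3-4-5 triangle, so distance 5),
# one large detection at t=100 (area above threshold -> distance 0).
df = pd.DataFrame({
    'machine_name': ['E1', 'E1', 'E1'],
    'ROI': [1, 1, 1],
    't': [0, 0, 100],
    'id': [1, 2, 1],
    'x': [0.0, 3.0, 5.0],
    'y': [0.0, 4.0, 5.0],
    'area': [900.0, 900.0, 2000.0],
})
avg_area = 1000.0  # made-up threshold for the single-fly rule

rows = []
for (_, _), g in df.groupby(['machine_name', 'ROI']):
    for t, tg in g.groupby('t'):
        tg = tg.sort_values('id').reset_index(drop=True)
        if len(tg) >= 2:
            d = np.hypot(tg.loc[0, 'x'] - tg.loc[1, 'x'],
                         tg.loc[0, 'y'] - tg.loc[1, 'y'])
        else:
            d = 0.0 if tg.loc[0, 'area'] > avg_area else np.nan
        rows.append({'t': t, 'distance': d})

out = pd.DataFrame(rows)
print(out)  # expect distance 5.0 at t=0 and 0.0 at t=100
```

The distance-0 convention for one large blob encodes "flies touching", which is why NaN (lost fly) must stay distinct from 0.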

9
scripts/config.py Normal file

@@ -0,0 +1,9 @@
"""Shared path constants for the Cupido tracking project."""
from pathlib import Path
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"

90
scripts/load_roi_data.py Normal file

@@ -0,0 +1,90 @@
import pandas as pd
import sqlite3
import re

from config import DATA_RAW, DATA_METADATA, DATA_PROCESSED


def load_roi_data():
    """Load ROI data from SQLite databases and group by trained/untrained.

    Returns:
        tuple: (trained_df, untrained_df) DataFrames with tracking data.
    """
    metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')
    metadata['machine_name'] = metadata['machine_name'].astype(str)
    trained_rois = metadata[metadata['group'] == 'trained']
    untrained_rois = metadata[metadata['group'] == 'untrained']

    db_files = list(DATA_RAW.glob('*_tracking.db'))
    trained_df = pd.DataFrame()
    untrained_df = pd.DataFrame()
    for db_file in db_files:
        print(f"Processing {db_file.name}")
        pattern = r'_([0-9a-f]{32})__'
        match = re.search(pattern, db_file.name)
        if not match:
            print(f"Could not extract UUID from {db_file.name}")
            continue
        uuid = match.group(1)
        metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]
        if metadata_matches.empty:
            print(f"No metadata matches found for UUID {uuid} from {db_file.name}")
            continue
        machine_id = metadata_matches.iloc[0]['machine_name']
        print(f"Matched to machine ID: {machine_id}")

        conn = sqlite3.connect(str(db_file))
        machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]
        machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]
        for _, row in machine_trained.iterrows():
            roi = row['ROI']
            try:
                query = f"SELECT * FROM ROI_{roi}"
                roi_data = pd.read_sql_query(query, conn)
                roi_data['machine_name'] = machine_id
                roi_data['ROI'] = roi
                roi_data['group'] = 'trained'
                trained_df = pd.concat([trained_df, roi_data], ignore_index=True)
            except Exception as e:
                print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
        for _, row in machine_untrained.iterrows():
            roi = row['ROI']
            try:
                query = f"SELECT * FROM ROI_{roi}"
                roi_data = pd.read_sql_query(query, conn)
                roi_data['machine_name'] = machine_id
                roi_data['ROI'] = roi
                roi_data['group'] = 'untrained'
                untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)
            except Exception as e:
                print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
        conn.close()
    return trained_df, untrained_df


if __name__ == "__main__":
    trained_data, untrained_data = load_roi_data()
    print(f"Trained data shape: {trained_data.shape}")
    print(f"Untrained data shape: {untrained_data.shape}")
    if not trained_data.empty:
        print("Trained data columns:", trained_data.columns.tolist())
    if not untrained_data.empty:
        print("Untrained data columns:", untrained_data.columns.tolist())
    trained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)
    untrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)
    print("Data saved to trained_roi_data.csv and untrained_roi_data.csv")
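The UUID matching in `load_roi_data` hinges on the `_<32 hex chars>__` pattern in ethoscope DB filenames. A quick check on a made-up filename (only the shape matters to the regex; the name itself is hypothetical):

```python
import re

# Hypothetical DB filename: a 32-hex-char UUID between "_" and "__".
name = ("ETHOSCOPE_001_"
        "0123456789abcdef0123456789abcdef"
        "__2025-07-15_tracking.db")
match = re.search(r'_([0-9a-f]{32})__', name)
print(match.group(1))  # -> 0123456789abcdef0123456789abcdef
```

Shorter hex runs (like the "001" in the prefix) cannot match because the group requires exactly 32 hex characters followed by a double underscore.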


@@ -0,0 +1,97 @@
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from config import DATA_PROCESSED, FIGURES
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
# Add group labels
trained_distances['group'] = 'trained'
untrained_distances['group'] = 'untrained'
# Combine data
combined_data = pd.concat([trained_distances, untrained_distances], ignore_index=True)
combined_data = combined_data.dropna(subset=['group'])
# Prepare features and target
features = ['distance', 'n_flies', 'area_fly1', 'area_fly2']
X = combined_data[features]
y = combined_data['group']
# Handle missing values in features
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=features)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("=== MACHINE LEARNING CLASSIFICATION ===")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
# 1. Logistic Regression
print("\n1. Logistic Regression:")
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)
lr_predictions = lr_model.predict(X_test_scaled)
lr_accuracy = accuracy_score(y_test, lr_predictions)
print(f"Accuracy: {lr_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, lr_predictions))
# 2. Random Forest
print("\n2. Random Forest:")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Accuracy: {rf_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, rf_predictions))
# Feature importance
print("\nFeature Importance (Random Forest):")
feature_importance = pd.DataFrame({
'feature': features,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance)
# Confusion matrix for the best model
best_model_name = "Random Forest" if rf_accuracy > lr_accuracy else "Logistic Regression"
best_predictions = rf_predictions if rf_accuracy > lr_accuracy else lr_predictions
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, best_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Trained', 'Untrained'],
yticklabels=['Trained', 'Untrained'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.savefig(FIGURES / 'confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
# Cross-validation scores
print("\n=== CROSS-VALIDATION SCORES ===")
lr_cv_scores = cross_val_score(LogisticRegression(random_state=42), X_train_scaled, y_train, cv=5)
rf_cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X_train, y_train, cv=5)
print(f"Logistic Regression CV Score: {lr_cv_scores.mean():.4f} (+/- {lr_cv_scores.std() * 2:.4f})")
print(f"Random Forest CV Score: {rf_cv_scores.mean():.4f} (+/- {rf_cv_scores.std() * 2:.4f})")


@@ -0,0 +1,101 @@
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from config import DATA_PROCESSED, DATA_METADATA, FIGURES
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
barrier_data = pd.read_csv(DATA_METADATA / '2025_07_15_barrier_opening.csv')
# Convert opening_time to milliseconds and create a mapping
barrier_data['opening_time_ms'] = barrier_data['opening_time'] * 1000
opening_times = dict(zip(barrier_data['machine'], barrier_data['opening_time_ms']))
def align_to_opening_time(df, opening_times, max_time=300000):
    """Align distance data to barrier opening time.

    Args:
        df (pd.DataFrame): Distance data.
        opening_times (dict): Machine to opening time mapping.
        max_time (int): Maximum time in ms to include.

    Returns:
        pd.DataFrame: Aligned data filtered to +/-150s around opening.
    """
    df_aligned = df.copy()
    df_aligned['aligned_time'] = np.nan
    for machine in df['machine_name'].unique():
        if machine in opening_times:
            opening_time = opening_times[machine]
            mask = (df['machine_name'] == machine) & (df['t'] <= max_time)
            df_aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_time
    df_aligned = df_aligned.dropna(subset=['aligned_time'])
    df_aligned = df_aligned[(df_aligned['aligned_time'] >= -150000) &
                            (df_aligned['aligned_time'] <= 150000)]
    return df_aligned
# Align the data
trained_aligned = align_to_opening_time(trained_distances, opening_times)
untrained_aligned = align_to_opening_time(untrained_distances, opening_times)
# Calculate average distance over aligned time
trained_avg = trained_aligned.groupby('aligned_time')['distance'].mean()
untrained_avg = untrained_aligned.groupby('aligned_time')['distance'].mean()
# Apply smoothing
window_size = 50
trained_smooth = trained_avg.rolling(window=window_size, center=True).mean()
untrained_smooth = untrained_avg.rolling(window=window_size, center=True).mean()
# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(trained_smooth.index/1000, trained_smooth.values,
label='Trained (smoothed)', color='blue', linewidth=2)
plt.plot(untrained_smooth.index/1000, untrained_smooth.values,
label='Untrained (smoothed)', color='red', linewidth=2)
plt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')
plt.xlabel('Time (seconds relative to barrier opening)')
plt.ylabel('Average Distance')
plt.title('Average Distance Between Flies Aligned to Barrier Opening Time')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-150, 150)
plt.tight_layout()
plt.savefig(FIGURES / 'avg_distance_aligned_to_opening.png', dpi=300, bbox_inches='tight')
plt.show()
# Print statistics
print("Trained flies (aligned to barrier opening):")
print(f" Data points: {len(trained_aligned)}")
print(f" Mean distance: {trained_aligned['distance'].mean():.2f}")
print(f" Std distance: {trained_aligned['distance'].std():.2f}")
print("\nUntrained flies (aligned to barrier opening):")
print(f" Data points: {len(untrained_aligned)}")
print(f" Mean distance: {untrained_aligned['distance'].mean():.2f}")
print(f" Std distance: {untrained_aligned['distance'].std():.2f}")
# Pre/post analysis
trained_pre = trained_aligned[trained_aligned['aligned_time'] < 0]
trained_post = trained_aligned[trained_aligned['aligned_time'] > 0]
untrained_pre = untrained_aligned[untrained_aligned['aligned_time'] < 0]
untrained_post = untrained_aligned[untrained_aligned['aligned_time'] > 0]
print("\nPre-opening period (t < 0):")
print(f" Trained mean distance: {trained_pre['distance'].mean():.2f}")
print(f" Untrained mean distance: {untrained_pre['distance'].mean():.2f}")
print("\nPost-opening period (t > 0):")
print(f" Trained mean distance: {trained_post['distance'].mean():.2f}")
print(f" Untrained mean distance: {untrained_post['distance'].mean():.2f}")
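The core of `align_to_opening_time` is a per-machine subtraction, with machines missing from the barrier CSV falling out at the `dropna`. A toy check of that arithmetic, using invented machine names and an arbitrary 60 s (60000 ms) opening time:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'machine_name': ['E1', 'E1', 'E2'],
                   't': [30000, 90000, 90000]})
opening_times = {'E1': 60000}  # E2 has no recorded opening and is dropped

aligned = df.copy()
aligned['aligned_time'] = np.nan
for machine in df['machine_name'].unique():
    if machine in opening_times:
        mask = df['machine_name'] == machine
        aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_times[machine]
aligned = aligned.dropna(subset=['aligned_time'])

print(aligned['aligned_time'].tolist())  # -> [-30000.0, 30000.0]
```

Samples before the opening land at negative `aligned_time`, which is what the pre/post split below keys on.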


@@ -0,0 +1,51 @@
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from config import DATA_PROCESSED, FIGURES
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
# Remove NaN distances and filter for first 200 seconds
trained_clean = trained_distances.dropna(subset=['distance'])
untrained_clean = untrained_distances.dropna(subset=['distance'])
trained_filtered = trained_clean[trained_clean['t'] <= 200000]
untrained_filtered = untrained_clean[untrained_clean['t'] <= 200000]
# Calculate average distance over time
trained_avg = trained_filtered.groupby('t')['distance'].mean()
untrained_avg = untrained_filtered.groupby('t')['distance'].mean()
# Apply smoothing
window_size = 50
trained_smooth = trained_avg.rolling(window=window_size, center=True).mean()
untrained_smooth = untrained_avg.rolling(window=window_size, center=True).mean()
# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(trained_smooth.index/1000, trained_smooth.values,
label='Trained (smoothed)', color='blue', linewidth=2)
plt.plot(untrained_smooth.index/1000, untrained_smooth.values,
label='Untrained (smoothed)', color='red', linewidth=2)
plt.xlabel('Time (seconds)')
plt.ylabel('Average Distance')
plt.title('Average Distance Between Flies Over Time (First 200 Seconds)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(FIGURES / 'avg_distance_over_time_first_200s.png', dpi=300, bbox_inches='tight')
plt.show()
print("Trained flies (first 200 seconds):")
print(f" Mean distance: {trained_filtered['distance'].mean():.2f}")
print(f" Std distance: {trained_filtered['distance'].std():.2f}")
print("\nUntrained flies (first 200 seconds):")
print(f" Mean distance: {untrained_filtered['distance'].mean():.2f}")
print(f" Std distance: {untrained_filtered['distance'].std():.2f}")


@@ -0,0 +1,43 @@
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from config import DATA_PROCESSED, FIGURES
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
# Remove NaN distances
trained_clean = trained_distances.dropna(subset=['distance'])
untrained_clean = untrained_distances.dropna(subset=['distance'])
# Calculate average distance over time
trained_avg = trained_clean.groupby('t')['distance'].mean()
untrained_avg = untrained_clean.groupby('t')['distance'].mean()
# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(trained_avg.index, trained_avg.values,
label='Trained (avg)', color='blue', linewidth=1)
plt.plot(untrained_avg.index, untrained_avg.values,
label='Untrained (avg)', color='red', linewidth=1)
plt.xlabel('Time')
plt.ylabel('Average Distance')
plt.title('Average Distance Between Flies Over Time by Group')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(FIGURES / 'avg_distance_over_time.png', dpi=300, bbox_inches='tight')
plt.show()
print("Trained flies:")
print(f" Mean distance: {trained_clean['distance'].mean():.2f}")
print(f" Std distance: {trained_clean['distance'].std():.2f}")
print("\nUntrained flies:")
print(f" Mean distance: {untrained_clean['distance'].mean():.2f}")
print(f" Std distance: {untrained_clean['distance'].std():.2f}")


@@ -0,0 +1,50 @@
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from config import DATA_PROCESSED, FIGURES
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
# Remove NaN distances
trained_clean = trained_distances.dropna(subset=['distance'])
untrained_clean = untrained_distances.dropna(subset=['distance'])
# Create the plot
plt.figure(figsize=(12, 6))
# Sample 1000 points from each group to avoid overcrowding
if len(trained_clean) > 1000:
    trained_sample = trained_clean.sample(1000, random_state=42)
else:
    trained_sample = trained_clean
if len(untrained_clean) > 1000:
    untrained_sample = untrained_clean.sample(1000, random_state=42)
else:
    untrained_sample = untrained_clean
plt.scatter(trained_sample['t'], trained_sample['distance'],
alpha=0.5, s=1, label='Trained', color='blue')
plt.scatter(untrained_sample['t'], untrained_sample['distance'],
alpha=0.5, s=1, label='Untrained', color='red')
plt.xlabel('Time')
plt.ylabel('Distance')
plt.title('Distance Between Flies Over Time')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(FIGURES / 'distance_over_time.png', dpi=300, bbox_inches='tight')
plt.show()
print("Trained flies:")
print(f" Mean distance: {trained_clean['distance'].mean():.2f}")
print(f" Std distance: {trained_clean['distance'].std():.2f}")
print("\nUntrained flies:")
print(f" Mean distance: {untrained_clean['distance'].mean():.2f}")
print(f" Std distance: {untrained_clean['distance'].std():.2f}")


@@ -0,0 +1,90 @@
import pandas as pd
import numpy as np
from scipy import stats
from config import DATA_PROCESSED, DATA_METADATA
# Load data
trained_distances = pd.read_csv(DATA_PROCESSED / 'trained_distances.csv')
untrained_distances = pd.read_csv(DATA_PROCESSED / 'untrained_distances.csv')
barrier_data = pd.read_csv(DATA_METADATA / '2025_07_15_barrier_opening.csv')
# Convert opening_time to milliseconds and create a mapping
barrier_data['opening_time_ms'] = barrier_data['opening_time'] * 1000
opening_times = dict(zip(barrier_data['machine'], barrier_data['opening_time_ms']))
def align_to_opening_time(df, opening_times):
    """Align distance data to barrier opening time.

    Args:
        df (pd.DataFrame): Distance data with machine_name and t columns.
        opening_times (dict): Mapping of machine ID to opening time in ms.

    Returns:
        pd.DataFrame: Data with aligned_time column added.
    """
    df_aligned = df.copy()
    df_aligned['aligned_time'] = np.nan
    for machine in df['machine_name'].unique():
        if machine in opening_times:
            opening_time = opening_times[machine]
            mask = df['machine_name'] == machine
            df_aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_time
    df_aligned = df_aligned.dropna(subset=['aligned_time'])
    return df_aligned
# Align the data
trained_aligned = align_to_opening_time(trained_distances, opening_times)
untrained_aligned = align_to_opening_time(untrained_distances, opening_times)
# Remove NaN distances
trained_clean = trained_aligned.dropna(subset=['distance'])
untrained_clean = untrained_aligned.dropna(subset=['distance'])
# Split into pre- and post-opening periods
trained_pre = trained_clean[trained_clean['aligned_time'] < 0]['distance']
trained_post = trained_clean[trained_clean['aligned_time'] > 0]['distance']
untrained_pre = untrained_clean[untrained_clean['aligned_time'] < 0]['distance']
untrained_post = untrained_clean[untrained_clean['aligned_time'] > 0]['distance']
def cohens_d(a, b):
    """Cohen's d (a minus b) using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var() + (len(b) - 1) * b.var()) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# NOTE: these tests treat every time point as independent, which overstates
# significance for correlated within-ROI data; see tasks/lessons.md on
# pseudoreplication.
print("=== STATISTICAL TESTS ===")
# Pre-opening period comparison
t_stat_pre, p_val_pre = stats.ttest_ind(trained_pre, untrained_pre)
cohens_d_pre = cohens_d(trained_pre, untrained_pre)
print("Pre-opening period:")
print(f"  Trained mean: {trained_pre.mean():.2f}, Untrained mean: {untrained_pre.mean():.2f}")
print(f"  T-statistic: {t_stat_pre:.4f}, P-value: {p_val_pre:.2e}")
print(f"  Cohen's d: {cohens_d_pre:.4f}")
# Post-opening period comparison
t_stat_post, p_val_post = stats.ttest_ind(trained_post, untrained_post)
cohens_d_post = cohens_d(trained_post, untrained_post)
print("\nPost-opening period:")
print(f"  Trained mean: {trained_post.mean():.2f}, Untrained mean: {untrained_post.mean():.2f}")
print(f"  T-statistic: {t_stat_post:.4f}, P-value: {p_val_post:.2e}")
print(f"  Cohen's d: {cohens_d_post:.4f}")
# Within-group comparisons (post vs pre, so t-statistic and d share a sign)
t_stat_trained, p_val_trained = stats.ttest_ind(trained_post, trained_pre)
cohens_d_trained = cohens_d(trained_post, trained_pre)
t_stat_untrained, p_val_untrained = stats.ttest_ind(untrained_post, untrained_pre)
cohens_d_untrained = cohens_d(untrained_post, untrained_pre)
print("\nWithin-group changes:")
print("  Trained flies - Pre vs Post:")
print(f"    Mean change: {trained_post.mean() - trained_pre.mean():.2f}")
print(f"    T-statistic: {t_stat_trained:.4f}, P-value: {p_val_trained:.2e}")
print(f"    Cohen's d: {cohens_d_trained:.4f}")
print("  Untrained flies - Pre vs Post:")
print(f"    Mean change: {untrained_post.mean() - untrained_pre.mean():.2f}")
print(f"    T-statistic: {t_stat_untrained:.4f}, P-value: {p_val_untrained:.2e}")
print(f"    Cohen's d: {cohens_d_untrained:.4f}")

tasks/lessons.md Normal file

@ -0,0 +1,38 @@
# Lessons Learned
## Pseudoreplication Pitfall
**The most important lesson in this project.**
The raw data has ~230K data points per group, but the true independent samples are ROIs (N=18 per group). Each ROI contributes thousands of correlated time points. Running t-tests on all data points inflates significance massively (p < 1e-200) while the actual effect size is negligible (Cohen's d = 0.09).
**Rule**: Always compute per-ROI summary statistics first, then compare groups at the ROI level.
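The rule can be sketched with a toy simulation: points are correlated within each ROI, so each ROI is collapsed to a single mean before testing. The `roi` column name and all numbers here are illustrative, not project data:

```python
import numpy as np
import pandas as pd
from scipy import stats

def roi_level_ttest(trained, untrained, value_col="distance", roi_col="roi"):
    """Collapse each ROI to one mean, then t-test at the ROI level."""
    t_means = trained.groupby(roi_col)[value_col].mean()
    u_means = untrained.groupby(roi_col)[value_col].mean()
    t, p = stats.ttest_ind(t_means, u_means)
    return t_means, u_means, t, p

# Toy data: 3 ROIs per group, 1000 correlated points each
rng = np.random.default_rng(0)
frames = []
for group, offset in [("trained", 0.0), ("untrained", 0.1)]:
    for roi in range(3):
        roi_baseline = rng.normal(offset, 1.0)  # ROI-to-ROI variation dominates
        frames.append(pd.DataFrame({
            "group": group,
            "roi": f"{group}_{roi}",
            "distance": roi_baseline + rng.normal(0, 0.1, 1000),
        }))
df = pd.concat(frames, ignore_index=True)

t_means, u_means, t, p = roi_level_ttest(
    df[df["group"] == "trained"], df[df["group"] == "untrained"]
)
print(len(t_means), len(u_means))  # 3 and 3: the real per-group sample sizes
```

The p-value is now computed from 3 independent values per group rather than 3000 correlated ones.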
## Significance vs Effect Size
A tiny p-value does NOT mean a meaningful difference. With N=230K, even a Cohen's d of 0.09 (96% overlap between distributions) gives p < 1e-200. Always report and interpret effect sizes alongside p-values.
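A quick simulation makes this concrete (numbers match the scale described above but are simulated, not project data): with ~230K points per group, a true Cohen's d of 0.09 already drives p far below any conventional threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 230_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.09, 1.0, n)  # true Cohen's d = 0.09

t, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = abs(a.mean() - b.mean()) / pooled_sd
print(f"p = {p:.2e}, Cohen's d = {d:.3f}")  # vanishingly small p, negligible d
```

The distributions overlap almost completely, yet the p-value alone would suggest an overwhelming effect.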
## Data Type Mismatches
Machine names are stored as integers in metadata (76, 145, 268) but as strings in some contexts, and barrier_opening.csv uses the zero-padded "076" form. A plain `.astype(str)` turns 76 into "76", which will not match "076" if the CSV column is read as strings — zero-pad both sides (e.g. `.astype(str).str.zfill(3)`) before matching.
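A minimal pandas sketch of the matching fix (toy frames, assuming the CSV column is read as strings rather than inferred as integers):

```python
import pandas as pd

meta = pd.DataFrame({"machine": [76, 145, 268]})           # integers, as in metadata
barrier = pd.DataFrame({"machine": ["076", "145", "268"]})  # zero-padded strings

# Plain astype(str) turns 76 into "76", which does not match "076";
# zero-pad to three digits so both sides share the same key format.
meta["machine_key"] = meta["machine"].astype(str).str.zfill(3)
merged = meta.merge(barrier, left_on="machine_key", right_on="machine")
print(len(merged))  # all 3 machines match
```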
## Time Unit Mismatches
- SQLite databases: time `t` is in **milliseconds**
- `2025_07_15_barrier_opening.csv`: `opening_time` is in **seconds**
- Must multiply barrier opening times by 1000 before aligning
## Missing Data
Machine 139 has 6 ROIs in the metadata (3 trained, 3 untrained) but:
- No tracking database file exists
- No entry in barrier_opening.csv
- This reduces the effective N from 18 to 15 per group
## Single-Fly Detection Handling
When only one fly is detected (instead of two), the tracker reports a single bounding box. If the area of that box is large (>1.5x median two-fly area), it likely means the flies are overlapping (distance ~0). If the area is small, one fly is probably out of frame (distance = NaN, excluded from analysis).
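The heuristic above can be written as a small helper. The function name, the 1.5x ratio (taken from the text), and the example areas are illustrative:

```python
import numpy as np

def infer_single_detection_distance(box_area, median_two_fly_area, ratio=1.5):
    """Heuristic for frames where the tracker reports a single bounding box.

    Large box (> ratio * median two-fly area): flies likely overlap -> distance 0.
    Small box: one fly is probably out of frame -> distance unknown (NaN).
    """
    if box_area > ratio * median_two_fly_area:
        return 0.0
    return np.nan

median_area = 400.0  # hypothetical median area when both flies are boxed
print(infer_single_detection_distance(700.0, median_area))  # 700 > 600 -> 0.0
print(infer_single_detection_distance(250.0, median_area))  # small box -> nan
```

NaN rows are then dropped by the existing `dropna(subset=['distance'])` step, matching the "excluded from analysis" behaviour.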
## Path Management
All scripts use `from config import DATA_PROCESSED, FIGURES, ...` for consistent paths. Notebooks use `Path("..")` relative to the `notebooks/` directory. Never use hardcoded absolute paths.
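A minimal sketch of what `scripts/config.py` can look like under the documented layout. Only `DATA_PROCESSED`, `DATA_METADATA`, and `FIGURES` appear in the scripts shown here; `DATA_RAW` and `PROJECT_ROOT` are assumed names:

```python
# scripts/config.py - single source of truth for project paths
from pathlib import Path

# Resolve relative to this file so scripts work from any working directory
PROJECT_ROOT = Path(__file__).resolve().parent.parent

DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"
```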

tasks/todo.md Normal file

@ -0,0 +1,56 @@
# Task List
## Completed Work
- [x] Extract ROI data from SQLite databases grouped by trained/untrained
- [x] Calculate inter-fly distances at each time point
- [x] Align data to barrier opening time (t=0)
- [x] Plot average distance over time (entire experiment + 300s window)
- [x] Track fly identities across frames (Hungarian algorithm)
- [x] Calculate max velocity over 10-second moving windows
- [x] Statistical tests (t-tests, Cohen's d) comparing groups
- [x] ML classification attempt (Logistic Regression, Random Forest)
- [x] Clustering analysis (K-means)
- [x] Organize project structure for student handoff
## Priority: Bimodal Hypothesis Analysis
See `docs/bimodal_hypothesis.md` for detailed methodology.
### Phase 1: Per-ROI Feature Extraction
- [ ] Compute per-ROI summary statistics from aligned distance data
- Mean distance post-opening (0-300s)
- Median distance post-opening
- Fraction of time at distance < 50px ("close proximity")
- Mean max velocity post-opening
- [ ] Create a summary DataFrame with N=18 trained + N=18 untrained rows
- [ ] **Note**: Only 30 ROIs have data (Machine 139 missing = 6 ROIs lost)
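Phase 1 can be sketched as a single `groupby` over the aligned data. The `roi` column name is an assumption (the real ROI identifier may differ), velocity features are omitted, and the toy frame below is illustrative:

```python
import pandas as pd

def per_roi_features(df, close_px=50, window_ms=(0, 300_000)):
    """Collapse aligned distance data to one feature row per ROI.

    Assumes columns: machine_name, roi (hypothetical ROI id),
    aligned_time (ms), distance (px). Restricts to 0-300 s post-opening.
    """
    post = df[df["aligned_time"].between(*window_ms)]
    return post.groupby(["machine_name", "roi"])["distance"].agg(
        mean_distance="mean",
        median_distance="median",
        frac_close=lambda d: (d < close_px).mean(),
    ).reset_index()

# Toy data: two ROIs on one machine
toy = pd.DataFrame({
    "machine_name": ["076"] * 6,
    "roi": [1, 1, 1, 2, 2, 2],
    "aligned_time": [10_000, 20_000, 400_000, 10_000, 20_000, 30_000],
    "distance": [40.0, 60.0, 10.0, 100.0, 120.0, 80.0],
})
features = per_roi_features(toy)
print(features)  # ROI 1: one point falls outside the 0-300 s window
```

The resulting frame has one row per ROI, which is the unit of analysis for all later phases.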
### Phase 2: Distribution Visualization
- [ ] Plot histograms/KDE of per-ROI metrics for each group
- [ ] Look for bimodality in trained group vs unimodality in untrained
### Phase 3: Formal Bimodality Testing
- [ ] Hartigan's dip test on trained per-ROI distributions
- [ ] Fit Gaussian Mixture Models (1 vs 2 components) to trained data
- [ ] Compare BIC scores to determine optimal number of components
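The GMM/BIC step in Phase 3 can be sketched as below. The toy data is deliberately well separated; with only ~15 real ROIs per group, BIC comparisons will be much noisier:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_n_components(values, max_components=2, random_state=0):
    """Fit GMMs with 1..max_components components; return (best_n, BIC list)."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    bics = [
        GaussianMixture(n_components=k, random_state=random_state).fit(X).bic(X)
        for k in range(1, max_components + 1)
    ]
    return int(np.argmin(bics)) + 1, bics  # lower BIC is better

# Clearly bimodal toy data: two well-separated modes
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(30, 3, 50), rng.normal(120, 3, 50)])
n, bics = best_n_components(values)
print(n)  # 2 components win on BIC for this toy data
```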
### Phase 4: Subgroup Identification
- [ ] If bimodal: classify trained ROIs as "learner" vs "non-learner" using GMM posteriors
- [ ] Compare learner subgroup vs untrained group (expect larger effect size)
### Phase 5: Effect Size Re-estimation
- [ ] Mann-Whitney U test (appropriate for small N)
- [ ] Bootstrap confidence intervals for effect sizes
- [ ] Account for session as random effect
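The first two Phase 5 items can be sketched as follows (per-ROI values are simulated placeholders; the session random effect is not modelled here and would need e.g. a mixed model):

```python
import numpy as np
from scipy import stats

def bootstrap_mean_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b); suited to small N."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    diffs = np.array([
        rng.choice(a, len(a)).mean() - rng.choice(b, len(b)).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# Hypothetical per-ROI means (N=15 per group after dropping Machine 139)
rng = np.random.default_rng(1)
trained = rng.normal(80, 20, 15)
untrained = rng.normal(100, 20, 15)

u, p = stats.mannwhitneyu(trained, untrained, alternative="two-sided")
lo, hi = bootstrap_mean_diff_ci(trained, untrained)
print(f"U={u:.0f}, p={p:.3f}, 95% CI for mean diff: [{lo:.1f}, {hi:.1f}]")
```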
## Maintenance Items
- [ ] Investigate missing Machine 139 data (has metadata but no tracking DB)
- [ ] Add `diptest` to requirements.txt when starting bimodal analysis
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating
## Discovered During Work
(Add new items here as they come up during analysis)