Add beginner tutorial notebooks for incoming students

Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move TARGETS_DIR to /mnt/data/projects/cupido/targets
2026-04-30 18:14:17 +01:00 · 2026-04-30 17:13:55 +01:00 · 2026-04-30 15:20:14 +01:00 · 2026-04-27 17:25:26 +01:00
23 changed files with 3450 additions and 214 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,6 +2,11 @@
 data/raw/*.db
 data/processed/*.csv
 # Offline-tracking outputs (regenerable from videos + target JSONs)
 # DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
 data/metadata/video_inventory.csv
 data/logs/*.log
 # Generated figures (reproducible from scripts)
 figures/*.png
--- a/README.md
+++ b/README.md
@@ -46,6 +46,32 @@ The key insight: not all "trained" flies may have actually learned. The trained
 **Read `docs/bimodal_hypothesis.md` for the detailed analysis plan and code sketches.**
 ## Offline Tracking Pipeline (added Apr 2026)
 For tracking new videos that have **no auto-detectable targets**, the pipeline
 is split in two stages so you can sit at the screen and click for an hour, then
 let the tracker grind through overnight.
 ```bash
 # extra deps (ethoscope src must be at /home/gg/Code/ethoscope_project/...)
 pip install -r requirements-tracking.txt
 # 1) build the inventory (xlsx ↔ /mnt/ethoscope_data/videos/)
 python scripts/build_video_inventory.py
 # 2) interactive: click TOP, CORNER, LEFT on each video (one frame per video)
 python scripts/pick_targets.py             # process all not-yet-picked
 python scripts/pick_targets.py --redo      # re-pick already-picked videos
 # keys: r=reset  n=skip  f=jump frame  q/ESC=quit  ENTER=save
 # 3) batch tracking (idempotent, can run in background)
 python scripts/track_videos.py --jobs 4    # parallel
 # output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite, same schema as data/raw/)
 ```
 See `tasks/todo.md` "Offline Tracking" section for the full plan, and
 `data/metadata/video_inventory.csv` for the list of videos to process.
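
To peek inside one of the resulting databases, a minimal sketch using only Python's standard `sqlite3` module might look like the following. It builds a tiny in-memory stand-in with a single `ROI_1` table so that it runs anywhere; the rows are invented for illustration, and the helper names `list_roi_tables` and `count_rows` are hypothetical, not part of this repo. Against a real file you would pass the path of a `*_tracking.db` to `sqlite3.connect` instead.

```python
import sqlite3

# Stand-in for a real *_tracking.db: an in-memory DB with one ROI table.
# The rows below are made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ROI_1 ("
    "id INTEGER PRIMARY KEY, t INTEGER, x INTEGER, y INTEGER, "
    "w INTEGER, h INTEGER, phi INTEGER, "
    "is_inferred INTEGER, has_interacted INTEGER)"
)
conn.executemany(
    "INSERT INTO ROI_1 (t, x, y, w, h, phi, is_inferred, has_interacted) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, 100, 50, 12, 6, 90, 0, 0),   # fly A at t = 0 ms
        (0, 140, 80, 11, 6, 45, 0, 0),   # fly B at t = 0 ms
        (40, 102, 51, 12, 6, 92, 0, 0),  # only one detection at t = 40 ms
    ],
)


def list_roi_tables(connection):
    """Return the names of all ROI_* tables in the database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return sorted(name for (name,) in rows if name.startswith("ROI_"))


def count_rows(connection, table):
    """Count rows in one ROI table."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the known tables before formatting it into the query.
    if table not in list_roi_tables(connection):
        raise ValueError(f"unknown table: {table}")
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


roi_tables = list_roi_tables(conn)
n_rows = count_rows(conn, "ROI_1")
print(roi_tables, n_rows)
```

Listing the tables first is a cheap sanity check that a tracking run produced the ROI tables you expect before you load anything into pandas.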
 ## Folder Structure
 ```
--- a/data/metadata/2025_07_15_metadata_fixed.csv
+++ b/data/metadata/2025_07_15_metadata_fixed.csv
@@ -1,37 +1,37 @@
-date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
+date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
 15/07/2025,16-03-10,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
-15/07/2025,16-03-10,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
+15/07/2025,16-03-10,76,4,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
 15/07/2025,16-03-10,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
-15/07/2025,16-03-10,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
+15/07/2025,16-03-10,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
 15/07/2025,16-03-10,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
-15/07/2025,16-03-10,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
+15/07/2025,16-03-10,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
 15/07/2025,16-31-34,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
 15/07/2025,16-31-34,76,4,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
 15/07/2025,16-31-34,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
-15/07/2025,16-31-34,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
+15/07/2025,16-31-34,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
-15/07/2025,16-31-34,76,3,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
+15/07/2025,16-31-34,76,3,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
-15/07/2025,16-31-34,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
+15/07/2025,16-31-34,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
 15/07/2025,16-03-27,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
 15/07/2025,16-03-27,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
 15/07/2025,16-03-27,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
-15/07/2025,16-03-27,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
+15/07/2025,16-03-27,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
-15/07/2025,16-03-27,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
+15/07/2025,16-03-27,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
-15/07/2025,16-03-27,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
+15/07/2025,16-03-27,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
 15/07/2025,16-31-41,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
 15/07/2025,16-31-41,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
 15/07/2025,16-31-41,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
-15/07/2025,16-31-41,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
+15/07/2025,16-31-41,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
-15/07/2025,16-31-41,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
+15/07/2025,16-31-41,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
-15/07/2025,16-31-41,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
+15/07/2025,16-31-41,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
 15/07/2025,16-31-52,139,6,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
 15/07/2025,16-31-52,139,4,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
 15/07/2025,16-31-52,139,2,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
-15/07/2025,16-31-52,139,5,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
+15/07/2025,16-31-52,139,5,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
-15/07/2025,16-31-52,139,3,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
+15/07/2025,16-31-52,139,3,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
-15/07/2025,16-31-52,139,1,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
+15/07/2025,16-31-52,139,1,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
-15/07/2025,16-32-05,268,6,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
+15/07/2025,16-32-05,268,6,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
-15/07/2025,16-32-05,268,4,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
+15/07/2025,16-32-05,268,4,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
-15/07/2025,16-32-05,268,2,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
+15/07/2025,16-32-05,268,2,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
 15/07/2025,16-32-05,268,5,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
 15/07/2025,16-32-05,268,3,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
 15/07/2025,16-32-05,268,1,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
--- a/data/processed/README.md
+++ b/data/processed/README.md
@@ -1,39 +1,47 @@
 # Processed Data
-Large CSV files generated from the analysis pipeline. All files are gitignored (~370MB total) and can be regenerated.
+CSVs derived from the tracking DBs (`/mnt/data/projects/cupido/tracked/`)
 and the merged TSV (`../../all_video_info_merged.tsv`). All files are
 gitignored and regenerable.
 ## Files and Regeneration
 | File | Description | Generated By |
 |------|-------------|--------------|
-| `trained_roi_data.csv` | Raw tracking data for trained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
+| `distances.csv` | Per-frame inter-fly distances for every (date, machine, ROI, session). Includes metadata columns to filter trained vs naïve, training phase, species, etc. | `scripts/calculate_distances.py` |
-| `untrained_roi_data.csv` | Raw tracking data for untrained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
+| `*_distances_aligned.csv` | (legacy, 2025-07-15 only) distances aligned to barrier opening | `notebooks/flies_analysis*.ipynb` |
-| `trained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
+| `*_tracked.csv` | (legacy) identity-tracked fly positions | `notebooks/flies_analysis_simple.ipynb` |
-| `untrained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
+| `*_max_velocity.csv` | (legacy) max velocity over 10 s windows | `notebooks/flies_analysis_simple.ipynb` |
 | `trained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
 | `untrained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
 | `trained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
 | `untrained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
 | `trained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
 | `untrained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
-## To Regenerate All Data
+## Loading the data
 Run the full notebook `notebooks/flies_analysis_simple.ipynb` with:
 ```python
-recalculate_distances = True
+import sys
-recalculate_tracking = True
+sys.path.insert(0, "../scripts")
 from load_roi_data import load_roi_data
 data = load_roi_data()              # full batch as one DataFrame
 # Or filter the metadata first:
 import pandas as pd
 tsv = pd.read_csv("../../all_video_info_merged.tsv", sep="\t")
 data = load_roi_data(tsv[tsv.species.str.contains("Melanogaster")])
 ```
-**Warning**: Identity tracking and velocity calculations take significant time (~30+ minutes).
+The returned DataFrame has columns:
 `id, t, x, y, w, h, phi, is_inferred, has_interacted, session, ROI, date,
 machine_name, species, male, training_date_time, testing_date_time,
 training_length_hr, consolidation_length_hr, memory, age`.
-## Column Reference
+`session` is `"training"` or `"testing"`; `male` is `"trained"` or
 `"naive"` (canonical — variants like `"naïve"` and `"niave"` are normalized
 at the TSV-export step).
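
The normalization code itself lives in the TSV-export step and is not shown here. As a rough sketch of what such a mapping could look like, assuming a hypothetical helper `normalize_male_label` and a hypothetical alias table, accented characters can be stripped with `unicodedata` and known misspellings mapped explicitly:

```python
import unicodedata

# Hypothetical alias table: known misspellings -> canonical label.
_ALIASES = {"niave": "naive"}


def normalize_male_label(raw):
    """Map variants such as 'naïve' or 'niave' to the canonical 'naive'."""
    # NFKD splits "ï" into "i" plus a combining diaeresis; dropping the
    # combining marks leaves plain ASCII letters.
    decomposed = unicodedata.normalize("NFKD", raw.strip().lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return _ALIASES.get(plain, plain)


labels = [normalize_male_label(s) for s in ["Naïve", "niave", " trained "]]
print(labels)
```

If you ever see a label other than `trained` or `naive` in the `male` column, that points to a variant the export step does not yet handle.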
-### Distance CSVs (`*_distances_aligned.csv`)
+## Column Reference (`distances.csv`)
- `machine_name`: Ethoscope machine ID (string)
+
- `ROI`: ROI number (1-6)
+- `date`, `machine_name`, `ROI`, `session`: identifies one fly trajectory
- `aligned_time`: Time in ms relative to barrier opening (0 = opening)
+- `t`: time in ms within that session
- `distance`: Euclidean distance between flies in pixels
+- `distance`: Euclidean distance between the two flies in pixels
- `n_flies`: Number of flies detected at this time point
+- `n_flies`: number of fly detections at this frame (1 or 2)
- `area_fly1`, `area_fly2`: Bounding box areas (w*h) in pixels^2
+- `area_fly1`, `area_fly2`: bounding-box areas (`w * h`) in pixels²
- `group`: "trained" or "untrained"
+- `male`: `trained` or `naive` (carried from the xlsx; normalized)
 - `species`, `memory`, `age`: experimental metadata
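
To illustrate how `distance` and `n_flies` relate to the raw detections, the sketch below groups detections by `t` and computes the Euclidean distance when exactly two flies are present. This is not the code from `scripts/calculate_distances.py`, whose implementation is not shown here; the function name `distances_per_frame` is hypothetical, and returning `None` for single-detection frames is an assumption.

```python
import math
from collections import defaultdict


def distances_per_frame(detections):
    """Turn (t, x, y) detections into (t, n_flies, distance) tuples.

    distance is the Euclidean distance in pixels when exactly two
    detections share the same t, and None otherwise (an assumption).
    """
    by_t = defaultdict(list)
    for t, x, y in detections:
        by_t[t].append((x, y))

    out = []
    for t in sorted(by_t):
        points = by_t[t]
        if len(points) == 2:
            (x1, y1), (x2, y2) = points
            distance = math.hypot(x2 - x1, y2 - y1)
        else:
            distance = None
        out.append((t, len(points), distance))
    return out


# Invented example: two flies at t = 0 ms, one detection at t = 40 ms.
rows = [(0, 100, 50), (0, 103, 54), (40, 101, 51)]
result = distances_per_frame(rows)
print(result)
```

Frames with a single detection usually mean the flies overlap or one was missed, so it is worth deciding explicitly how to treat them rather than silently dropping them.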
--- a/notebooks/flies_analysis.ipynb
+++ b/notebooks/flies_analysis.ipynb
@@ -28,7 +28,22 @@
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-   "source": "def load_roi_data():\n    \"\"\"Load ROI data from SQLite databases and group by trained/untrained\"\"\"\n    metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')\n    metadata['machine_name'] = metadata['machine_name'].astype(str)\n    \n    trained_rois = metadata[metadata['group'] == 'trained']\n    untrained_rois = metadata[metadata['group'] == 'untrained']\n    \n    db_files = list(DATA_RAW.glob('*_tracking.db'))\n    \n    trained_df = pd.DataFrame()\n    untrained_df = pd.DataFrame()\n    \n    for db_file in db_files:\n        print(f\"Processing {db_file.name}\")\n        \n        pattern = r'_([0-9a-f]{32})__'\n        match = re.search(pattern, db_file.name)\n        \n        if not match:\n            print(f\"Could not extract UUID from {db_file.name}\")\n            continue\n            \n        uuid = match.group(1)\n        metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]\n        \n        if metadata_matches.empty:\n            print(f\"No metadata matches found for UUID {uuid}\")\n            continue\n            \n        machine_id = metadata_matches.iloc[0]['machine_name']\n        print(f\"Matched to machine ID: {machine_id}\")\n        \n        conn = sqlite3.connect(str(db_file))\n        \n        machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]\n        machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]\n        \n        for _, row in machine_trained.iterrows():\n            roi = row['ROI']\n            try:\n                roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n                roi_data['machine_name'] = machine_id\n                roi_data['ROI'] = roi\n                roi_data['group'] = 'trained'\n                trained_df = pd.concat([trained_df, roi_data], ignore_index=True)\n            except Exception as e:\n                print(f\"Error loading ROI_{roi}: {e}\")\n        \n     
   for _, row in machine_untrained.iterrows():\n            roi = row['ROI']\n            try:\n                roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n                roi_data['machine_name'] = machine_id\n                roi_data['ROI'] = roi\n                roi_data['group'] = 'untrained'\n                untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)\n            except Exception as e:\n                print(f\"Error loading ROI_{roi}: {e}\")\n        \n        conn.close()\n    \n    return trained_df, untrained_df\n\ntrained_data, untrained_data = load_roi_data()\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\n\ntrained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)\nuntrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)\nprint(\"Data saved to CSV files\")"
+   "source": [
    "# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n",
    "# Reason: replaces the old data/raw + 2025_07_15_metadata_fixed.csv path with\n",
    "# the TSV-based loader that covers the entire batch (2025-07-15 + 2024).\n",
    "sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n",
    "from load_roi_data import load_roi_data\n",
    "\n",
    "data = load_roi_data()\n",
    "# Backwards-compat slices for the rest of the notebook.\n",
    "trained_data   = data[data['male'] == 'trained'].copy()\n",
    "untrained_data = data[data['male'] == 'naive'].copy()\n",
    "\n",
    "print(f\"all data:  {data.shape}\")\n",
    "print(f\"trained:   {trained_data.shape}\")\n",
    "print(f\"naive:     {untrained_data.shape}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
@@ -219,4 +234,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}
--- a/notebooks/flies_analysis_simple.ipynb
+++ b/notebooks/flies_analysis_simple.ipynb
@@ -28,7 +28,22 @@
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-   "source": "# Load the pre-processed data\ntrained_data = pd.read_csv(DATA_PROCESSED / 'trained_roi_data.csv')\nuntrained_data = pd.read_csv(DATA_PROCESSED / 'untrained_roi_data.csv')\n\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\nprint(f\"Trained data columns: {list(trained_data.columns)}\")\nprint(f\"Untrained data columns: {list(untrained_data.columns)}\")"
+   "source": [
    "# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n",
    "# Reason: replaces reads of trained_roi_data.csv / untrained_roi_data.csv with\n",
    "# the live loader so the notebook always sees the current batch.\n",
    "sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n",
    "from load_roi_data import load_roi_data\n",
    "\n",
    "data = load_roi_data()\n",
    "trained_data   = data[data['male'] == 'trained'].copy()\n",
    "untrained_data = data[data['male'] == 'naive'].copy()\n",
    "\n",
    "print(f\"all data shape:    {data.shape}\")\n",
    "print(f\"Trained data:      {trained_data.shape}\")\n",
    "print(f\"Naive data:        {untrained_data.shape}\")\n",
    "print(f\"Columns:           {list(trained_data.columns)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
@@ -418,4 +433,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}
--- a/notebooks/getting_started/00_welcome.ipynb
+++ b/notebooks/getting_started/00_welcome.ipynb
@@ -0,0 +1,255 @@
 {
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 00 \u00b7 Welcome to the Cupido fly-tracking project\n",
    "\n",
    "Hi! You're about to start working on a project that studies how *Drosophila*\n",
    "(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n",
    "flies behave differently from na\u00efve ones in their later courtship.\n",
    "\n",
    "**You don't need any prior experience with Python or data science to follow\n",
    "along.** This series of notebooks will walk you through everything, one\n",
    "small step at a time.\n",
    "\n",
    "> **How to read these notebooks**: each notebook is split into \"cells\".\n",
    "> Some cells are explanations (like this one), others are code that you\n",
    "> can **run** by clicking on the cell and pressing `Shift + Enter`. Try it\n",
    "> on the next cell.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# This is a code cell. Click on it and press Shift+Enter to run it.\n",
    "print(\"Hello, fly world!\")\n",
    "1 + 1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should have seen `Hello, fly world!` printed and the number `2`\n",
    "appear underneath. If something else happened, ask Giorgio \u2014 that's a\n",
    "sign the environment isn't set up right.\n",
    "\n",
    "If this is the very first time you're using JupyterLab, take 10 minutes\n",
    "to read the [official \"Getting started with JupyterLab\"\n",
    "guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).\n",
    "The most important things to know are:\n",
    "\n",
    "- A notebook (`.ipynb` file) is a sequence of **cells**.\n",
    "- Each cell is either **Markdown** (formatted text, like this) or **Code**\n",
    "  (Python that the computer runs).\n",
    "- The **kernel** is the running Python process behind the notebook. It\n",
    "  remembers everything you've defined. If something gets weird, restart\n",
    "  the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n",
    "- `Shift + Enter` runs a cell and moves to the next one.\n",
    "- `Ctrl + Enter` runs a cell and stays put.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the project about?\n",
    "\n",
    "*Drosophila* males court females with a stereotyped sequence (chasing,\n",
    "wing-extension, tapping). When a male is rejected by a female (e.g.\n",
    "because she's already mated), he **learns** to suppress his courtship \u2014\n",
    "even toward new, receptive females, for a while. This is a textbook\n",
    "example of *associative learning* in invertebrates ([review on\n",
    "PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n",
    "\n",
    "The lab is interested in:\n",
    "\n",
    "- Does this learning **transfer across species**? (We have ~7 *Drosophila*\n",
    "  species recorded.)\n",
    "- How long does the memory last? (See the `training_length_hr` and\n",
    "  `consolidation_length_hr` columns in the metadata.)\n",
    "- Are there **individual differences** \u2014 do some males learn while others\n",
    "  don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n",
    "\n",
    "Your job, broadly, will be to **turn videos of flies into numbers and\n",
    "plots that answer these questions.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How an experiment works (the bird's-eye view)\n",
    "\n",
    "1. **Training**: a male fly is placed with a non-receptive (mated) female.\n",
    "   He courts, gets rejected, eventually gives up.\n",
    "2. **Consolidation**: wait for some hours, which gives the memory time\n",
    "   to form.\n",
    "3. **Testing**: same male is placed with a fresh receptive female.\n",
    "   Does he court her vigorously, or has he learned to give up easily?\n",
    "\n",
    "Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n",
    "6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n",
    "contains one couple (a male and a female). A camera films the whole arena\n",
    "from above. So one **video** gives us 6 simultaneous experiments.\n",
    "\n",
    "The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n",
    "behavioural recording boxes built in this lab. Each ethoscope is a\n",
    "machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What does the data look like?\n",
    "\n",
    "For each video, the **tracker** (a piece of software that runs after the\n",
    "recording) finds the flies frame-by-frame and writes their positions to a\n",
    "**SQLite database** (a single file, ending in `.db`). One DB per video.\n",
    "Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n",
    "one per sub-arena. Each row of an ROI table is **one fly detection at one\n",
    "moment in time** with these columns:\n",
    "\n",
    "| column | meaning |\n",
    "|---|---|\n",
    "| `id` | row number (auto-incremented) |\n",
    "| `t` | time in **milliseconds** since the video started |\n",
    "| `x`, `y` | fly position in **pixels** (top-left corner of the image is 0,0) |\n",
    "| `w`, `h` | width and height of the bounding box around the fly, in pixels |\n",
    "| `phi` | orientation angle of the fly |\n",
    "| `is_inferred` | 1 if the position was guessed (not directly seen), 0 otherwise |\n",
    "| `has_interacted` | (legacy column, mostly unused) |\n",
    "\n",
    "If a single ROI has two flies that the tracker can see, you'll get **two\n",
    "rows with the same `t`** \u2014 one for each fly. If only one fly is detected\n",
    "(maybe they're on top of each other), you'll get one row.\n",
    "\n",
    "That's the heart of the data. Everything else (distances, velocities,\n",
    "group comparisons) is computed from these (t, x, y) traces.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Where everything lives\n",
    "\n",
    "Take a moment to memorize these locations \u2014 you'll come back to them often.\n",
    "\n",
    "| what | where |\n",
    "|---|---|\n",
    "| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n",
    "| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n",
    "| Source video files | `/mnt/ethoscope_data/videos/` |\n",
    "| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n",
    "| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n",
    "| Your notebooks | `notebooks/getting_started/` (this folder) |\n",
    "\n",
    "Let's verify a couple of these from inside Python:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n",
    "targets = Path(\"/mnt/data/projects/cupido/targets\")\n",
    "\n",
    "n_dbs   = len(list(tracked.glob(\"*_tracking.db\")))\n",
    "n_jsons = len(list(targets.glob(\"*.json\")))\n",
    "\n",
    "print(f\"Tracking DBs available: {n_dbs}\")\n",
    "print(f\"Target JSONs available: {n_jsons}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see roughly 113 tracking DBs and 130 target JSONs. If both\n",
    "numbers are zero, the storage volume is most likely not mounted: ask Giorgio.\n",
    "\n",
    "> **Note**: the tracking DBs are read-only inside the JupyterLab\n",
    "> container. You can read them but not modify or delete them. That's a\n",
    "> deliberate safety measure \u2014 we don't want analysis code accidentally\n",
    "> corrupting the source data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Glossary (refer back as needed)\n",
    "\n",
    "- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n",
    "  arena. There are 6 ROIs per video, numbered 1\u20136.\n",
    "- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n",
    "  same ROI at the same time = two rows with the same `t`.\n",
    "- **trained** \u2014 the male had a training session before testing.\n",
    "- **naive** \u2014 the male is a control (no training).\n",
    "- **training session** \u2014 the recording where the male meets the\n",
    "  non-receptive female (he gets rejected).\n",
    "- **testing session** \u2014 the recording where the male meets a fresh\n",
    "  receptive female (we measure his courtship).\n",
    "- **t (milliseconds)** \u2014 time within one session, starting at 0.\n",
    "- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n",
    "  grows to the right, y grows **downward** (this is the image-coordinate\n",
    "  convention, opposite of math class).\n",
    "- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n",
    "  `ETHOSCOPE_076`.\n",
    "- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
    "  `Erecta`, `Willistoni`, or `CS`.\n",
    "\n",
    "If you bump into other terms in the code, ask. Don't guess \u2014 biology\n",
    "codebases pick up jargon over the years.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next\n",
    "\n",
    "When you're ready, open these notebooks **in order**:\n",
    "\n",
    "1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n",
    "   read and manipulate tabular data.\n",
    "2. `02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n",
    "   trajectory, see what the numbers actually look like.\n",
    "3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n",
    "   comparing groups of flies.\n",
    "\n",
    "After those, the notebooks one level up (`flies_analysis.ipynb`,\n",
    "`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n",
    "previous student built \u2014 those will make sense once you've worked\n",
    "through the tutorials.\n",
    "\n",
    "Don't try to power through all of them in one sitting. Run a few cells,\n",
    "read the explanation, **change a number** to see what happens, **break\n",
    "something on purpose** to see the error message. That's how you learn.\n"
   ]
  }
 ]
 }
--- a/notebooks/getting_started/01_python_pandas_basics.ipynb
+++ b/notebooks/getting_started/01_python_pandas_basics.ipynb
@ -0,0 +1,500 @@
 {
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
    "\n",
    "This notebook teaches the **minimum** Python and `pandas` you need to read\n",
    "the rest of the project's code and write your own analyses.\n",
    "\n",
    "If you've never programmed before, don't try to memorize the syntax.\n",
    "Just run each cell, read what it does, and come back when you're stuck on\n",
    "something specific. The cheat sheet at the end is the only thing worth\n",
    "keeping handy.\n",
    "\n",
    "External resources, in order of how much time they take:\n",
    "\n",
    "- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
    "- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
    "- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
    "- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.  Variables\n",
    "\n",
    "A variable is a named box you put a value into. The `=` is **assignment**,\n",
    "not equality. Read it as \"make `name` refer to `value`\".\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "x = 5\n",
    "y = 3\n",
    "total = x + y\n",
    "print(total)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
    "answer. Try it.\n",
    "\n",
    "Variable names may contain letters, digits, and underscores, but can't\n",
    "start with a digit. Names are case-sensitive (`x` and `X` are different\n",
    "variables). The convention is lowercase `snake_case`: `mean_distance`,\n",
    "not `meanDistance` or `MeanDistance`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.  Strings and numbers\n",
    "\n",
    "A **string** is text in quotes. You can join strings with `+`. You can\n",
    "turn a number into a string with `str()`, and vice-versa with `int()` /\n",
    "`float()`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "species = \"Drosophila melanogaster\"\n",
    "n_flies = 12\n",
    "message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
    "print(message)\n",
    "\n",
    "# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
    "print(f\"We tracked {n_flies} {species} males.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.  Lists\n",
    "\n",
    "A list is an ordered collection of things. Square brackets, items\n",
    "separated by commas. You can mix types (but usually shouldn't).\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
    "print(machines[0])         # first item \u2014 Python counts from 0!\n",
    "print(machines[-1])        # last item\n",
    "print(len(machines))       # how many items\n",
    "print(machines + [\"ETHOSCOPE_140\"])  # concatenate (returns a new list)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.  Dictionaries\n",
    "\n",
    "A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
    "pairs.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
    "print(fly[\"species\"])\n",
    "print(fly[\"age_days\"])\n",
    "fly[\"alive\"] = False         # add a new key\n",
    "print(fly)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.  Conditions: if / elif / else\n",
    "\n",
    "Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
    "Combine with `and`, `or`, `not`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "distance_px = 42\n",
    "\n",
    "if distance_px < 50:\n",
    "    label = \"close\"\n",
    "elif distance_px < 200:\n",
    "    label = \"medium\"\n",
    "else:\n",
    "    label = \"far\"\n",
    "\n",
    "print(label)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.  Loops\n",
    "\n",
    "`for x in collection:` runs the indented block once per item.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "for m in machines:\n",
    "    print(f\"Looking at machine {m}\")\n",
    "\n",
    "# Looping with an index, when you need it:\n",
    "for i, m in enumerate(machines):\n",
    "    print(f\"{i}: {m}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.  Functions\n",
    "\n",
    "A function is a named, reusable chunk of code. `def` declares it. `return`\n",
    "sends a value back to whoever called it.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def fly_age_in_weeks(days):\n",
    "    \"\"\"Return age in weeks given age in days.\"\"\"\n",
    "    return days / 7\n",
    "\n",
    "print(fly_age_in_weeks(14))    # 2.0\n",
    "print(fly_age_in_weeks(5))     # 0.714\u2026\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.  Importing libraries\n",
    "\n",
    "A library is somebody else's code. We use `import` to pull it into our\n",
    "notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import math\n",
    "print(math.sqrt(16))   # 4.0\n",
    "print(math.pi)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9.  Meet pandas\n",
    "\n",
    "Real data is rarely a single number \u2014 it's a **table** with rows and\n",
    "columns (think Excel). `pandas` is the library that handles tables in\n",
    "Python. The two main objects are:\n",
    "\n",
    "- **`Series`** \u2014 a single column with a name.\n",
    "- **`DataFrame`** \u2014 a whole table.\n",
    "\n",
    "By convention we import pandas as `pd`. Always.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read the project's metadata TSV (Tab-Separated Values).\n",
    "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
    "df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
    "\n",
    "# How big is it?\n",
    "print(f\"Rows: {len(df)}\")\n",
    "print(f\"Columns: {df.shape[1]}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10.  Looking at the table\n",
    "\n",
    "`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n",
    "column names. `.dtypes` shows the type of each column.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.head(3)\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(\"Column names:\")\n",
    "for c in df.columns:\n",
    "    print(f\"  {c}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11.  Selecting columns\n",
    "\n",
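    "There are two main ways to get one column: bracket indexing\n",
    "(`df[\"name\"]`) or attribute access (`df.name`). Brackets work for any\n",
    "column name. Attribute access only works if the name is a valid Python\n",
    "identifier (no spaces or punctuation) and doesn't clash with an existing\n",
    "DataFrame attribute or method such as `count` or `size`. When in doubt,\n",
    "use brackets.\n"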
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df[\"species\"].head()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.species.value_counts()    # how many rows per species\n"
   ]
  },
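  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attribute access has a catch that is worth seeing once. The sketch below\n",
    "uses a made-up table (`toy` is illustrative, not project data) with a\n",
    "column called `count`, which is also the name of a DataFrame method.\n",
    "Brackets still return the column; the dot returns the method.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-row table, for illustration only.\n",
    "toy = pd.DataFrame({\"species\": [\"Sechellia\", \"Simulans\"], \"count\": [3, 5]})\n",
    "\n",
    "print(toy[\"count\"].tolist())   # [3, 5]: the column's values\n",
    "print(callable(toy.count))     # True: toy.count is a method, not the column\n"
   ]
  },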
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.  Selecting multiple columns\n",
    "\n",
    "Pass a **list** of names inside the brackets:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.  Filtering rows\n",
    "\n",
    "The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
    "Pandas keeps the rows where it's `True`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "trained = df[df[\"male\"] == \"trained\"]\n",
    "print(f\"trained rows: {len(trained)}\")\n",
    "\n",
    "mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
    "print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
    "\n",
    "# Combine conditions with & (and) or | (or). Wrap each part in parentheses,\n",
    "# because & and | bind more tightly than == does.\n",
    "trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
    "print(f\"trained Mel rows: {len(trained_mel)}\")\n"
   ]
  },
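  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the condition looks like on its own, here is a small sketch\n",
    "on a made-up three-row table (`toy_meta` and its values are illustrative,\n",
    "not project data). The mask is printed first, then used as a filter.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical three-row table, for illustration only.\n",
    "toy_meta = pd.DataFrame({\"male\": [\"trained\", \"naive\", \"trained\"], \"roi\": [1, 2, 3]})\n",
    "\n",
    "mask = toy_meta[\"male\"] == \"trained\"    # a Series of True/False, one per row\n",
    "print(mask.tolist())                    # [True, False, True]\n",
    "print(toy_meta[mask][\"roi\"].tolist())   # [1, 3]\n"
   ]
  },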
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 14.  Grouping and counting\n",
    "\n",
    "`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
    "splits the table by the values in that column and computes something per\n",
    "group.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# How many ROIs per (species, training condition)?\n",
    "df.groupby([\"species\", \"male\"]).size()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 15.  Quick plots\n",
    "\n",
    "DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# How many rows per machine?\n",
    "df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
    "plt.title(\"Number of fly-rows per ethoscope machine\")\n",
    "plt.ylabel(\"rows\")\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.  Exercises\n",
    "\n",
    "Don't skip these. They're how you find out what you actually understood.\n",
    "\n",
    "1. How many rows does `df` have where `age` equals `'5-7'`?\n",
    "2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
    "3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
    "   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
    "4. Make a bar plot of `species` counts. Which species has the most rows?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 1 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 2 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 3 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 4 here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cheat sheet\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "df = pd.read_csv(\"file.tsv\", sep=\"\\t\")     # read\n",
    "df.head(); df.tail(); df.shape; df.columns  # peek\n",
    "df[\"col\"]; df[[\"a\", \"b\"]]                    # select\n",
    "df[df[\"col\"] == \"value\"]                     # filter\n",
    "df.groupby(\"col\").size()                     # count per group\n",
    "df.groupby(\"col\")[\"x\"].mean()                # mean of x per group\n",
    "df[\"col\"].value_counts()                     # quick counts\n",
    "df[\"col\"].unique()                           # unique values\n",
    "df[\"new_col\"] = df[\"w\"] * df[\"h\"]            # derived column\n",
    "df.sort_values(\"col\", ascending=False)       # sort\n",
    "df.plot(...)                                 # quick plot\n",
    "```\n",
    "\n",
    "Keep this list open when reading other people's code. Most of pandas is\n",
    "just combinations of these primitives. When you need more, the official\n",
    "[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
    "is excellent.\n"
   ]
  }
 ]
 }
--- a/notebooks/getting_started/02_explore_one_database.ipynb
+++ b/notebooks/getting_started/02_explore_one_database.ipynb
@ -0,0 +1,439 @@
 {
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 02 \u00b7 A first look at one tracking database\n",
    "\n",
    "In this notebook we open **one** of the SQLite databases that the tracker\n",
    "produced and look at what's actually inside. By the end you'll be able to:\n",
    "\n",
    "- list the tables in a `.db` file\n",
    "- read one ROI's tracking trace into a DataFrame\n",
    "- plot a fly's path through the arena\n",
    "- count how many flies are visible at each moment\n",
    "- compute a simple distance between the two flies in a ROI\n",
    "\n",
    "If you're curious how SQLite works, the\n",
    "[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n",
    "worth reading. For our purposes, **a SQLite database is just a single\n",
    "file containing several tables, each of which we can load into a\n",
    "DataFrame**.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "We import the libraries we need. `sqlite3` is part of Python's standard\n",
    "library \u2014 no install needed.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find the databases\n",
    "\n",
    "The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n",
    "db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n",
    "\n",
    "print(f\"Found {len(db_files)} tracking DBs.\")\n",
    "print(\"\\nFirst 5 by name:\")\n",
    "for db in db_files[:5]:\n",
    "    print(f\"  {db.name}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filename encodes the date, time, machine UUID, and video format\n",
    "(here 1920x1088 pixels at 25 fps), and ends with the suffix\n",
    "`_tracking.db`. For example:\n",
    "\n",
    "```\n",
    "2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
    "\u2514\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u252c\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
    "   date      time            machine UUID                video format\n",
    "```\n",
    "\n",
    "Pick one to explore. Feel free to change the index.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "db_path = db_files[0]\n",
    "print(\"Working with:\", db_path.name)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Open the database\n",
    "\n",
    "We open it **read-only** as a safety measure. The `?mode=ro` flag is\n",
    "SQLite's read-only switch.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What tables are inside?\n",
    "\n",
    "Every SQLite database has a system table called `sqlite_master` that\n",
    "lists everything. We can query it like any other table.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "tables = pd.read_sql_query(\n",
    "    \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n",
    ")\n",
    "tables\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n",
    "sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
    "`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
    "\n",
    "## Read one ROI\n",
    "\n",
    "`pd.read_sql_query()` runs an SQL query against the connection and\n",
    "returns a DataFrame. The query `SELECT * FROM ROI_1` means *\"give me all\n",
    "columns and all rows from the table called ROI_1\"*.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n",
    "print(f\"shape: {roi1.shape}\")     # (rows, columns)\n",
    "roi1.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding the columns\n",
    "\n",
    "Refer back to notebook `00_welcome` for the full column reference. Quick\n",
    "recap of the important ones:\n",
    "\n",
    "- `t`: time in **milliseconds** since the video started.\n",
    "- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
    "  **top-left** corner. y grows downward.\n",
    "- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
    "  rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n",
    "  frames where the tracker merged two flies into one big detection.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Quick descriptive stats\n",
    "roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The minimum `t` should be 0 (start of the video). The maximum tells you\n",
    "how long the recording was. Convert ms to minutes by dividing by 60000:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "duration_min = roi1[\"t\"].max() / 60_000\n",
    "print(f\"Session length: {duration_min:.1f} minutes\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How many flies per frame?\n",
    "\n",
    "If two flies are visible in this ROI, we get **two rows per `t`**. Let's\n",
    "check.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "flies_per_frame = roi1.groupby(\"t\").size()\n",
    "print(flies_per_frame.value_counts().sort_index())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
    "had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
    "overlapping or one is occluded \u2014 that's something we'll handle properly\n",
    "in the next notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot one fly's trajectory\n",
    "\n",
    "We'll plot the position over the first 5 minutes (300 000 ms). At each\n",
    "time point we take the **first** detection (sorted by `id`) and call it\n",
    "\"fly 1\", whether the tracker saw one fly or two in that frame. This is a\n",
    "rough heuristic: nothing guarantees that \"fly 1\" is the same animal from\n",
    "one frame to the next, and identity tracking is harder than it sounds.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Filter to the first 5 minutes\n",
    "sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n",
    "\n",
    "# Pick \"fly 1\" by taking the first row at each time point\n",
    "fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n",
    "\n",
    "plt.figure(figsize=(6, 5))\n",
    "plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n",
    "plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n",
    "plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n",
    "plt.gca().invert_yaxis()   # because pixel y grows downward\n",
    "plt.xlabel(\"x (pixels)\")\n",
    "plt.ylabel(\"y (pixels)\")\n",
    "plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n",
    "plt.legend()\n",
    "plt.axis(\"equal\")\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see a tangle of lines confined to a roughly rectangular ROI.\n",
    "That tangle is the fly walking around its sub-arena.\n",
    "\n",
    "Notice the call to `plt.gca().invert_yaxis()`. In image coordinates y\n",
    "grows downward, but matplotlib draws y growing upward by default. Without\n",
    "the inversion the plot would be vertically flipped relative to the video\n",
    "frame.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot position over time\n",
    "\n",
    "A trajectory plot collapses time into \"shape on a page\". To see *when*\n",
    "things happen we need time on the x-axis.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n",
    "\n",
    "axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
    "axes[0].set_ylabel(\"x (px)\")\n",
    "axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n",
    "\n",
    "axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
    "axes[1].set_ylabel(\"y (px)\")\n",
    "axes[1].set_xlabel(\"time (s)\")\n",
    "axes[1].invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bursts of variation = active fly. Long flat stretches = the fly is sitting\n",
    "still. You'll come to recognize courtship vs idling by eye after a while.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distance between the two flies\n",
    "\n",
    "Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
    "Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n"
   ]
  },
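  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check of the formula on made-up coordinates (not project\n",
    "data): two detections at (0, 0) and (3, 4) form a 3-4-5 right triangle,\n",
    "so they should be 5 pixels apart. `np.hypot(dx, dy)` computes\n",
    "`sqrt(dx**2 + dy**2)`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical positions, for illustration only.\n",
    "x1, y1 = 0, 0\n",
    "x2, y2 = 3, 4\n",
    "print(np.hypot(x1 - x2, y1 - y2))   # 5.0\n"
   ]
  },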
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Keep only the time points where exactly two flies were detected.\n",
    "n_at_t = roi1.groupby(\"t\")[\"t\"].transform(\"size\")\n",
    "two_fly_frames = roi1[n_at_t == 2].sort_values([\"t\", \"id\"])\n",
    "\n",
    "# One row per time point: the lower-id detection becomes (x1, y1), the other (x2, y2).\n",
    "# This is vectorised, so it is much faster than calling a Python function per time point.\n",
    "first = two_fly_frames.drop_duplicates(\"t\", keep=\"first\").set_index(\"t\")\n",
    "second = two_fly_frames.drop_duplicates(\"t\", keep=\"last\").set_index(\"t\")\n",
    "paired = pd.DataFrame({\n",
    "    \"x1\": first[\"x\"], \"y1\": first[\"y\"],\n",
    "    \"x2\": second[\"x\"], \"y2\": second[\"y\"],\n",
    "}).reset_index()\n",
    "\n",
    "paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n",
    "paired.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "plt.figure(figsize=(12, 4))\n",
    "plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n",
    "plt.xlabel(\"time (s)\")\n",
    "plt.ylabel(\"inter-fly distance (px)\")\n",
    "plt.title(\"Distance between the two flies in ROI 1\")\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the kind of trace that drives the rest of the analysis: a male\n",
    "courting a female stays close (small distance); a male giving up wanders\n",
    "off (large distance). The shape of this curve is the behavioural readout.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Don't forget to close the connection\n",
    "\n",
    "If you opened a connection, close it when you're done. (Not strictly\n",
    "necessary in a notebook \u2014 Python tidies up \u2014 but a good habit.)\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "conn.close()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n",
    "   and re-run the trajectory plot. Is the arena bigger / smaller? Why\n",
    "   might that be? (Hint: look at the resolution part of the filename.)\n",
    "2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
    "3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
    "4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
    "   for fly 1 \u2014 when does the bounding box get unusually large?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Exercise space\n"
   ]
  }
 ]
 }
--- a/notebooks/getting_started/03_compare_trained_vs_naive.ipynb
+++ b/notebooks/getting_started/03_compare_trained_vs_naive.ipynb
@ -0,0 +1,398 @@
 {
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 03 \u00b7 Your first real analysis: trained vs naive\n",
    "\n",
    "In notebook 02 we explored a single database. Now we'll work with **all\n",
    "of them at once**, compute a simple per-fly metric, and ask the central\n",
    "question of the project:\n",
    "\n",
    "> **Do trained males behave differently from na\u00efve males in the testing\n",
    "> session?**\n",
    "\n",
    "By the end you'll have:\n",
    "\n",
    "- loaded every (fly, session) trace into one big DataFrame using the\n",
    "  project's helper function;\n",
    "- reduced each trace to one number per fly (the *median inter-fly\n",
    "  distance*);\n",
    "- compared the trained group against the na\u00efve group with a histogram\n",
    "  and a non-parametric statistical test;\n",
    "- learnt enough to start asking your own questions.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy import stats\n",
    "\n",
    "# Tell Python where to find the project's helper modules.\n",
    "PROJECT_ROOT = Path(\"..\").resolve().parent  # this notebook is in notebooks/getting_started/\n",
    "sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
    "\n",
    "from load_roi_data import load_roi_data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading everything at once \u2014 but carefully\n",
    "\n",
    "`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
    "and returns one big DataFrame. **It can be slow and memory-hungry**\n",
    "(the full batch is ~200 million rows). Always start small.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Load the metadata TSV first \u2014 it's small and fast.\n",
    "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
    "meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
    "print(f\"metadata rows: {len(meta)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n",
    "with **just one species and just the testing sessions**, because:\n",
    "\n",
    "1. mixing species is a confound (different species behave differently);\n",
    "2. the question is about behaviour after training, so the testing session\n",
    "   is the relevant one;\n",
    "3. starting small means we can iterate quickly.\n",
    "\n",
    "You can come back later and broaden this filter.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n",
    "sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n",
    "\n",
    "# We're loading every session for these flies, but the loader stamps each\n",
    "# row with a 'session' column so we can filter to testing afterwards.\n",
    "print(f\"selected metadata rows: {len(sub)}\")\n",
    "print(sub[\"male\"].value_counts())\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# This will take a minute or two and use a chunk of RAM. Be patient.\n",
    "data = load_roi_data(sub)\n",
    "print(f\"loaded shape: {data.shape}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What did we get?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "data.head(3)\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# How big is each session, in tracking samples?\n",
    "data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Restrict to the testing session\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "testing = data[data[\"session\"] == \"testing\"].copy()\n",
    "print(f\"testing samples: {len(testing):,}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reduce each trace to one number\n",
    "\n",
    "Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
    "We can't compare distributions of millions of points across two groups\n",
    "in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
    "trace into a single summary number** \u2014 here, the median distance between\n",
    "the two flies during testing.\n",
    "\n",
    "Why median rather than mean? Because tracker glitches (one fly\n",
    "temporarily lost) can produce huge spikes that the median ignores.\n",
    "[Why medians beat means in noisy data\n",
    "(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Step 1 \u2014 per-frame distance.\n",
    "# Take only frames with exactly 2 flies (so we have a real distance).\n",
    "two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
    "\n",
    "# For each (track, t), compute the distance between the two rows.\n",
    "def distance_for_frame(g):\n",
    "    g = g.sort_values(\"id\").reset_index(drop=True)\n",
    "    return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n",
    "\n",
    "# This is the slow step. With ~3 M frames it takes a while.\n",
    "per_frame = (\n",
    "    two_fly\n",
    "    .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n",
    "    .apply(distance_for_frame)\n",
    "    .reset_index(name=\"distance_px\")\n",
    ")\n",
    "print(f\"per-frame distance rows: {len(per_frame):,}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
    "per_fly = (\n",
    "    per_frame\n",
    "    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
    "    .median()\n",
    "    .reset_index(name=\"median_distance_px\")\n",
    ")\n",
    "\n",
    "# Each row now is \"one fly during testing\", with its median distance.\n",
    "print(per_fly.shape)\n",
    "per_fly.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sanity check: how many flies per group?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "per_fly[\"male\"].value_counts()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the numbers are very different, your statistical comparison will be\n",
    "underpowered for one side. Note them down.\n",
    "\n",
    "## Plot the distributions\n",
    "\n",
    "The first thing to do with two groups is to **look at them**. Don't trust\n",
    "a p-value before you've seen the histogram.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "\n",
    "bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n",
    "\n",
    "for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n",
    "    sub = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n",
    "    ax.hist(sub, bins=bins, alpha=0.6, label=f\"{label} (n={len(sub)})\", color=color)\n",
    "\n",
    "ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
    "ax.set_ylabel(\"number of flies\")\n",
    "ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
    "ax.legend()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What you might see:**\n",
    "\n",
    "- If the trained group's distribution is shifted to **higher** distances,\n",
    "  trained males are spending less time near the female (i.e. they\n",
    "  learned to give up).\n",
    "- If the two distributions look identical, no learning effect was\n",
    "  measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
    "  just that this particular summary didn't capture it.\n",
    "- A **bimodal** trained distribution (two humps) would mean some males\n",
    "  learned and others didn't \u2014 the \"individual differences\" story in\n",
    "  `docs/bimodal_hypothesis.md`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add a stat test\n",
    "\n",
    "A formal comparison. Because group sizes are small and we don't know if\n",
    "the data are normally distributed, the\n",
    "[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n",
    "is a safer default than the classic t-test.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n",
    "naive_vals   = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n",
    "\n",
    "stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n",
    "\n",
    "print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n",
    "print(f\"naive   median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n",
    "print(f\"Mann-Whitney U: {stat:.0f}    p-value: {pvalue:.4f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How to read this**: the p-value is the probability of seeing a\n",
    "difference at least this big *if there were really no difference*. By\n",
    "convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n",
    "But never trust a p-value without:\n",
    "\n",
    "1. eyeballing the histogram first (you did);\n",
    "2. reporting the **effect size**, not just the p-value (e.g. the\n",
    "   difference of medians);\n",
    "3. understanding that p-values\n",
    "   [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next?\n",
    "\n",
    "- **Pick a different metric**: instead of median distance, try fraction\n",
    "  of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
    "  the maximum velocity per fly. (Velocity needs identity tracking, which\n",
    "  is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
    "- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
    "  compare. Does the effect generalize? Where is it strongest?\n",
    "- **Look at the bimodality**: a kernel density plot\n",
    "  ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n",
    "  will show humps better than a histogram.\n",
    "- **Time inside the session**: maybe the difference only shows up in the\n",
    "  first few minutes (right after the female is introduced). Slice\n",
    "  `per_frame` by `t` before aggregating.\n",
    "- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n",
    "  testing the \"some flies learn, others don't\" hypothesis.\n",
    "\n",
    "When you write your own analysis, **save it as a new notebook** (don't\n",
    "edit this one). Copy the setup cells, change the question, change the\n",
    "plot. That's how analysis projects grow.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A note on iteration speed\n",
    "\n",
    "The pipeline above is correct but **slow** because we apply a Python\n",
    "function to every (track, t) group. If you find yourself re-running the\n",
    "same expensive computation a lot, save the intermediate result to disk:\n",
    "\n",
    "```python\n",
    "per_frame.to_parquet(\"per_frame_distance.parquet\")\n",
    "# next time:\n",
    "per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n",
    "```\n",
    "\n",
    "`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
    "environment doesn't have it.\n",
    "\n",
    "There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
    "that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
    "correct answer first, optimize only if you find yourself waiting.\n"
   ]
  }
 ]
 }
--- a/notebooks/getting_started/README.md
+++ b/notebooks/getting_started/README.md
@ -0,0 +1,15 @@
 # Tutorial notebooks
 Read these in order:
 1. **`00_welcome.ipynb`** — what's the project, where the data lives,
   how to use a Jupyter notebook.
 2. **`01_python_pandas_basics.ipynb`** — minimum Python and pandas you
   need to read project code.
 3. **`02_explore_one_database.ipynb`** — open one tracking DB, plot a
   trajectory, compute a single distance.
 4. **`03_compare_trained_vs_naive.ipynb`** — first real analysis,
   comparing groups.
 After these, the notebooks one level up (`flies_analysis*.ipynb`) walk
 through the full analysis pipeline that the previous student built.
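As a preview of where notebook 03 ends up, the sketch below runs the same two steps (one median distance per fly, then a Mann-Whitney U test) on a tiny synthetic table. It is an illustration under stated assumptions, not project code: the column names mirror those used in notebook 03, while `make_track`, `median_distance_per_fly`, the machine name and all coordinates are made up. It also swaps the notebook's per-frame `groupby().apply()` for a vectorized `first()`/`last()` subtraction.

```python
import numpy as np
import pandas as pd
from scipy import stats

TRACK = ["date", "machine_name", "ROI"]


def make_track(roi, male, gap_px, n_frames=20):
    """Synthetic track: fly 1 sits gap_px to the right of fly 0 in every frame."""
    fly = np.tile([0, 1], n_frames)
    return pd.DataFrame({
        "date": "2025-01-01",             # made-up value
        "machine_name": "ETHOSCOPE_000",  # made-up value
        "ROI": roi,
        "t": np.repeat(np.arange(n_frames), 2),
        "id": fly,
        "x": fly * float(gap_px),
        "y": 0.0,
        "male": male,
    })


def median_distance_per_fly(df):
    """Median inter-fly distance per (date, machine_name, ROI) track.

    Hypothetical helper: keeps frames with exactly two detections and
    computes the per-frame distance without groupby().apply().
    """
    frame = TRACK + ["t"]
    # Count detections per frame and keep frames with exactly two flies.
    n_detections = df.groupby(frame)["id"].transform("size")
    two = df[n_detections == 2].sort_values(frame + ["id"])
    # first()/last() pick the two flies of each frame; subtract coordinates.
    grouped = two.groupby(frame + ["male"], sort=False)[["x", "y"]]
    delta = grouped.first() - grouped.last()
    dist = np.hypot(delta["x"], delta["y"]).rename("distance_px")
    return (
        dist.groupby(level=TRACK + ["male"]).median()
        .reset_index(name="median_distance_px")
    )


tracks = [make_track(r, "trained", 100 + 10 * r) for r in (1, 2, 3)]
tracks += [make_track(r, "naive", 20 + 10 * r) for r in (4, 5, 6)]
data = pd.concat(tracks, ignore_index=True)
data = data.drop(index=0)  # simulate one frame where a fly was lost

per_fly = median_distance_per_fly(data)
trained = per_fly.loc[per_fly["male"] == "trained", "median_distance_px"]
naive = per_fly.loc[per_fly["male"] == "naive", "median_distance_px"]
stat, pvalue = stats.mannwhitneyu(trained, naive, alternative="two-sided")

print(per_fly)
print(f"Mann-Whitney U: {stat:.0f}    p-value: {pvalue:.3f}")
```

With only three flies per group, the exact test cannot reach p < 0.05 even when the two groups are fully separated, which is a useful reminder to check group sizes before reading a p-value.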
--- a/requirements-tracking.txt
+++ b/requirements-tracking.txt
@ -0,0 +1,11 @@
 # Extra dependencies needed only for the offline-tracking pipeline
 # (build_video_inventory.py, pick_targets.py, auto_detect_targets.py,
 # track_videos.py). Not needed for the existing analysis notebooks.
 #
 # install with: pip install -r requirements-tracking.txt
 opencv-python>=4.8
 openpyxl>=3.1
 gitpython>=3.1
 netifaces>=0.11
 mysql-connector-python>=8.0
 pyserial>=3.5
--- a/scripts/auto_detect_targets.py
+++ b/scripts/auto_detect_targets.py
@ -0,0 +1,119 @@
 """Try auto-detection of L-shape targets on each video and save JSON sidecars.
 Useful for:
 - videos that DO have visible black-circle targets (saves manual clicks);
 - as a smoke test of the whole pipeline before running the picker.
 Failure is non-fatal: a video that fails auto-detection is reported as
 `fail` but gets no JSON sidecar, leaving it for the manual
 `pick_targets.py` tool.
 Output JSON has the same shape as the manual picker's so `track_videos.py`
 can consume either.
 """
 from __future__ import annotations
 import argparse
 import datetime as dt
 import json
 import logging
 import sys
 from pathlib import Path
 import cv2
 import numpy as np
 import pandas as pd
 # ethoscope source tree
 sys.path.insert(0, "/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope")
 from config import INVENTORY_CSV, TARGETS_DIR  # noqa: E402
 from ethoscope.roi_builders.target_roi_builder import TargetGridROIBuilder  # noqa: E402
 def detect_one(video_path: Path, frame_idx: int) -> tuple[list[list[int]], int] | None:
    """Run ethoscope target detection on one frame; return (points, frame_idx) or None."""
    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        return None
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if n > 0 and frame_idx >= n:
        frame_idx = max(0, n - 1)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    if not ok or frame is None:
        return None
    # The detector expects a single-channel image (grey) like ethoscope cameras produce.
    if frame.ndim == 3:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    else:
        gray = frame
    # We don't actually need a fully-configured grid here — _find_target_coordinates
    # alone gives us the 3 reference points.
    builder = TargetGridROIBuilder(n_rows=2, n_cols=3)
    try:
        ref = builder._find_target_coordinates(gray)
    except Exception as e:
        logging.debug("detection failed for %s: %s", video_path.name, e)
        return None
    if ref is None:
        return None
    return [[int(p[0]), int(p[1])] for p in ref], frame_idx
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--frame", type=int, default=125)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--video", type=str, default=None,
                        help="run on a single video path (skips inventory)")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing JSON sidecars")
    args = parser.parse_args()
    TARGETS_DIR.mkdir(parents=True, exist_ok=True)
    if args.video:
        videos = [Path(args.video)]
    else:
        if not INVENTORY_CSV.exists():
            sys.exit("Inventory missing — run build_video_inventory.py first.")
        inv = pd.read_csv(INVENTORY_CSV)
        todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]]
        videos = [Path(p) for p in todo["mp4_path"].tolist()]
        if args.limit:
            videos = videos[: args.limit]
    n_ok = n_fail = n_skip = 0
    for v in videos:
        out = TARGETS_DIR / f"{v.stem}.json"
        if out.exists() and not args.overwrite:
            n_skip += 1
            continue
        result = detect_one(v, args.frame)
        if result is None:
            n_fail += 1
            print(f"  fail: {v.name}")
            continue
        points, used_frame = result
        out.write_text(json.dumps({
            "video_path": str(v),
            "frame_index": int(used_frame),
            "reference_points": points,
            "order": ["top", "corner", "left"],
            "picked_at": dt.datetime.now().isoformat(timespec="seconds"),
            "method": "auto",
        }, indent=2))
        n_ok += 1
        print(f"  ok:   {v.name}  →  {points}")
    print(f"\nDone. ok={n_ok}  fail={n_fail}  skipped(existing)={n_skip}")
 if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
    main()
--- a/scripts/build_video_inventory.py
+++ b/scripts/build_video_inventory.py
@ -0,0 +1,150 @@
 """Build an inventory of videos available on disk and join with the metadata xlsx.
 Scans /mnt/ethoscope_data/videos/<uuid>/<machine_name>/<date_time>/*.mp4
 and produces a CSV mapping each (date, machine_name) row in
 all_video_info_merged.xlsx to the corresponding merged.mp4 path on disk.
 Output: data/metadata/video_inventory.csv with columns:
    machine_uuid, machine_name, session_date, session_time,
    session_datetime, mp4_path, in_xlsx (bool), already_tracked (bool)
 """
 from __future__ import annotations
 import re
 from pathlib import Path
 import pandas as pd
 from config import DATA_RAW, INVENTORY_CSV, VIDEO_INFO_XLSX, VIDEOS_ROOT
 SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$")
 def scan_videos(videos_root: Path) -> pd.DataFrame:
    """Walk videos_root and return one row per merged.mp4 found.
    Args:
        videos_root: Root directory containing <uuid>/<machine_name>/<date_time>/.
    Returns:
        DataFrame with columns: machine_uuid, machine_name, session_date,
        session_time, session_datetime, mp4_path.
    """
    rows = []
    for uuid_dir in sorted(videos_root.iterdir()):
        if not uuid_dir.is_dir():
            continue
        for machine_dir in uuid_dir.iterdir():
            if not machine_dir.is_dir() or not machine_dir.name.startswith("ETHOSCOPE_"):
                continue
            for session_dir in machine_dir.iterdir():
                if not session_dir.is_dir():
                    continue
                m = SESSION_RE.match(session_dir.name)
                if not m:
                    continue
                date_str, time_str = m.group(1), m.group(2)
                # Prefer *_merged.mp4 if present
                merged = sorted(session_dir.glob("*_merged.mp4"))
                if not merged:
                    merged = sorted(session_dir.glob("*.mp4"))
                if not merged:
                    continue
                rows.append(
                    {
                        "machine_uuid": uuid_dir.name,
                        "machine_name": machine_dir.name,
                        "session_date": date_str,
                        "session_time": time_str,
                        "session_datetime": f"{date_str}_{time_str}",
                        "mp4_path": str(merged[0]),
                    }
                )
    return pd.DataFrame(rows)
 def already_tracked_set(data_raw: Path) -> set[tuple[str, str]]:
    """Return the set of (date, time) sessions for which a tracking DB exists.
    DBs are named like:
        2025-07-15_16-03-10_<uuid>__1920x1088@25fps-28q_merged_tracking.db
    """
    out = set()
    for db in data_raw.glob("*_tracking.db"):
        m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name)
        if m:
            out.add((m.group(1), m.group(2)))
    return out
 def main() -> None:
    print(f"Scanning {VIDEOS_ROOT} ...")
    videos_df = scan_videos(VIDEOS_ROOT)
    print(f"  found {len(videos_df)} video sessions on disk")
    print(f"Loading metadata xlsx: {VIDEO_INFO_XLSX}")
    meta = pd.read_excel(VIDEO_INFO_XLSX)
    meta["session_date"] = meta["date"].dt.strftime("%Y-%m-%d")
    # The xlsx has one row per (date, machine, ROI) — collapse to unique sessions
    meta_sessions = (
        meta[["session_date", "machine_name"]].drop_duplicates().reset_index(drop=True)
    )
    print(f"  xlsx contains {len(meta_sessions)} unique (date, machine) sessions")
    # Mark which video sessions are referenced by the xlsx
    xlsx_keys = set(zip(meta_sessions["session_date"], meta_sessions["machine_name"]))
    videos_df["in_xlsx"] = videos_df.apply(
        lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1
    )
    # Mark which already have tracking DBs in data/raw/
    tracked = already_tracked_set(DATA_RAW)
    videos_df["already_tracked"] = videos_df.apply(
        lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1
    )
    INVENTORY_CSV.parent.mkdir(parents=True, exist_ok=True)
    videos_df.sort_values(["session_date", "machine_name", "session_time"]).to_csv(
        INVENTORY_CSV, index=False
    )
    # Coverage report
    in_xlsx = videos_df["in_xlsx"]
    needed = videos_df[in_xlsx & ~videos_df["already_tracked"]]
    n_xlsx_sessions = len(meta_sessions)
    n_with_video = videos_df[in_xlsx].drop_duplicates(
        ["session_date", "machine_name"]
    ).shape[0]
    # xlsx sessions that have no video on disk
    found_keys = set(
        zip(
            videos_df.loc[in_xlsx, "session_date"],
            videos_df.loc[in_xlsx, "machine_name"],
        )
    )
    missing = sorted(xlsx_keys - found_keys)
    print()
    print("=" * 70)
    print(f"Wrote inventory: {INVENTORY_CSV}")
    print(f"  total video sessions on disk: {len(videos_df)}")
    print(f"  xlsx unique sessions:         {n_xlsx_sessions}")
    print(f"  xlsx sessions with video:     {n_with_video}")
    print(f"  xlsx sessions missing video:  {len(missing)}")
    print(f"  already tracked (DB exists):  {videos_df['already_tracked'].sum()}")
    print(f"  TO TRACK (in_xlsx & ~tracked, video instances): {len(needed)}")
    if missing:
        print()
        print("xlsx sessions with NO matching video on disk:")
        for d, m in missing[:20]:
            print(f"  {d}  {m}")
        if len(missing) > 20:
            print(f"  ... and {len(missing) - 20} more")
 if __name__ == "__main__":
    main()
--- a/scripts/calculate_distances.py
+++ b/scripts/calculate_distances.py
@ -1,117 +1,99 @@
-import pandas as pd
+"""Compute per-frame inter-fly distances for every (date, machine, ROI, session).
 Reads tracking data via :func:`load_roi_data.load_roi_data` (which is driven
 by ``all_video_info_merged.tsv``) and produces one distances DataFrame
 spanning every fly/session in the batch. Group membership (``trained`` /
 ``naive``) is preserved from the ``male`` column.
 """
 import numpy as np
 import pandas as pd
 from scipy.spatial.distance import euclidean
 from config import DATA_PROCESSED
 from load_roi_data import load_roi_data
-def calculate_fly_distances(trained_file=None, untrained_file=None):
+def calculate_fly_distances(data: pd.DataFrame | None = None) -> pd.DataFrame:
-    """Calculate distances between flies at each time point.
+    """Compute inter-fly distances over time for every fly/session.
-    For each time point:
+    For each time point inside one (date, machine, ROI, session) trajectory:
-    - If two flies are detected: calculate Cartesian distance between them
+    - 2+ flies detected: Euclidean distance between the first two by id
-    - If one fly is detected: set distance to 0 if area > average area, otherwise NaN
+    - 1 fly detected: distance = 0 if its bbox area exceeds the global
      mean (likely a single blob containing both flies), else NaN
    Args:
-        trained_file (Path): Path to trained ROI data CSV.
+        data: optional pre-loaded DataFrame from :func:`load_roi_data`. If
-        untrained_file (Path): Path to untrained ROI data CSV.
+            None, the full batch is loaded.
    Returns:
-        tuple: (trained_distances, untrained_distances) DataFrames.
+        DataFrame with one row per (track, time) pair, including ``distance``,
        ``n_flies``, ``area_fly1``, ``area_fly2``, plus the metadata columns
        propagated from the source row (``date``, ``machine_name``, ``ROI``,
        ``session``, ``male``, ``species``, ``memory``, ``age``).
    """
-    if trained_file is None:
+    if data is None:
-        trained_file = DATA_PROCESSED / 'trained_roi_data.csv'
+        data = load_roi_data()
-    if untrained_file is None:
+    if data.empty:
-        untrained_file = DATA_PROCESSED / 'untrained_roi_data.csv'
+        return pd.DataFrame()
-    trained_df = pd.read_csv(trained_file)
+    data = data.copy()
-    untrained_df = pd.read_csv(untrained_file)
+    data["area"] = data["w"] * data["h"]
-
+    avg_area = data["area"].mean()
    trained_df['area'] = trained_df['w'] * trained_df['h']
    untrained_df['area'] = untrained_df['w'] * untrained_df['h']
    avg_area = np.mean([trained_df['area'].mean(), untrained_df['area'].mean()])
    print(f"Average area across all data: {avg_area:.2f}")
-    trained_distances = process_distance_data(trained_df, avg_area)
+    # Carry these onto every output row (constant within a track).
-    untrained_distances = process_distance_data(untrained_df, avg_area)
+    keep_meta = ["date", "machine_name", "ROI", "session", "male",
                 "species", "memory", "age"]
-    return trained_distances, untrained_distances
+    rows: list[dict] = []
-
+    track_keys = ["date", "machine_name", "ROI", "session"]
-
+    for track, track_df in data.groupby(track_keys, sort=False):
-def process_distance_data(df, avg_area):
+        meta_row = {k: v for k, v in zip(track_keys, track)}
-    """Process a DataFrame to calculate distances between flies at each time point.
+        # Carry the rest of the metadata from any sample (constant per track).
-
+        sample = track_df.iloc[0]
-    Args:
+        for col in keep_meta:
-        df (pd.DataFrame): Input tracking data.
+            if col not in meta_row:
-        avg_area (float): Average area threshold for single-fly detection.
+                meta_row[col] = sample[col]
    Returns:
        pd.DataFrame: Distance data with columns for machine, ROI, time, distance.
    """
    results = []
    for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']):
        for t, time_group in group.groupby('t'):
            time_group = time_group.sort_values('id').reset_index(drop=True)
        for t, time_group in track_df.groupby("t", sort=False):
            time_group = time_group.sort_values("id").reset_index(drop=True)
            row = dict(meta_row)
            row["t"] = t
            if len(time_group) >= 2:
-                fly1 = time_group.iloc[0]
+                f1, f2 = time_group.iloc[0], time_group.iloc[1]
-                fly2 = time_group.iloc[1]
+                row["distance"] = euclidean([f1["x"], f1["y"]], [f2["x"], f2["y"]])
-                distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])
+                row["n_flies"] = len(time_group)
                row["area_fly1"] = f1["area"]
                row["area_fly2"] = f2["area"]
            else:
                f = time_group.iloc[0]
                row["distance"] = 0.0 if f["area"] > avg_area else np.nan
                row["n_flies"] = 1
                row["area_fly1"] = f["area"]
                row["area_fly2"] = np.nan
            rows.append(row)
-                results.append({
+    return pd.DataFrame(rows)
                    'machine_name': machine_name,
                    'ROI': roi,
                    't': t,
                    'distance': distance,
                    'n_flies': len(time_group),
                    'area_fly1': fly1['area'],
                    'area_fly2': fly2['area']
                })
            elif len(time_group) == 1:
                fly = time_group.iloc[0]
                area = fly['area']
                if area > avg_area:
                    distance = 0.0
                else:
                    distance = np.nan
                results.append({
                    'machine_name': machine_name,
                    'ROI': roi,
                    't': t,
                    'distance': distance,
                    'n_flies': 1,
                    'area_fly1': area,
                    'area_fly2': np.nan
                })
    return pd.DataFrame(results)
-def main():
+def main() -> None:
-    """Run distance calculations and save results."""
+    distances = calculate_fly_distances()
    trained_distances, untrained_distances = calculate_fly_distances()
-    print(f"Trained data distance summary:")
+    print("\nDistance summary:")
-    print(f"  Shape: {trained_distances.shape}")
+    print(f"  Shape: {distances.shape}")
-    print(f"  Distance stats:")
+    if not distances.empty:
-    print(f"    Count: {trained_distances['distance'].count()}")
+        print(f"  Distance count: {distances['distance'].count()}")
-    print(f"    Mean: {trained_distances['distance'].mean():.2f}")
+        print(f"  Distance mean:  {distances['distance'].mean():.2f}")
-    print(f"    Std: {trained_distances['distance'].std():.2f}")
+        print(f"  Distance std:   {distances['distance'].std():.2f}")
        male = distances["male"]
        print(f"  Trained rows:   {(male == 'trained').sum()}")
        print(f"  Naive rows:     {(male == 'naive').sum()}")
-    print(f"\nUntrained data distance summary:")
+    DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
-    print(f"  Shape: {untrained_distances.shape}")
+    out = DATA_PROCESSED / "distances.csv"
-    print(f"  Distance stats:")
+    distances.to_csv(out, index=False)
-    print(f"    Count: {untrained_distances['distance'].count()}")
+    print(f"\nSaved {out}")
    print(f"    Mean: {untrained_distances['distance'].mean():.2f}")
    print(f"    Std: {untrained_distances['distance'].std():.2f}")
    trained_distances.to_csv(DATA_PROCESSED / 'trained_distances.csv', index=False)
    untrained_distances.to_csv(DATA_PROCESSED / 'untrained_distances.csv', index=False)
    print("\nDistance data saved")
 if __name__ == "__main__":
--- a/scripts/config.py
+++ b/scripts/config.py
@ -7,3 +7,16 @@ DATA_RAW = PROJECT_ROOT / "data" / "raw"
 DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
 DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
 FIGURES = PROJECT_ROOT / "figures"
 # Offline-tracking pipeline paths
 VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos")
 VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx"
 INVENTORY_CSV = DATA_METADATA / "video_inventory.csv"
 # Reason: kept on the local data volume alongside the tracking DBs (out of
 # ownCloud sync). See TRACKING_OUTPUT_DIR comment below.
 TARGETS_DIR = Path("/mnt/data/projects/cupido/targets")
 # Reason: tracking DBs are large binary files that don't belong in
 # ownCloud-synced storage (sync conflicts + bandwidth). They live on the
 # local data volume instead. Regenerable from videos + target JSONs.
 TRACKING_OUTPUT_DIR = Path("/mnt/data/projects/cupido/tracked")
 LOGS_DIR = PROJECT_ROOT / "data" / "logs"
--- a/scripts/export_video_db_index.py
+++ b/scripts/export_video_db_index.py
@ -0,0 +1,181 @@
 """Augment all_video_info_merged.xlsx with the input video + tracking DB paths.
 Each xlsx row represents one fly (date, machine_name, ROI), observed across a
 training session and a testing session. We resolve those two sessions to the
 on-disk video files (via the inventory CSV) and to their tracking DBs (under
 TRACKING_OUTPUT_DIR), then write the result as TSV.
 Output columns added:
    training_video_path, training_db_path,
    testing_video_path,  testing_db_path
 Empty values mean either no video matched (rare — implies missing inventory
 entry) or no DB exists yet (e.g. the one video the completeness gate
 rejected).
 Usage:
    python export_video_db_index.py
    python export_video_db_index.py --out path/to/output.tsv
 """
 from __future__ import annotations
 import argparse
 import re
 from pathlib import Path
 import pandas as pd
 from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX
 _TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE)
 def parse_xlsx_time(value: str) -> tuple[str, int] | None:
    """Convert '20241021_11AM' / '20240918_1030AM' to (YYYY-MM-DD, minutes24).
    The second element counts minutes from midnight on a 24-hour clock.
    Resolution is hour-only when no minutes are given (e.g. '11AM' → 11:00),
    which is enough for nearest-neighbor matching against session start times.
    """
    if not isinstance(value, str):
        return None
    m = _TIME_RE.match(value.strip())
    if not m:
        return None
    ymd, hh, mm, ampm = m.groups()
    date = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:8]}"
    hour = int(hh)
    minute = int(mm) if mm else 0
    if ampm.upper() == "PM" and hour != 12:
        hour += 12
    if ampm.upper() == "AM" and hour == 12:
        hour = 0
    return date, hour * 60 + minute
 def build_session_index(inventory: pd.DataFrame) -> dict[tuple[str, str], list[dict]]:
    """Index inventory rows by (date, machine_name) → list of session dicts."""
    idx: dict[tuple[str, str], list[dict]] = {}
    for row in inventory.itertuples(index=False):
        h, m, _s = (int(p) for p in str(row.session_time).split("-"))
        key = (row.session_date, row.machine_name)
        idx.setdefault(key, []).append({
            "mp4_path": row.mp4_path,
            "session_datetime": row.session_datetime,
            "minutes": h * 60 + m,
        })
    return idx
 def db_path_for_video(mp4_path: str) -> Path | None:
    """Tracker writes <video_stem>_tracking.db under TRACKING_OUTPUT_DIR."""
    stem = Path(mp4_path).stem
    db = TRACKING_OUTPUT_DIR / f"{stem}_tracking.db"
    return db if db.exists() else None
 _TIME_TOLERANCE_MIN = 90  # xlsx labels are approximate ("11AM" → 10:51 is fine)
 def resolve_session(
    machine_name: str,
    when: str,
    fallback_date: str | None,
    index: dict[tuple[str, str], list[dict]],
 ) -> tuple[str, str]:
    """Look up the video + db whose start time is closest to `when`.
    Match strategy:
    1. Use the date embedded in `when` (training/testing can fall on a
       different calendar day from the row's ``date`` column).
    2. If no candidates exist for that date, fall back to ``fallback_date``
       (the xlsx row's ``date`` column). Reason: the xlsx contains
       date typos like '20240110_11AM' for an Oct 1 experiment.
    Among candidates, pick the video whose start minute is closest to the
    xlsx-claimed time, within ±_TIME_TOLERANCE_MIN.
    """
    parsed = parse_xlsx_time(when)
    if parsed is None:
        return "", ""
    date, target_min = parsed
    candidates = index.get((date, machine_name), [])
    if not candidates and fallback_date:
        candidates = index.get((fallback_date, machine_name), [])
    if not candidates:
        return "", ""
    def _gap(target: int, c: dict) -> int:
        # Reason: xlsx times like '1230AM' are ambiguous (12 AM vs 12 PM).
        # We try both the literal time AND a +12-hour shift, picking the
        # interpretation that brings us closest to a real session.
        return min(abs(c["minutes"] - target), abs(c["minutes"] - (target + 720) % 1440))
    best = min(candidates, key=lambda c: _gap(target_min, c))
    if _gap(target_min, best) > _TIME_TOLERANCE_MIN:
        return "", ""
    db = db_path_for_video(best["mp4_path"])
    return best["mp4_path"], (str(db) if db else "")
 # Variants of "naive" the xlsx has accumulated: 'naïve', 'niave', plus
 # trailing whitespace. All collapse to a single canonical 'naive'.
 _MALE_NAIVE_VARIANTS = {"naïve", "niave", "naive"}
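 def _normalize_metadata(df: pd.DataFrame) -> None:
    """Strip whitespace and canonicalize the ``male`` column in place."""
    for col in df.select_dtypes(include=("object", "string")).columns:
        # Reason: only strip real strings; astype(str) would turn missing
        # cells into the literal text "nan" in the exported TSV.
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    df["male"] = df["male"].map(
        lambda v: "naive"
        if isinstance(v, str) and v.lower() in _MALE_NAIVE_VARIANTS
        else v
    )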
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--out",
        type=Path,
        default=VIDEO_INFO_XLSX.with_suffix(".tsv"),
        help="output TSV path (default: alongside the xlsx)",
    )
    args = parser.parse_args()
    inv = pd.read_csv(INVENTORY_CSV)
    inv = inv[inv["in_xlsx"]].copy()
    index = build_session_index(inv)
    df = pd.read_excel(VIDEO_INFO_XLSX)
    _normalize_metadata(df)
    date_iso = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
    train_videos, train_dbs, test_videos, test_dbs = [], [], [], []
    for fallback, row in zip(date_iso, df.itertuples(index=False)):
        tv, td = resolve_session(row.machine_name, row.training_date_time, fallback, index)
        sv, sd = resolve_session(row.machine_name, row.testing_date_time, fallback, index)
        train_videos.append(tv)
        train_dbs.append(td)
        test_videos.append(sv)
        test_dbs.append(sd)
    df["training_video_path"] = train_videos
    df["training_db_path"] = train_dbs
    df["testing_video_path"] = test_videos
    df["testing_db_path"] = test_dbs
    df.to_csv(args.out, sep="\t", index=False)
    n_rows = len(df)
    n_train_video = sum(bool(v) for v in train_videos)
    n_train_db = sum(bool(v) for v in train_dbs)
    n_test_video = sum(bool(v) for v in test_videos)
    n_test_db = sum(bool(v) for v in test_dbs)
    print(f"wrote {args.out}  ({n_rows} rows)")
    print(f"  training:  {n_train_video} with video,  {n_train_db} with DB")
    print(f"  testing:   {n_test_video} with video,  {n_test_db} with DB")
 if __name__ == "__main__":
    main()
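Review note: the label parsing and nearest-session matching in `export_video_db_index.py` can be tried in isolation. The sketch below is a minimal standalone version, assuming the same label format and the same 90-minute tolerance; `to_minutes`, `gap` and `closest_session` are hypothetical names used only for illustration, and sessions are reduced to plain start minutes instead of inventory dicts.

```python
from __future__ import annotations

import re

# Same pattern as _TIME_RE in the script: YYYYMMDD_H[H][MM](AM|PM).
TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE)


def to_minutes(label: str) -> tuple[str, int] | None:
    """Return (ISO date, minutes since midnight), or None if unparseable."""
    m = TIME_RE.match(label.strip())
    if not m:
        return None
    ymd, hh, mm, ampm = m.groups()
    hour = int(hh)
    if ampm.upper() == "PM" and hour != 12:
        hour += 12
    if ampm.upper() == "AM" and hour == 12:
        hour = 0
    return f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:8]}", hour * 60 + int(mm or 0)


def gap(target: int, session_min: int) -> int:
    """Distance in minutes, also trying the label shifted by 12 h."""
    flipped = (target + 720) % 1440
    return min(abs(session_min - target), abs(session_min - flipped))


def closest_session(
    target: int, sessions: list[int], tolerance: int = 90
) -> int | None:
    """Pick the session start closest to target, or None if too far away."""
    if not sessions:
        return None
    best = min(sessions, key=lambda s: gap(target, s))
    return best if gap(target, best) <= tolerance else None


_, eleven_am = to_minutes("20241021_11AM")
print(closest_session(eleven_am, [651, 900]))      # the 10:51 session wins
_, half_past_midnight = to_minutes("20240918_1230AM")
print(closest_session(half_past_midnight, [760]))  # 12:40 matches via the 12 h shift
print(closest_session(eleven_am, [900]))           # 15:00 is outside the tolerance
```

Note that the gap is a plain absolute difference, so a label just before midnight and a session just after it are treated as far apart; that mirrors the script as written.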
--- a/scripts/load_roi_data.py
+++ b/scripts/load_roi_data.py
@ -1,90 +1,113 @@
-import pandas as pd
+"""Load ROI tracking data from all sessions into one DataFrame.
 Drives off the merged TSV (one row per ROI/fly across training + testing
 phases). For each TSV row, opens the corresponding tracking DB and pulls
 the matching ROI table, then attaches the experimental metadata.
 The TSV is the single source of truth for what data exists and how it
 maps to flies and conditions.
 """
 import sqlite3
-import re
+from pathlib import Path
-from config import DATA_RAW, DATA_METADATA, DATA_PROCESSED
+import pandas as pd
 from config import VIDEO_INFO_XLSX
-def load_roi_data():
+# Metadata columns to copy onto every tracking sample. These are the xlsx
-    """Load ROI data from SQLite databases and group by trained/untrained.
+# fields that describe the experimental condition behind each fly/ROI.
 # Reason: the ROI column is uppercase ("ROI") for backwards compatibility
 # with the existing analysis pipeline (calculate_distances.py, notebooks).
 _META_COLS = (
    "date",
    "machine_name",
    "species",
    "male",
    "training_date_time",
    "testing_date_time",
    "training_length_hr",
    "consolidation_length_hr",
    "memory",
    "age",
 )
 def _open_ro(db_path: str, cache: dict) -> sqlite3.Connection | None:
    """Cached read-only sqlite connection. Returns None on failure."""
    if not isinstance(db_path, str) or not db_path:
        return None
    if db_path not in cache:
        try:
            cache[db_path] = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        except sqlite3.Error as e:
            print(f"failed to open {Path(db_path).name}: {e}")
            cache[db_path] = None
    return cache[db_path]
 def load_roi_data(meta: pd.DataFrame | None = None) -> pd.DataFrame:
    """Load ROI tracking data joined with experimental metadata.
    For each row in ``meta``, reads the matching ROI table from both the
    training DB and the testing DB (whichever exist), and stamps every
    sample with the row's metadata plus a ``session`` column
    (``"training"`` or ``"testing"``). Rows with empty DB paths (unusable
    videos, or videos that didn't pass the completeness gate) are skipped.
    Args:
        meta: optional DataFrame with the same schema as
            ``all_video_info_merged.tsv``. Pass a filtered slice to load a
            subset (e.g. ``meta[meta.species == 'Melanogaster/CS']``).
            Defaults to the full TSV.
    Returns:
-        tuple: (trained_df, untrained_df) DataFrames with tracking data.
+        DataFrame with columns ``id, t, x, y, w, h, phi, is_inferred,
        has_interacted, session, ROI, <metadata>`` — one row per tracking
        sample. Empty if nothing could be loaded.
    """
-    metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')
+    if meta is None:
-    metadata['machine_name'] = metadata['machine_name'].astype(str)
+        meta = pd.read_csv(VIDEO_INFO_XLSX.with_suffix(".tsv"), sep="\t")
-    trained_rois = metadata[metadata['group'] == 'trained']
+    db_cache: dict = {}
-    untrained_rois = metadata[metadata['group'] == 'untrained']
+    chunks: list[pd.DataFrame] = []
-    db_files = list(DATA_RAW.glob('*_tracking.db'))
+    for row in meta.itertuples(index=False):
-
+        for session in ("training", "testing"):
-    trained_df = pd.DataFrame()
+            conn = _open_ro(getattr(row, f"{session}_db_path"), db_cache)
-    untrained_df = pd.DataFrame()
+            if conn is None:
-
+                continue
    for db_file in db_files:
        print(f"Processing {db_file.name}")
        pattern = r'_([0-9a-f]{32})__'
        match = re.search(pattern, db_file.name)
        if not match:
            print(f"Could not extract UUID from {db_file.name}")
            continue
        uuid = match.group(1)
        metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]
        if metadata_matches.empty:
            print(f"No metadata matches found for UUID {uuid} from {db_file.name}")
            continue
        machine_id = metadata_matches.iloc[0]['machine_name']
        print(f"Matched to machine ID: {machine_id}")
        conn = sqlite3.connect(str(db_file))
        machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]
        machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]
        for _, row in machine_trained.iterrows():
            roi = row['ROI']
            try:
-                query = f"SELECT * FROM ROI_{roi}"
+                df = pd.read_sql_query(
-                roi_data = pd.read_sql_query(query, conn)
+                    f"SELECT * FROM ROI_{int(row.roi)}", conn
-                roi_data['machine_name'] = machine_id
+                )
                roi_data['ROI'] = roi
                roi_data['group'] = 'trained'
                trained_df = pd.concat([trained_df, roi_data], ignore_index=True)
            except Exception as e:
-                print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
+                # Reason: a DB may be missing a ROI table if tracking was
                # partial — skip rather than abort the whole batch.
                print(f"  ROI_{row.roi} from {session} DB: {e}")
                continue
            df["session"] = session
            df["ROI"] = int(row.roi)
            for col in _META_COLS:
                df[col] = getattr(row, col)
            chunks.append(df)
-        for _, row in machine_untrained.iterrows():
+    for conn in db_cache.values():
-            roi = row['ROI']
+        if conn is not None:
-            try:
+            conn.close()
                query = f"SELECT * FROM ROI_{roi}"
                roi_data = pd.read_sql_query(query, conn)
                roi_data['machine_name'] = machine_id
                roi_data['ROI'] = roi
                roi_data['group'] = 'untrained'
                untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)
            except Exception as e:
                print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
-        conn.close()
+    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    return trained_df, untrained_df
 if __name__ == "__main__":
-    trained_data, untrained_data = load_roi_data()
+    data = load_roi_data()
-    print(f"Trained data shape: {trained_data.shape}")
+    print(f"shape: {data.shape}")
-    print(f"Untrained data shape: {untrained_data.shape}")
+    if not data.empty:
-    if not trained_data.empty:
+        print(f"columns: {list(data.columns)}")
-        print("Trained data columns:", trained_data.columns.tolist())
+        print(f"sessions: {data['session'].value_counts().to_dict()}")
-    if not untrained_data.empty:
+        print(f"unique machines: {data['machine_name'].nunique()}")
-        print("Untrained data columns:", untrained_data.columns.tolist())
+        print(
-
+            f"unique flies (date,machine,roi): "
-    trained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)
+            f"{data.groupby(['date','machine_name','ROI']).ngroups}"
-    untrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)
+        )
    print("Data saved to trained_roi_data.csv and untrained_roi_data.csv")
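Review note: the new loader opens each tracking DB read-only and stamps every sample with its session and fly metadata. The sketch below reproduces that pattern against a throwaway SQLite file, assuming a POSIX-style path; `read_roi`, the file name, the table contents and the metadata values are all hypothetical stand-ins, not the project's real schema.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def read_roi(db_path: str, roi: int, session: str, meta: dict) -> pd.DataFrame:
    """Read one ROI table through a read-only connection and tag each sample."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # int() keeps the interpolated table name numeric.
        df = pd.read_sql_query(f"SELECT * FROM ROI_{int(roi)}", conn)
    finally:
        conn.close()
    df["session"] = session
    df["ROI"] = int(roi)
    for col, value in meta.items():
        df[col] = value
    return df


with tempfile.TemporaryDirectory() as tmp:
    db_path = str(Path(tmp) / "demo_tracking.db")  # hypothetical file name

    # Build a tiny stand-in DB with one ROI table and two samples.
    rw = sqlite3.connect(db_path)
    rw.execute("CREATE TABLE ROI_1 (id INTEGER, t INTEGER, x REAL, y REAL)")
    rw.executemany(
        "INSERT INTO ROI_1 VALUES (?, ?, ?, ?)",
        [(1, 0, 10.0, 5.0), (2, 40, 11.5, 5.2)],
    )
    rw.commit()
    rw.close()

    demo = read_roi(
        db_path, 1, "testing", {"machine_name": "ETHOSCOPE_000", "male": "naive"}
    )

    # mode=ro rejects writes, so the loader cannot modify a tracking DB.
    ro = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        ro.execute("DELETE FROM ROI_1")
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    finally:
        ro.close()

print(demo)
print("write blocked:", write_blocked)
```

Because the result is one long DataFrame with a `session` column, a trained-vs-naive split becomes an ordinary filter on the metadata columns rather than two separate return values.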
--- a/scripts/monitor_tracking.py
+++ b/scripts/monitor_tracking.py
@ -0,0 +1,176 @@
 """Live progress + ETA for the offline tracker batch.
 Counts ground-truth (DBs on disk) rather than parsing log lines, so it works
 whether the batch is running fresh or was resumed after a crash. Errors are
 parsed out of the most recent *.log file in data/logs/.
 Usage:
    python monitor_tracking.py              # one snapshot, exit
    python monitor_tracking.py --watch      # refresh every 10 s
    python monitor_tracking.py --watch 30   # refresh every 30 s
 """
 from __future__ import annotations
 import argparse
 import json
 import re
 import time
 from datetime import datetime, timedelta
 from pathlib import Path
 from config import LOGS_DIR, TARGETS_DIR, TRACKING_OUTPUT_DIR
 def count_target_jsons() -> tuple[int, int, list[str]]:
    """Return (n_pickable, n_unusable, unusable_video_stems)."""
    pickable = 0
    unusable_stems: list[str] = []
    for j in TARGETS_DIR.glob("*.json"):
        try:
            d = json.loads(j.read_text())
        except Exception:
            continue
        if d.get("unusable"):
            unusable_stems.append(j.stem)
        elif d.get("reference_points"):
            pickable += 1
    return pickable, len(unusable_stems), unusable_stems
 def count_tracked_dbs() -> tuple[int, datetime | None, str | None]:
    """Return (n_dbs, mtime_of_newest, name_of_newest)."""
    dbs = list(TRACKING_OUTPUT_DIR.glob("*_tracking.db"))
    if not dbs:
        return 0, None, None
    newest = max(dbs, key=lambda p: p.stat().st_mtime)
    return len(dbs), datetime.fromtimestamp(newest.stat().st_mtime), newest.stem
 def parse_recent_errors(log_dir: Path, tail_lines: int = 5000) -> list[str]:
    """Scan the most recent *.log file for lines reporting errors."""
    if not log_dir.exists():
        return []
    logs = sorted(log_dir.glob("*.log"), key=lambda p: p.stat().st_mtime)
    if not logs:
        return []
    latest = logs[-1]
    try:
        with latest.open() as f:
            tail = f.readlines()[-tail_lines:]
    except Exception:
        return []
    out = []
    for line in tail:
        if re.search(r":\s*error\b", line) or " error: " in line.lower():
            out.append(line.rstrip())
    return out
 def db_completion_history() -> list[float]:
    """Return mtimes of all tracking DBs, sorted ascending. Used for rate."""
    return sorted(p.stat().st_mtime for p in TRACKING_OUTPUT_DIR.glob("*_tracking.db"))
 def fmt_duration(seconds: float) -> str:
    if seconds < 60:
        return f"{int(seconds)} s"
    if seconds < 3600:
        return f"{int(seconds // 60)} min"
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    return f"{h} h {m} min"
 def snapshot() -> str:
    pickable, unusable, _ = count_target_jsons()
    tracked, last_mtime, last_name = count_tracked_dbs()
    history = db_completion_history()
    errors = parse_recent_errors(LOGS_DIR)
    lines = [f"tracking progress @ {datetime.now():%Y-%m-%d %H:%M:%S}"]
    lines.append(f"  pickable JSONs:    {pickable}")
    lines.append(f"  unusable JSONs:    {unusable}  (skipped by tracker)")
    pct = (tracked / pickable * 100) if pickable else 0
    lines.append(
        f"  DBs on disk:       {tracked} / {pickable}  ({pct:.0f}%)"
    )
    lines.append(f"  errors in log:     {len(errors)}")
    # Rate from completions in the last 6 h: robust to gaps from killed /
    # restarted runs, while wide enough to span multiple parallel-worker
    # completion bursts. Reason: with 8 workers all started together on
    # multi-hour videos, completions arrive in tight bursts roughly one
    # video length apart; a 30-min window catches one burst and overestimates
    # by ~10×. 6 h spans at least one full burst cycle for typical videos.
    now_ts = time.time()
    window_secs = 6 * 3600
    recent = [t for t in history if t >= now_ts - window_secs]
    if len(recent) >= 2:
        # Reason: with N parallel workers, completions arrive in clumps
        # (all workers finish near-simultaneously). Dividing N by the *burst*
        # span gives nonsense rates. Use the full window as the denominator
        # once the batch has been running long enough to fill it; otherwise
        # use elapsed-since-first-DB. Detection: if every DB on disk also
        # falls inside the window, the batch is younger than the window.
        if len(recent) == len(history):
            elapsed = max(1.0, now_ts - history[0])
        else:
            elapsed = float(window_secs)
        if elapsed > 0:
            rate_per_hour = len(recent) / elapsed * 3600
            lines.append(
                f"  rate (last {len(recent)} in {int(window_secs/3600)} h):"
                f"    {rate_per_hour:.1f} videos/hour"
            )
            remaining = max(0, pickable - tracked)
            if rate_per_hour > 0 and remaining > 0:
                eta_sec = remaining * 3600 / rate_per_hour
                eta_at = datetime.now() + timedelta(seconds=eta_sec)
                lines.append(
                    f"  ETA remaining:     {fmt_duration(eta_sec)}  "
                    f"(done by {eta_at:%H:%M %a})"
                )
    else:
        lines.append("  rate:              (warming up — check again in a few min)")
    if last_mtime is not None and last_name is not None:
        ago = (datetime.now() - last_mtime).total_seconds()
        lines.append(
            f"  most recent DB:    {last_name[:60]}...  ({fmt_duration(ago)} ago)"
        )
    if errors:
        lines.append("")
        lines.append(f"  recent errors ({min(5, len(errors))} of {len(errors)}):")
        for e in errors[-5:]:
            lines.append(f"    {e[:120]}")
    return "\n".join(lines)
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--watch", nargs="?", type=int, const=10, default=None,
        help="refresh every N seconds (default 10 if flag given without value)",
    )
    args = parser.parse_args()
    if args.watch is None:
        print(snapshot())
        return
    try:
        while True:
            # Clear screen and reprint
            print("\033[2J\033[H", end="")
            print(snapshot())
            print(f"\n(refreshing every {args.watch}s — Ctrl-C to exit)")
            time.sleep(args.watch)
    except KeyboardInterrupt:
        print()
 if __name__ == "__main__":
    main()
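Review note: the rate and ETA arithmetic in `snapshot()` is easier to check with synthetic timestamps than against a live batch. The sketch below lifts that logic into two pure functions, assuming the same 6 h window; `estimate_rate` and `eta_seconds` are hypothetical names, and the timestamps are made up.

```python
from __future__ import annotations


def estimate_rate(
    history: list[float], now_ts: float, window_secs: float = 6 * 3600
) -> float | None:
    """Completions per hour from sorted DB mtimes, or None if too few."""
    recent = [t for t in history if t >= now_ts - window_secs]
    if len(recent) < 2:
        return None
    if len(recent) == len(history):
        # Batch is younger than the window: divide by time since first DB.
        elapsed = max(1.0, now_ts - history[0])
    else:
        # Older batch: divide by the full window, not the burst span.
        elapsed = float(window_secs)
    return len(recent) / elapsed * 3600


def eta_seconds(remaining: int, rate_per_hour: float) -> float:
    """Seconds needed to finish `remaining` videos at the given rate."""
    return remaining * 3600 / rate_per_hour


now = 1_000_000.0
# Young batch: 4 DBs, the first finished 2 h ago, so about 2 videos/hour.
young = [now - 7200, now - 7100, now - 3600, now - 100]
# Older batch: one DB from 10 h ago plus 6 inside the window, so about 1/hour.
old = [now - 36000] + [now - 600 * k for k in range(6, 0, -1)]

for name, hist in (("young", young), ("old", old)):
    rate = estimate_rate(hist, now)
    eta_h = eta_seconds(10, rate) / 3600
    print(f"{name}: {rate:.1f} videos/hour, ETA for 10 left: {eta_h:.1f} h")
```

In the older batch the six recent completions all fall within the last hour; dividing by that one-hour span instead of the window would report roughly six times the true rate, which is the overestimate the comments in `snapshot()` describe.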
--- a/scripts/pick_targets.py
+++ b/scripts/pick_targets.py
@ -0,0 +1,467 @@
 """Interactive target picker for offline tracking (matplotlib/Tk GUI).
 Loops through videos that need tracking and lets the user click 3 reference
 points per video in L-shape order:
    1) TOP target (above the corner)
    2) CORNER target (the right-angle vertex)
    3) LEFT target (to the left of the corner)
 These three points are the same reference layout used by ethoscope's
 `TargetGridROIBuilder`: dst_points = [(0, -1), (0, 0), (-1, 0)] in unit
 coordinates. Saving them as a JSON sidecar lets the offline tracker build the
 6-ROI HD mating arena grid without needing auto-target detection.
 Output JSON sidecar: TARGETS_DIR/<video_basename>.json
    {
      "video_path": "/mnt/.../*.mp4",
      "frame_index": <int>,
      "reference_points": [[x0, y0], [x1, y1], [x2, y2]],
      "order": ["top", "corner", "left"],
      "picked_at": "<isoformat>"
    }
 Keys (in the picker window):
    LEFT-CLICK  add a point (top → corner → left)
    r           reset clicks for current video
    d           skip this video for THIS run only (no JSON written)
    u           mark this video unusable (FOV wrong etc.); skipped forever
    .  /  ,     advance / rewind by 25 frames (≈ 1 s @ 25 fps)
    ]  /  [     advance / rewind by 5% of the video (~3 min in a 1 h video)
    #           jump to the middle of the video
    enter       save the 3 points and move on
    q / ESC     quit picker
 After the 3rd click, the 6 ROI rectangles are drawn over the frame so you
 can sanity-check the geometry before pressing ENTER.
 With --redo, if a JSON sidecar exists, its points are pre-loaded so you can
 nudge them rather than restart from scratch.
 Why matplotlib instead of cv2.imshow:
    OpenCV's bundled GUI uses Qt, which needs XKeyboard + a fonts directory and
    is fragile over SSH X11-forwarding. matplotlib's TkAgg backend uses pure
    Tk/X11 and works out of the box on any DISPLAY (and gives free pan/zoom
    via the toolbar — useful for clicking small targets precisely).
 """
 from __future__ import annotations
 import argparse
 import datetime as dt
 import json
 import os
 import sys
 from pathlib import Path
 # Force TkAgg BEFORE importing matplotlib. We override even if MPLBACKEND is
 # already set, because the script is unusable with a non-interactive backend.
 os.environ["MPLBACKEND"] = "TkAgg"
 import cv2  # noqa: E402
 import matplotlib  # noqa: E402
 import matplotlib.pyplot as plt  # noqa: E402
 import numpy as np  # noqa: E402
 import pandas as pd  # noqa: E402
 # matplotlib.backend_bases exposes the cursor identifiers under different
 # names depending on version: `Cursors` enum on 3.5+, lowercase `cursors`
 # instance on older releases. Both have the same integer attributes.
 try:
    from matplotlib.backend_bases import Cursors as _Cursors  # 3.5+
 except ImportError:
    try:
        from matplotlib.backend_bases import cursors as _Cursors  # older
    except ImportError:
        _Cursors = None
 # Verify we ended up on an interactive backend; bail loud (with a concrete
 # explanation) if not. matplotlib silently falls back to 'agg' when its
 # requested backend can't load, which is hard to debug without help.
 _backend = matplotlib.get_backend()
 if _backend.lower() in ("agg", "headless", "template", "pdf", "svg", "ps"):
    diag = []
    try:
        import tkinter as _tk
        try:
            _tk.Tk().destroy()
            diag.append("tkinter import + Tk() instantiation: OK")
        except Exception as e:
            diag.append(f"tkinter imported but Tk() failed: {e!r}")
    except Exception as e:
        diag.append(f"tkinter import FAILED: {e!r}")
        diag.append("  → on Manjaro/Arch, run:  sudo pacman -S tk")
    print(
        f"ERROR: matplotlib loaded the non-interactive backend {_backend!r}.\n"
        f"  Expected 'TkAgg'. Diagnostic info:\n"
        f"    DISPLAY        = {os.environ.get('DISPLAY')!r}\n"
        f"    MPLBACKEND     = {os.environ.get('MPLBACKEND')!r}\n"
        f"    matplotlib ver = {matplotlib.__version__}\n"
        + "\n".join(f"    {d}" for d in diag),
        file=sys.stderr,
    )
    sys.exit(2)
 from config import INVENTORY_CSV, TARGETS_DIR  # noqa: E402
 from tracking_geometry import compute_roi_polygons  # noqa: E402
 # Strip default matplotlib keybindings that would conflict with ours.
 for k in ("keymap.home", "keymap.save", "keymap.quit", "keymap.fullscreen",
          "keymap.pan", "keymap.zoom", "keymap.back", "keymap.forward"):
    try:
        plt.rcParams[k] = []
    except KeyError:
        pass
 CLICK_LABELS = ("TOP", "CORNER", "LEFT")
 CLICK_COLORS = ("red", "lime", "deepskyblue")
 def grab_frame(
    video_path: Path, frame_idx: int
 ) -> tuple[np.ndarray, int, int] | None:
    """Return (RGB frame, actual_frame_idx, n_frames) from the video, or None.
    Clamps frame_idx to [0, n_frames-1] so callers can step blindly.
    """
    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        return None
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if n > 0:
        frame_idx = max(0, min(frame_idx, n - 1))
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    if not ok or frame is None:
        return None
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), frame_idx, n
 def pick_one(
    video_path: Path,
    frame_idx: int,
    status_prefix: str,
    initial_points: list[tuple[float, float]] | None = None,
 ) -> dict | None:
    """Show the picker UI for a single video; return the result dict or None."""
    grabbed = grab_frame(video_path, frame_idx)
    if grabbed is None:
        print(f"  ! cannot read {video_path}")
        return None
    frame, frame_idx, n_frames = grabbed
    # Big-step size for ] / [ : 5% of total length, ~3 min in a 1h video.
    big_step = max(1, int(round(0.05 * n_frames))) if n_frames > 0 else 250
    fig, ax = plt.subplots(figsize=(14, 8))
    try:
        fig.canvas.manager.set_window_title("pick targets")
    except Exception:
        pass
    # Use a crosshair cursor over the axes so it's obvious where the click
    # will land. matplotlib's toolbar resets the cursor to POINTER (arrow) on
    # every mouse-move when no tool is active, so we intercept set_cursor:
    # whenever it asks for POINTER, we substitute SELECT_REGION (crosshair).
    # Tool modes (zoom/pan) keep their native cursors.
    if _Cursors is not None:
        _orig_set_cursor = fig.canvas.set_cursor
        def _set_cursor_with_crosshair(cursor):
            if cursor == _Cursors.POINTER:
                cursor = _Cursors.SELECT_REGION
            return _orig_set_cursor(cursor)
        fig.canvas.set_cursor = _set_cursor_with_crosshair
        try:
            fig.canvas.set_cursor(_Cursors.SELECT_REGION)
        except Exception:
            pass
    else:
        # Last-ditch: just set the Tk widget's cursor once and hope the
        # toolbar doesn't immediately overwrite it.
        try:
            fig.canvas.get_tk_widget().config(cursor="tcross")
        except Exception:
            pass
    img_artist = ax.imshow(frame)
    ax.set_axis_off()
    fig.tight_layout()
    state = {
        "points": list(initial_points) if initial_points else [],
        "action": None,          # 'save' | 'skip' | 'quit' | 'unusable'
        "frame": frame,
        "frame_idx": frame_idx,
        "drawn": [],             # artists drawn on top of the image
    }
    def update_title():
        nb = len(state["points"])
        nxt = (
            f"click {CLICK_LABELS[nb]}"
            if nb < 3
            else "ENTER=save | r=reset d=skip u=unusable q=quit | . , [ ] # = step frame"
        )
        ax.set_title(
            f'{status_prefix}  frame {state["frame_idx"]}  |  {nxt}',
            fontsize=10,
        )
    def redraw_points():
        for a in state["drawn"]:
            try:
                a.remove()
            except Exception:
                pass
        state["drawn"].clear()
        for i, (x, y) in enumerate(state["points"]):
            color = CLICK_COLORS[i]
            label = CLICK_LABELS[i]
            (cross,) = ax.plot(x, y, marker="+", color=color, markersize=22, mew=2)
            (ring,) = ax.plot(
                x, y, marker="o", color=color, markersize=22,
                fillstyle="none", mew=2,
            )
            txt = ax.text(
                x + 14, y - 14, label,
                color=color, fontsize=10, weight="bold",
            )
            state["drawn"].extend([cross, ring, txt])
        if len(state["points"]) >= 2:
            (line1,) = ax.plot(
                [state["points"][0][0], state["points"][1][0]],
                [state["points"][0][1], state["points"][1][1]],
                color="white", linewidth=0.7, alpha=0.6,
            )
            state["drawn"].append(line1)
        if len(state["points"]) == 3:
            (line2,) = ax.plot(
                [state["points"][1][0], state["points"][2][0]],
                [state["points"][1][1], state["points"][2][1]],
                color="white", linewidth=0.7, alpha=0.6,
            )
            state["drawn"].append(line2)
            # ROI overlay — draw the 6 computed rectangles on top of the frame
            try:
                polys = compute_roi_polygons(state["points"])
            except Exception as e:
                polys = []
                print(f"  (ROI preview failed: {e})")
            for j, poly in enumerate(polys):
                # Close the polygon by repeating the first point
                xs = list(poly[:, 0]) + [poly[0, 0]]
                ys = list(poly[:, 1]) + [poly[0, 1]]
                (line,) = ax.plot(
                    xs, ys, color="yellow", linewidth=1.5, alpha=0.9,
                )
                state["drawn"].append(line)
                cx = float(np.mean(poly[:, 0]))
                cy = float(np.mean(poly[:, 1]))
                lbl = ax.text(
                    cx, cy, str(j + 1),
                    color="yellow", fontsize=14, weight="bold",
                    ha="center", va="center",
                )
                state["drawn"].append(lbl)
        update_title()
        fig.canvas.draw_idle()
    def reload_frame(new_idx: int):
        grabbed = grab_frame(video_path, new_idx)
        if grabbed is None:
            return
        new_frame, new_idx, _ = grabbed
        state["frame"] = new_frame
        state["frame_idx"] = new_idx
        img_artist.set_data(new_frame)
        # Keep clicked targets + ROI overlay in place across frame-stepping —
        # press 'r' to clear them explicitly.
        redraw_points()
    def on_click(event):
        if event.inaxes is not ax:
            return
        if event.button != 1:  # left click only
            return
        if event.xdata is None or event.ydata is None:
            return
        # Skip clicks fired while the toolbar's pan/zoom is active.
        toolbar = getattr(fig.canvas, "toolbar", None)
        if toolbar is not None and getattr(toolbar, "mode", ""):
            return
        x, y = float(event.xdata), float(event.ydata)
        if len(state["points"]) < 3:
            state["points"].append((x, y))
        else:
            # 3 points already there — replace the nearest one. Lets the user
            # nudge pre-loaded targets in --redo mode, or correct a bad click.
            dists = [(x - px) ** 2 + (y - py) ** 2 for px, py in state["points"]]
            i_nearest = min(range(3), key=dists.__getitem__)
            state["points"][i_nearest] = (x, y)
        redraw_points()
    def on_key(event):
        k = event.key or ""
        if k in ("escape", "q"):
            state["action"] = "quit"
            plt.close(fig)
        elif k == "r":
            state["points"].clear()
            redraw_points()
        elif k == "d":
            state["action"] = "skip"
            plt.close(fig)
        elif k == "u":
            state["action"] = "unusable"
            plt.close(fig)
        elif k == "enter":
            if len(state["points"]) == 3:
                state["action"] = "save"
                plt.close(fig)
        elif k == ".":
            reload_frame(state["frame_idx"] + 25)
        elif k == ",":
            reload_frame(state["frame_idx"] - 25)
        elif k == "]":
            reload_frame(state["frame_idx"] + big_step)
        elif k == "[":
            reload_frame(state["frame_idx"] - big_step)
        elif k == "#":
            if n_frames > 0:
                reload_frame(n_frames // 2)
    fig.canvas.mpl_connect("button_press_event", on_click)
    fig.canvas.mpl_connect("key_press_event", on_key)
    update_title()
    plt.show()  # blocks until the figure is closed
    if state["action"] == "save":
        return {
            "action": "save",
            "frame_idx": state["frame_idx"],
            "points": state["points"],
        }
    if state["action"] == "unusable":
        return {"action": "unusable", "frame_idx": state["frame_idx"]}
    if state["action"] in ("skip", "quit"):
        return {"action": state["action"]}
    # Window closed via the WM "X" button — treat as quit so the loop stops
    return {"action": "quit"}
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--redo", action="store_true",
        help="re-pick videos that already have JSON sidecars",
    )
    parser.add_argument(
        "--frame", type=int, default=125,
        help="default frame index to display (default 125 ≈ 5 s @ 25 fps)",
    )
    parser.add_argument(
        "--limit", type=int, default=None,
        help="only process the first N videos",
    )
    args = parser.parse_args()
    if not INVENTORY_CSV.exists():
        sys.exit(
            f"Inventory not found at {INVENTORY_CSV}. "
            "Run build_video_inventory.py first."
        )
    inv = pd.read_csv(INVENTORY_CSV)
    todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]].copy()
    todo = todo.sort_values(
        ["session_date", "machine_name", "session_time"]
    ).reset_index(drop=True)
    TARGETS_DIR.mkdir(parents=True, exist_ok=True)
    def sidecar_for(mp4_path: str) -> Path:
        return TARGETS_DIR / (Path(mp4_path).stem + ".json")
    if not args.redo:
        todo = todo[
            ~todo["mp4_path"].apply(lambda p: sidecar_for(p).exists())
        ].reset_index(drop=True)
    if args.limit:
        todo = todo.head(args.limit)
    n = len(todo)
    if n == 0:
        print("Nothing to pick. All eligible videos already have target JSONs.")
        return
    print(
        f"Picking targets for {n} videos. "
        "Window keys: ENTER=save  r=reset  d=skip  u=unusable  q/ESC=quit  "
        ".,[]=step frame  #=jump to midpoint  |  pan/zoom via toolbar"
    )
    saved = skipped = unusable = 0
    for i, row in todo.iterrows():
        mp4 = Path(row["mp4_path"])
        prefix = f"[{i + 1}/{n}] {row['machine_name']} {row['session_datetime']}"
        print(f"\n{prefix}")
        # If --redo and a JSON sidecar exists, pre-load its points (only for
        # regular saves — unusable sidecars are left as-is and shown empty).
        initial_points = None
        existing = sidecar_for(row["mp4_path"])
        if args.redo and existing.exists():
            try:
                prev = json.loads(existing.read_text())
                if not prev.get("unusable") and prev.get("reference_points"):
                    initial_points = [tuple(p) for p in prev["reference_points"]]
                    print(f"  pre-loaded {len(initial_points)} previous point(s)")
            except Exception as e:
                print(f"  ! could not read previous sidecar: {e}")
        result = pick_one(mp4, args.frame, prefix, initial_points=initial_points)
        if result is None or result.get("action") == "quit":
            print("  quitting picker.")
            break
        if result["action"] == "skip":
            skipped += 1
            print("  skipped (no JSON written, will be re-asked next run).")
            continue
        if result["action"] == "unusable":
            try:
                reason = input("  reason for marking unusable (Enter to skip): ").strip()
            except EOFError:
                reason = ""
            payload = {
                "video_path": str(mp4),
                "unusable": True,
                "reason": reason,
                "marked_at": dt.datetime.now().isoformat(timespec="seconds"),
            }
            out_path = sidecar_for(row["mp4_path"])
            out_path.write_text(json.dumps(payload, indent=2))
            unusable += 1
            print(f"  marked unusable → {out_path.name}")
            continue
        if result["action"] == "save":
            payload = {
                "video_path": str(mp4),
                "frame_index": int(result["frame_idx"]),
                "reference_points": [list(map(int, p)) for p in result["points"]],
                "order": ["top", "corner", "left"],
                "picked_at": dt.datetime.now().isoformat(timespec="seconds"),
            }
            out_path = sidecar_for(row["mp4_path"])
            out_path.write_text(json.dumps(payload, indent=2))
            saved += 1
            print(f"  saved → {out_path.name}")
    remaining = n - saved - skipped - unusable
    print(
        f"\nDone. saved={saved}  unusable={unusable}  "
        f"skipped(this run)={skipped}  remaining={remaining}"
    )
 if __name__ == "__main__":
    main()
--- a/scripts/track_videos.py
+++ b/scripts/track_videos.py
@ -0,0 +1,283 @@
 """Headless offline tracker.
 Reads target JSONs produced by `pick_targets.py`, builds the 6 ROIs of the
 HD mating arena from the L-shape reference points, runs ethoscope's
 `MultiFlyTracker` against the merged.mp4 file via `MovieVirtualCamera`, and
 writes a SQLite DB to `TRACKING_OUTPUT_DIR/<video_basename>_tracking.db`.
 Idempotent: skips videos whose tracking DB already exists (unless --redo).
 Usage:
    python track_videos.py                # process all videos with target JSON
    python track_videos.py --redo         # re-track even if DB exists
    python track_videos.py --jobs 4       # run up to 4 videos in parallel
    python track_videos.py --max-duration 1800  # cap each video at 30 min (sec)
 """
 from __future__ import annotations
 import argparse
 import json
 import logging
 import sys
 import traceback
 from concurrent.futures import ProcessPoolExecutor, as_completed
 from pathlib import Path
 # Import ethoscope from the local source tree (no pip install).
 ETHOSCOPE_SRC = Path("/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope")
 sys.path.insert(0, str(ETHOSCOPE_SRC))
 from config import TARGETS_DIR, TRACKING_OUTPUT_DIR  # noqa: E402
 from tracking_geometry import HD_FG_DATA, compute_roi_polygons  # noqa: E402
 def build_rois_from_targets(reference_points):
    """Wrap the shared geometry into ethoscope `ROI` objects."""
    from ethoscope.core.roi import ROI
    polys = compute_roi_polygons(reference_points)
    return [ROI(poly.reshape((1, 4, 2)), idx=i + 1) for i, poly in enumerate(polys)]
 def track_one(json_path: Path, output_dir: Path, max_duration: float | None,
              redo: bool) -> tuple[str, str]:
    """Track a single video. Returns (status, message). Run in subprocess.
    Statuses: "ok", "skip", "error".
    """
    # Re-import inside subprocess so each worker has its own ethoscope state.
    import sys as _sys
    _sys.path.insert(0, str(ETHOSCOPE_SRC))
    import cv2
    from ethoscope.core.monitor import Monitor
    from ethoscope.hardware.input.cameras import MovieVirtualCamera
    from ethoscope.io.sqlite import SQLiteResultWriter
    from ethoscope.trackers.multi_fly_tracker import MultiFlyTracker
    import time as _time
    class BGRMovieCamera(MovieVirtualCamera):
        """MovieVirtualCamera that keeps BGR frames AND retries on transient
        read failures.
        Two reasons for the override:
        1. MultiFlyTracker calls cv2.cvtColor(img, COLOR_BGR2GRAY) without
           checking whether img is already grayscale, so we must feed it
           3-channel input.
        2. cv2.VideoCapture.read() can return False on transient I/O hiccups
           (NFS contention when 8 workers pull big mp4s in parallel) without
           the file actually being at EOF. A naive "False -> StopIteration"
           handling makes the tracker silently exit mid-video and write a
           short, lying DB. We retry a few times and only treat persistent
           failures within the *interior* of the video as real EOF.
        """
        _retry_count = 5
        _retry_backoff_s = 0.25
        _eof_safety_frames = 50  # near end-of-file, treat False as legitimate
        def _next_image(self):
            for attempt in range(self._retry_count):
                ret, frame = self.capture.read()
                if ret and frame is not None:
                    return frame  # BGR, untouched
                # If we're near the genuine end of the file, accept it.
                if (
                    self._has_end_of_file
                    and self._frame_idx >= self._total_n_frames - self._eof_safety_frames
                ):
                    return None
                # Otherwise, this is a suspected transient hiccup — back off
                # and try again. The capture is still open; cv2 will pick up
                # the next decoded frame.
                _time.sleep(self._retry_backoff_s)
            return None  # truly persistent failure
    payload = json.loads(json_path.read_text())
    if payload.get("unusable"):
        reason = payload.get("reason") or "no reason given"
        return "skip", f"marked unusable: {reason}"
    video_path = Path(payload["video_path"])
    if not video_path.exists():
        return "error", f"video missing: {video_path}"
    out_db = output_dir / f"{video_path.stem}_tracking.db"
    if out_db.exists() and not redo:
        return "skip", f"DB exists: {out_db.name}"
    if out_db.exists():
        out_db.unlink()
    rois = build_rois_from_targets(payload["reference_points"])
    cam_kwargs = {"use_wall_clock": False}
    if max_duration is not None:
        cam_kwargs["max_duration"] = max_duration
    cam = BGRMovieCamera(str(video_path), **cam_kwargs)
    metadata = {
        "machine_id": payload.get("machine_uuid", "unknown"),
        "machine_name": payload.get("machine_name", "unknown"),
        "date_time": int(payload.get("session_epoch", 0)),
        "frame_width": cam.width,
        "frame_height": cam.height,
        "version": "offline-tracker-1",
        "experimental_info": "{}",
        "selected_options": json.dumps({
            "tracker": "MultiFlyTracker",
            "template": "HD_Mating_Arena_6_ROIS",
            "fg_data": HD_FG_DATA,
            "maxN": 2,
        }),
        "hardware_info": "{}",
        "reference_points": str([list(map(int, p)) for p in payload["reference_points"]]),
        "backup_filename": out_db.name,
        "result_writer_type": "SQLite3",
        "sqlite_source_path": str(out_db),
    }
    tracker_data = {
        "maxN": 2,
        "visualise": False,
        "fg_data": HD_FG_DATA,
        "adaptive_threshold": True,
        "min_fg_threshold": 10,
        "max_fg_threshold": 50,
    }
    db_credentials = {"name": str(out_db)}
    rw = SQLiteResultWriter(
        db_credentials, rois, metadata=metadata,
        make_dam_like_table=False, take_frame_shots=False, erase_old_db=True,
    )
    monit = Monitor(
        cam, MultiFlyTracker, rois,
        reference_points=payload["reference_points"],
        data=tracker_data,
    )
    try:
        with rw as result_writer:
            monit.run(result_writer=result_writer, drawer=None, verbose=False)
    except Exception:
        return "error", traceback.format_exc(limit=5)
    finally:
        try:
            cam._close()
        except Exception:
            pass
    if not out_db.exists():
        return "error", "tracking finished but DB was not created"
    # Post-tracking sanity check: did we cover most of the source video?
    # If not (cv2 retry exhausted, codec corruption, etc.), reject the DB so
    # it doesn't get cached as "done" — better an explicit failure than a
    # silent partial write.
    # NOTE: assumes a fixed 25 fps source, as do the picker's frame steps.
    expected_ms = (cam._total_n_frames / 25.0) * 1000.0
    if max_duration is not None:
        expected_ms = min(expected_ms, max_duration * 1000.0)
    completeness_threshold = 0.90  # require ≥ 90 % of expected duration
    # Use MAX(t) across all ROIs — a single ROI can run dry early if its fly
    # stops moving, so the latest detection anywhere in the arena is the
    # better signal of how far the iterator actually got.
    import sqlite3 as _sqlite3
    from contextlib import closing as _closing
    t_max = 0
    try:
        # closing() guarantees the connection is released even if a query fails.
        with _closing(_sqlite3.connect(f"file:{out_db}?mode=ro", uri=True)) as _con:
            for _i in range(1, 7):
                _v = _con.execute(f"SELECT MAX(t) FROM ROI_{_i}").fetchone()[0]
                if _v and _v > t_max:
                    t_max = _v
    except Exception:
        t_max = 0
    if expected_ms > 0 and t_max < expected_ms * completeness_threshold:
        out_db.unlink()
        for sidecar in (str(out_db) + "-wal", str(out_db) + "-shm"):
            Path(sidecar).unlink(missing_ok=True)
        ratio = t_max / expected_ms if expected_ms else 0
        return (
            "error",
            f"short output: t_max={t_max} ms vs expected {int(expected_ms)} ms "
            f"({ratio*100:.0f}%); DB removed",
        )
    return "ok", str(out_db)
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--redo", action="store_true", help="re-track even if DB exists")
    parser.add_argument("--jobs", type=int, default=1, help="parallel workers")
    parser.add_argument(
        "--max-duration", type=float, default=None,
        help="cap each video at this many seconds (default: full video)",
    )
    parser.add_argument("--limit", type=int, default=None, help="process only first N")
    parser.add_argument("--video", type=str, default=None,
                        help="track a single video (mp4 path); requires its target JSON")
    args = parser.parse_args()
    TRACKING_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    if args.video:
        stem = Path(args.video).stem
        json_path = TARGETS_DIR / f"{stem}.json"
        if not json_path.exists():
            sys.exit(f"No target JSON for {args.video}: expected {json_path}")
        jsons = [json_path]
    else:
        jsons = sorted(TARGETS_DIR.glob("*.json"))
    if args.limit:
        jsons = jsons[: args.limit]
    if not jsons:
        print("No target JSONs found. Run pick_targets.py first.")
        return
    print(f"Tracking {len(jsons)} videos (jobs={args.jobs}, redo={args.redo}).")
    n_ok = n_skip = n_err = 0
    if args.jobs <= 1:
        for jp in jsons:
            print(f"  → {jp.name}", flush=True)
            status, msg = track_one(jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo)
            print(f"    {status}: {msg.splitlines()[-1] if msg else ''}", flush=True)
            n_ok += status == "ok"
            n_skip += status == "skip"
            n_err += status == "error"
    else:
        with ProcessPoolExecutor(max_workers=args.jobs) as ex:
            futs = {
                ex.submit(track_one, jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo): jp
                for jp in jsons
            }
            for fut in as_completed(futs):
                jp = futs[fut]
                try:
                    status, msg = fut.result()
                except Exception as e:
                    status, msg = "error", f"future raised: {e}"
                print(f"  {jp.name}: {status} — {msg.splitlines()[-1] if msg else ''}",
                      flush=True)
                n_ok += status == "ok"
                n_skip += status == "skip"
                n_err += status == "error"
    print(f"\nDone. ok={n_ok}  skipped={n_skip}  errors={n_err}")
    sys.exit(0 if n_err == 0 else 1)
 if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    main()
--- a/scripts/tracking_geometry.py
+++ b/scripts/tracking_geometry.py
@ -0,0 +1,71 @@
 """Shared HD-mating-arena ROI geometry, used by both pick_targets.py
 (for live overlay) and track_videos.py (for actual tracking).
 Pure numpy + cv2; no ethoscope dependency.
 """
 from __future__ import annotations
 import itertools
 import cv2
 import numpy as np
 # Layout from
 # ethoscope/.../roi_builders/roi_templates/builtin/HD_Mating_Arena_6_ROIS.json
 HD_MATING_ARENA = {
    "n_rows": 2,
    "n_cols": 3,
    "top_margin": -0.21,
    "bottom_margin": -0.13,
    "left_margin": 0.05,
    "right_margin": 0.05,
    "horizontal_fill": 0.85,
    "vertical_fill": 1.3,
 }
 HD_FG_DATA = {
    "sample_size": 400,
    "normal_limits": [800, 2000],
    "tolerance": 0.8,
 }
 def compute_roi_polygons(reference_points, layout=HD_MATING_ARENA):
    """Map 3 L-shape reference points to 6 ROI polygons, in the order ROI 1..6.
    Reference points must be ordered:
        [TOP, CORNER, LEFT]
    matching ethoscope's dst_points = [(0, -1), (0, 0), (-1, 0)].
    Returns:
        list[np.ndarray]  # 6 arrays, each shape (4, 2), int32, in image coords
    """
    ref = np.asarray(reference_points, dtype=np.float32)
    if ref.shape != (3, 2):
        raise ValueError(f"reference_points must be 3x2, got shape {ref.shape}")
    dst_points = np.array([(0, -1), (0, 0), (-1, 0)], dtype=np.float32)
    wrap_mat = cv2.getAffineTransform(dst_points, ref)
    n_col = layout["n_cols"]
    n_row = layout["n_rows"]
    tm, bm = layout["top_margin"], layout["bottom_margin"]
    lm, rm = layout["left_margin"], layout["right_margin"]
    hf, vf = layout["horizontal_fill"], layout["vertical_fill"]
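    # Cell centres in normalized arena coordinates; the margins inset (or,
    # when negative, extend) the grid along each axis.
    y_positions = (np.arange(n_row) * 2.0 + 1) * (1 - tm - bm) / (2 * n_row) + tm
    x_positions = (np.arange(n_col) * 2.0 + 1) * (1 - lm - rm) / (2 * n_col) + lm
    # product() varies y fastest, so ROIs are numbered down each column first.
    centres = [np.array([x, y]) for x, y in itertools.product(x_positions, y_positions)]
    # Four corners of each rectangle, offset from its centre by half its size.
    sign_mat = np.array([[-1, -1], [+1, -1], [+1, +1], [-1, +1]])
    xy_size = np.array([hf / float(n_col), vf / float(n_row)]) / 2.0
    rectangles = [sign_mat * xy_size + c for c in centres]
    # r3 is padded with zeros (not ones), so only the linear part of wrap_mat
    # is applied below; `shift` then re-anchors normalized (1, 1) onto CORNER.
    shift = np.dot(wrap_mat, [1, 1, 0]) - ref[1]
    polys = []
    for r in rectangles:
        r3 = np.append(r, np.zeros((4, 1)), axis=1)
        mapped = np.dot(wrap_mat, r3.T).T - shift
        polys.append(mapped.astype(np.int32))
    return polys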
--- a/tasks/todo.md
+++ b/tasks/todo.md
@ -51,6 +51,90 @@ See `docs/bimodal_hypothesis.md` for detailed methodology.
 - [ ] Consider converting pixel distances to physical units (need calibration)
 - [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating
 ## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)
 ### Recap
 Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). The DBs in
 `data/raw/` use tracker `ConstrainedMultiFlyTracker` and template
 `HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).
 The metadata file `../all_video_info_merged.xlsx` indexes a different set of
 experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines,
 63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked
 sessions are in this xlsx — these are fresh recordings to track.**
 Inventory: see `data/metadata/video_inventory.csv` (built by
 `scripts/build_video_inventory.py`).
 - 1163 video sessions on disk under `/mnt/ethoscope_data/videos/`
 - 63/63 xlsx (date, machine) sessions have video on disk
 - 129 video instances need tracking (some (date, machine) have 2-4 recordings/day)
 ### Plan
 The HD-mating-arena videos have no auto-detectable targets — the user must
 manually click 3 reference points (L-shape: top, corner, left) per video. Once
 all targets are picked, tracking can run in the background.
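
As a rough illustration of why three clicks are enough: three point correspondences fix a 2-D affine map. The sketch below is plain Python and follows the `dst_points = [(0, -1), (0, 0), (-1, 0)]` convention from `tracking_geometry.py`. The name `affine_from_l_shape` is hypothetical, the click positions are made up, and the margin, fill and shift handling of `compute_roi_polygons` is deliberately left out.

```python
def affine_from_l_shape(top, corner, left):
    """Return a function mapping arena coords (u, v) to image pixels.

    Convention (taken from tracking_geometry.py): CORNER is (0, 0),
    TOP is (0, -1) and LEFT is (-1, 0) in arena coordinates.
    """
    # Basis vectors: one arena unit along u and along v, expressed in pixels.
    ex = (corner[0] - left[0], corner[1] - left[1])
    ey = (corner[0] - top[0], corner[1] - top[1])

    def to_image(u, v):
        return (
            corner[0] + u * ex[0] + v * ey[0],
            corner[1] + u * ex[1] + v * ey[1],
        )

    return to_image


# Made-up click positions, for illustration only.
to_image = affine_from_l_shape(top=(100, 50), corner=(100, 200), left=(20, 200))
```

Feeding the three arena-space anchors back through `to_image` should return the three clicked pixels, which is a quick sanity check on click order.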
 - [x] **Step 1 — Inventory**: `scripts/build_video_inventory.py` →
      `data/metadata/video_inventory.csv`. 63 (date,machine) sessions match
      the xlsx, all videos found, 129 video instances need tracking.
 - [x] **Step 2 — Manual target picker**: `scripts/pick_targets.py`. Loops over
      videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows
      a representative frame, captures 3 clicks (top, corner, left), saves
      `TARGETS_DIR/<video_basename>.json` (now under
      `/mnt/data/projects/cupido/targets/`). Skips videos already done.
 - [x] **Step 3 — Background tracker**: `scripts/track_videos.py`. Reads target
      JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs
      `MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes
      `TRACKING_OUTPUT_DIR/<basename>_tracking.db` (now under
      `/mnt/data/projects/cupido/tracked/`). Idempotent. Smoke-tested
      end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band.
 - [x] **Step 4 — Tracking deps**: `requirements-tracking.txt`.
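
The hand-off between Step 2 and Step 3 is the JSON sidecar. A minimal sketch of how a sidecar could be triaged before tracking, assuming only the fields that `pick_targets.py` writes (`unusable`, `reference_points`); the helper name `classify_sidecar` and the example values are hypothetical and not part of the scripts.

```python
import json


def classify_sidecar(payload):
    """Return "unusable", "ready" or "invalid" for a parsed target JSON."""
    if payload.get("unusable"):
        return "unusable"
    points = payload.get("reference_points")
    # A regular save stores exactly three [x, y] pairs: top, corner, left.
    if (
        isinstance(points, list)
        and len(points) == 3
        and all(isinstance(p, (list, tuple)) and len(p) == 2 for p in points)
    ):
        return "ready"
    return "invalid"


# Example payloads with made-up values, shaped like what the picker writes.
saved = json.loads(
    '{"video_path": "example.mp4", "frame_index": 125, '
    '"reference_points": [[100, 50], [100, 200], [20, 200]], '
    '"order": ["top", "corner", "left"]}'
)
flagged = {"video_path": "example.mp4", "unusable": True, "reason": "FOV wrong"}
```

Running such a check over the targets directory before a long `--jobs` run would surface malformed sidecars early, instead of as per-video tracking errors.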
 ### Still TODO
 - [ ] User to run `pick_targets.py` (interactive — needs DISPLAY) on the 129
      pending videos.
 - [ ] Run `track_videos.py --jobs 4` against the resulting JSONs.
 - [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that
      DO have visible targets (saves clicks). Confirmed not useful on the
      2025-07-15 batch — these arenas don't have black target dots — but worth
      trying on 2024 batches before falling back to manual.
 - [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4
      recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4).
      One of them is at lower resolution (1280x960) — likely an aborted take.
 ### Open questions / risks
 - Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on
  2024-09-17). Need to figure out which is the real "test" video vs aborted
  takes — possibly use video duration or filename pattern.
 - One mismatched-resolution file: `1280x960@25fps-20q` instead of
  `1920x1088@25fps-28q` — flag for inspection.
 - The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo;
  `MultiFlyTracker` is its likely successor. Validate output schema matches
  what the existing analysis pipeline expects (`load_roi_data.py`, etc.).
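
For the schema question, one possible starting point is a structural check of the ROI tables. The sketch below assumes only what `track_videos.py` itself relies on (tables `ROI_1` to `ROI_6`, each with a `t` column); the function name `roi_schema_problems` is hypothetical, and any further columns the analysis pipeline needs would have to be added to `required_columns` by whoever knows them.

```python
import sqlite3


def roi_schema_problems(con, n_rois=6, required_columns=("t",)):
    """List problems found in the ROI_<i> tables of an open SQLite connection."""
    problems = []
    for i in range(1, n_rois + 1):
        table = f"ROI_{i}"
        # PRAGMA table_info yields one row per column; no rows means no table.
        cols = [row[1] for row in con.execute(f"PRAGMA table_info({table})")]
        if not cols:
            problems.append(f"{table}: table missing")
            continue
        for col in required_columns:
            if col not in cols:
                problems.append(f"{table}: column {col!r} missing")
    return problems
```

An empty list means the tables look structurally compatible; it says nothing about whether the values themselves match what the old tracker produced.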
 ## Discovered During Work
-(Add new items here as they come up during analysis)
+### Barrier-opening annotation for the 2024 batch (added 2026-04-30)
 The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening
 event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file
 covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch
 (`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation
 yet, so all post-alignment cells silently exclude that data.
 - [ ] Build a small picker that lets the user scrub through each tracking
      DB / video and mark the barrier-opening frame, writing a row to a new
      `data/metadata/barrier_opening_2024.csv` (or extend the existing
      file with a date column).
 - [ ] Once the 2024 entries exist, update `align_to_opening_time` so it
      pulls from a unified `barrier_opening` table keyed by
      `(date, machine_name)` rather than `machine_name` alone.
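
A minimal sketch of the composite-key idea, in plain Python so it runs without the analysis dependencies. The column names `date`, `machine_name` and `opening_s`, the helper `load_openings`, and the rows themselves are all made up for illustration; the real table layout is still to be decided.

```python
import csv
import io


def load_openings(csv_text):
    """Index barrier-opening rows by (date, machine_name)."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["machine_name"])
        if key in table:
            raise ValueError(f"duplicate barrier-opening entry for {key}")
        table[key] = float(row["opening_s"])
    return table


# Made-up rows: the same machine on two dates no longer collides.
example = """date,machine_name,opening_s
2024-09-17,ETHOSCOPE_086,312.0
2025-07-15,ETHOSCOPE_086,298.5
"""
openings = load_openings(example)
```

Raising on duplicate keys is a design choice in this sketch: a silent overwrite would align some sessions to the wrong opening time.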
 ### Metadata vocabulary normalization (done 2026-04-30)
 The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`,
 `'untrained'` plus trailing whitespace). All sources now use a single
 canonical `'naive'`. Normalization happens in
 `scripts/export_video_db_index.py` so re-running it from the xlsx always
 produces a clean TSV. The 2025-07-15 legacy CSV
 (`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from
 `'untrained'` → `'naive'`.
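
A minimal sketch of what such a normalization step could look like. This is not the code in `scripts/export_video_db_index.py`; the function name `normalize_condition` is hypothetical, and lower-casing the label is an added assumption beyond the variants listed above.

```python
import unicodedata

# Control-label variants named in this note, after accent stripping.
_CONTROL_VARIANTS = {"naive", "niave", "untrained"}


def normalize_condition(raw):
    """Collapse known control-label variants to 'naive'; pass others through."""
    # NFKD splits 'ï' into 'i' plus a combining diaeresis, which is dropped.
    decomposed = unicodedata.normalize("NFKD", str(raw))
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    label = ascii_only.strip().lower()  # lower-casing is an assumption
    return "naive" if label in _CONTROL_VARIANTS else label
```

Because the output for every control variant is itself in the lookup set, applying the function twice gives the same result as applying it once, which is what makes re-running the export safe.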
Author	SHA1	Message	Date
Giorgio Gilestro	ec56e51bf9	Add beginner tutorial notebooks for incoming students Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 18:14:17 +01:00
Giorgio Gilestro	7d09523840	Move TARGETS_DIR to /mnt/data/projects/cupido/targets Targets relocated alongside the tracking DBs (out of ownCloud sync) so the docker mount already covers them and ownCloud no longer churns on JSON sidecars. Updated config, fixed a stale docstring in pick_targets, and dropped the now-moot data/targets/*.json gitignore rule. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 17:13:55 +01:00
Giorgio Gilestro	f60a9d0530	Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync - Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of ownCloud to avoid sync conflicts and bandwidth churn). config.py TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab mounts it world-readable for JupyterHub users. - New scripts/export_video_db_index.py joins all_video_info_merged.xlsx with the video inventory and the on-disk DBs, producing a TSV that has one row per fly/ROI plus training/testing video and DB paths. Handles approximate xlsx times, cross-day training/testing, the 12 AM/PM ambiguity, and date typos. - scripts/load_roi_data.py rewritten as a TSV-driven loader returning a single DataFrame with session and metadata columns. calculate_distances and the two flies_analysis notebooks migrated to use it; downstream trained/naive splits remain available via simple equality filters. - Metadata vocabulary canonicalized: {naïve, niave, untrained, test} all resolve to {trained, naive}. Normalization happens at the TSV-export boundary (idempotent); the xlsx and the 2025-07-15 legacy CSV were edited in place to remove the worst variants. - scripts/monitor_tracking.py rate calculation fixed: with N parallel workers, completions arrive in bursts; the old formula divided by burst width and reported nonsense rates. Now uses a 6 h window denominator. - scripts/track_videos.py: BGRMovieCamera retries cv2.read on transient NFS hiccups and a post-tracking completeness gate (≥ 90 % of expected duration via MAX(t) across all 6 ROIs) deletes silent partial DBs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 15:20:14 +01:00
Giorgio	e4da7691d5	Add offline tracking pipeline for video backlog The 2024 video set in all_video_info_merged.xlsx covers 63 (date, machine) sessions — 129 video instances — that have no auto-detectable targets, so ROI placement requires manual reference-point selection. This commit adds the three-stage pipeline that lets a user click for an hour, then walk away while the tracker grinds overnight: 1. build_video_inventory.py — scan /mnt/ethoscope_data/videos/ and join against the xlsx, producing data/metadata/video_inventory.csv 2. pick_targets.py — interactive matplotlib/Tk picker. User clicks TOP/CORNER/LEFT (the L-shape ethoscope expects); after the third click the 6 ROI rectangles are drawn on top of the frame so geometry can be verified before saving. Also supports marking a video 'unusable' (FOV wrong) so it's permanently skipped, frame stepping by ±1s/±5%/midpoint, point editing in --redo mode, and a crosshair cursor that survives matplotlib's per-motion cursor reset. 3. track_videos.py — headless batch tracker. Reads the JSON sidecars, builds 6 ROIs from the HD-mating-arena geometry, runs MultiFlyTracker against the merged.mp4 via MovieVirtualCamera, writes SQLite DBs to data/tracked/. Idempotent (skips done DBs), parallel via --jobs, subclasses MovieVirtualCamera so frames stay BGR (MultiFlyTracker calls cvtColor(BGR2GRAY) without checking channel count). Plus auto_detect_targets.py (fallback that runs ethoscope's auto-detector in case any videos do have visible target dots), monitor_tracking.py (progress + ETA from data/tracked/ ground truth, --watch for live view), and tracking_geometry.py (single source of truth for the affine math shared by picker and tracker). requirements-tracking.txt pins the extra deps (opencv-python, openpyxl, gitpython, netifaces, mysql-connector-python) — these are only needed for the tracking pipeline, not the existing analysis notebooks. 
Verified end-to-end on one of the user-picked videos: ~4000 rows/ROI in a 120s slice, fly bounding boxes in the expected 800-2000 px² band. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 17:25:26 +01:00