Compare commits

4 commits

Author SHA1 Message Date
ec56e51bf9 Add beginner tutorial notebooks for incoming students
Four guided notebooks under notebooks/getting_started/ aimed at someone
new to Python and data science. The series progresses: project orientation
→ Python/pandas crash course → exploring one tracking DB → first
trained-vs-naive comparison using load_roi_data + Mann-Whitney U.

Each notebook leans heavily on markdown explanations, includes exercises
with empty cells, and links out to canonical references (JupyterLab,
official Python tutorial, pandas 10-min guide, Wikipedia for stats
concepts).
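
For orientation, the trained-vs-naive comparison in the last notebook could
look roughly like the sketch below. The per-fly values are made up for
illustration, and the U statistic is computed by hand so the example has no
dependencies; a library routine such as `scipy.stats.mannwhitneyu` would
normally be used instead, since it also returns a p-value.

```python
def mann_whitney_u(xs, ys):
    """Return the U statistic of sample xs versus sample ys.

    U counts the (x, y) pairs where x > y; ties count as half a pair.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical per-fly median distances in pixels (not real data).
trained = [120.0, 135.5, 150.2, 98.4]
naive = [80.1, 95.0, 110.3, 101.7]

u_trained = mann_whitney_u(trained, naive)
u_naive = mann_whitney_u(naive, trained)

# The two statistics always sum to the number of pairs.
print(u_trained, u_naive, len(trained) * len(naive))
```

A U close to 0 or close to the number of pairs means the two groups barely
overlap; a U near half the number of pairs means they are well mixed.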

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 18:14:17 +01:00
7d09523840 Move TARGETS_DIR to /mnt/data/projects/cupido/targets
Targets relocated alongside the tracking DBs (out of ownCloud sync) so
the docker mount already covers them and ownCloud no longer churns on
JSON sidecars. Updated config, fixed a stale docstring in pick_targets,
and dropped the now-moot data/targets/*.json gitignore rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 17:13:55 +01:00
f60a9d0530 Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync
- Tracked DBs now live at /mnt/data/projects/cupido/tracked/ (out of
  ownCloud to avoid sync conflicts and bandwidth churn). config.py
  TRACKING_OUTPUT_DIR points there; the docker-compose for ethoscope-lab
  mounts it world-readable for JupyterHub users.
- New scripts/export_video_db_index.py joins all_video_info_merged.xlsx
  with the video inventory and the on-disk DBs, producing a TSV that has
  one row per fly/ROI plus training/testing video and DB paths. Handles
  approximate xlsx times, cross-day training/testing, the 12 AM/PM
  ambiguity, and date typos.
- scripts/load_roi_data.py rewritten as a TSV-driven loader returning a
  single DataFrame with session and metadata columns. calculate_distances
  and the two flies_analysis notebooks migrated to use it; downstream
  trained/naive splits remain available via simple equality filters.
- Metadata vocabulary canonicalized: {naïve, niave, untrained, test} all
  resolve to {trained, naive}. Normalization happens at the TSV-export
  boundary (idempotent); the xlsx and the 2025-07-15 legacy CSV were
  edited in place to remove the worst variants.
- scripts/monitor_tracking.py rate calculation fixed: with N parallel
  workers, completions arrive in bursts; the old formula divided by burst
  width and reported nonsense rates. Now uses a 6 h window denominator.
- scripts/track_videos.py: BGRMovieCamera retries cv2.read on transient
  NFS hiccups and a post-tracking completeness gate (≥ 90 % of expected
  duration via MAX(t) across all 6 ROIs) deletes silent partial DBs.
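
A minimal sketch of the completeness gate described in the last bullet,
assuming the `ROI_<n>` table layout and millisecond `t` column used
elsewhere in this project. The function names and the choice to take the
largest `MAX(t)` over the six tables are assumptions for illustration; the
sketch only reports the verdict, whereas track_videos.py deletes the DB.

```python
import sqlite3


def tracked_duration_ms(conn, n_rois=6):
    """Largest timestamp (ms) found in any ROI_<n> table, or 0 if all are empty."""
    latest = 0
    for roi in range(1, n_rois + 1):
        row = conn.execute(f"SELECT MAX(t) FROM ROI_{roi}").fetchone()
        if row[0] is not None:  # MAX() yields NULL on an empty table
            latest = max(latest, row[0])
    return latest


def is_complete(conn, expected_ms, min_fraction=0.9):
    """True if tracking reached at least min_fraction of the expected duration."""
    return tracked_duration_ms(conn) >= min_fraction * expected_ms


# Demo on an in-memory DB: tracking stopped at 110 s.
conn = sqlite3.connect(":memory:")
for roi in range(1, 7):
    conn.execute(f"CREATE TABLE ROI_{roi} (t INTEGER)")
conn.execute("INSERT INTO ROI_1 (t) VALUES (50000)")
conn.execute("INSERT INTO ROI_3 (t) VALUES (110000)")

complete_for_120s = is_complete(conn, expected_ms=120_000)
complete_for_200s = is_complete(conn, expected_ms=200_000)
print(complete_for_120s, complete_for_200s)
```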

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:20:14 +01:00
e4da7691d5 Add offline tracking pipeline for video backlog
The 2024 video set in all_video_info_merged.xlsx covers 63 (date, machine)
sessions — 129 video instances — that have no auto-detectable targets, so
ROI placement requires manual reference-point selection. This commit adds
the three-stage pipeline that lets a user click for an hour, then walk
away while the tracker grinds overnight:

  1. build_video_inventory.py — scan /mnt/ethoscope_data/videos/ and join
     against the xlsx, producing data/metadata/video_inventory.csv

  2. pick_targets.py — interactive matplotlib/Tk picker. User clicks
     TOP/CORNER/LEFT (the L-shape ethoscope expects); after the third
     click the 6 ROI rectangles are drawn on top of the frame so geometry
     can be verified before saving. Also supports marking a video
     'unusable' (FOV wrong) so it's permanently skipped, frame stepping
     by ±1s/±5%/midpoint, point editing in --redo mode, and a crosshair
     cursor that survives matplotlib's per-motion cursor reset.

  3. track_videos.py — headless batch tracker. Reads the JSON sidecars,
     builds 6 ROIs from the HD-mating-arena geometry, runs MultiFlyTracker
     against the merged.mp4 via MovieVirtualCamera, writes SQLite DBs to
     data/tracked/. Idempotent (skips done DBs), parallel via --jobs,
     subclasses MovieVirtualCamera so frames stay BGR (MultiFlyTracker
     calls cvtColor(BGR2GRAY) without checking channel count).
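
To illustrate how three clicks can place ROI rectangles, here is a sketch of
an affine map built from the TOP/CORNER/LEFT points. The convention (CORNER
as origin, TOP and LEFT as the unit axes), the function names, and the click
coordinates are assumptions for illustration; the actual arena geometry is
defined in tracking_geometry.py and is not reproduced here.

```python
def make_affine(top, corner, left):
    """Return a function mapping arena coordinates (a, b) to image pixels.

    Assumed convention: (0, 0) -> CORNER, (1, 0) -> TOP, (0, 1) -> LEFT.
    """
    ux, uy = top[0] - corner[0], top[1] - corner[1]
    vx, vy = left[0] - corner[0], left[1] - corner[1]

    def to_pixels(a, b):
        return (corner[0] + a * ux + b * vx, corner[1] + a * uy + b * vy)

    return to_pixels


def roi_corners(to_pixels, a_min, b_min, a_max, b_max):
    """Pixel corners of a rectangle given in arena units."""
    return [
        to_pixels(a_min, b_min),
        to_pixels(a_max, b_min),
        to_pixels(a_max, b_max),
        to_pixels(a_min, b_max),
    ]


# Hypothetical clicks on a 1920x1088 frame.
to_pixels = make_affine(top=(1700, 100), corner=(1700, 900), left=(100, 900))

# Hypothetical ROI covering the quarter of the arena nearest CORNER.
quarter = roi_corners(to_pixels, 0.0, 0.0, 0.5, 0.5)
print(quarter)
```

Because the map is affine, a slightly rotated or skewed camera view only
changes the three clicked points; the ROI layout in arena units stays fixed.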

Plus auto_detect_targets.py (fallback that runs ethoscope's auto-detector
in case any videos do have visible target dots), monitor_tracking.py
(progress + ETA from data/tracked/ ground truth, --watch for live view),
and tracking_geometry.py (single source of truth for the affine math
shared by picker and tracker).
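
A sketch of how a progress monitor can turn completion times into a rate and
an ETA. It divides by a fixed trailing window rather than by the spread of
the timestamps, so bursts from parallel workers do not inflate the rate. The
function names and the idea of passing finish times as a list (for example
DB modification times) are assumptions for illustration, not the script's
actual interface.

```python
from datetime import datetime, timedelta


def completion_rate_per_hour(finish_times, now, window_hours=6.0):
    """Completions per hour over a fixed trailing window ending at `now`."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [t for t in finish_times if cutoff <= t <= now]
    return len(recent) / window_hours


def eta_hours(remaining, rate_per_hour):
    """Hours left at the given rate, or None if nothing finished in the window."""
    if rate_per_hour <= 0:
        return None
    return remaining / rate_per_hour


# Three bursts of four workers finishing within a minute of each other.
now = datetime(2026, 4, 30, 12, 0, 0)
finish_times = [
    datetime(2026, 4, 30, hour, 0, 0) + timedelta(seconds=offset)
    for hour in (7, 9, 11)
    for offset in (0, 20, 40, 60)
]

rate = completion_rate_per_hour(finish_times, now)
eta = eta_hours(remaining=30, rate_per_hour=rate)
print(rate, eta)
```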

requirements-tracking.txt pins the extra deps (opencv-python, openpyxl,
gitpython, netifaces, mysql-connector-python) — these are only needed
for the tracking pipeline, not the existing analysis notebooks.

Verified end-to-end on one of the user-picked videos: ~4000 rows/ROI in
a 120s slice, fly bounding boxes in the expected 800-2000 px² band.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 17:25:26 +01:00
23 changed files with 3450 additions and 214 deletions

.gitignore (5 changed lines)

@@ -2,6 +2,11 @@
data/raw/*.db
data/processed/*.csv
# Offline-tracking outputs (regenerable from videos + target JSONs)
# DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
data/metadata/video_inventory.csv
data/logs/*.log
# Generated figures (reproducible from scripts)
figures/*.png


@@ -46,6 +46,32 @@ The key insight: not all "trained" flies may have actually learned. The trained
**Read `docs/bimodal_hypothesis.md` for the detailed analysis plan and code sketches.**
## Offline Tracking Pipeline (added Apr 2026)
For tracking new videos that have **no auto-detectable targets**, the pipeline
is split into three stages (inventory, interactive target picking, batch
tracking) so you can sit at the screen and click for an hour, then let the
tracker grind through overnight.
```bash
# extra deps (ethoscope src must be at /home/gg/Code/ethoscope_project/...)
pip install -r requirements-tracking.txt
# 1) build the inventory (xlsx ↔ /mnt/ethoscope_data/videos/)
python scripts/build_video_inventory.py
# 2) interactive: click TOP, CORNER, LEFT on each video (one frame per video)
python scripts/pick_targets.py # process all not-yet-picked
python scripts/pick_targets.py --redo # re-pick already-picked videos
# keys: r=reset n=skip f=jump frame q/ESC=quit ENTER=save
# 3) batch tracking (idempotent, can run in background)
python scripts/track_videos.py --jobs 4 # parallel
# output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite, same schema as data/raw/)
```
See `tasks/todo.md` "Offline Tracking" section for the full plan, and
`data/metadata/video_inventory.csv` for the list of videos to process.
## Folder Structure
```


@@ -1,37 +1,37 @@
date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
date,HHMMSS,machine_name,ROI,genotype,group,path,filesize_mb
15/07/2025,16-03-10,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,4,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,4,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,3,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-03-10,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-03-10/2025-07-15_16-03-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,59.4
15/07/2025,16-31-34,76,6,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,4,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,2,CS,trained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,5,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,3,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,1,CS,untrained,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,5,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,3,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-31-34,76,1,CS,naive,/mnt/ethoscope_data/videos/076e2825a7274661bd0697c42d6fa4c0/ETHOSCOPE_076/2025-07-15_16-31-34/2025-07-15_16-31-34_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged.mp4,78.98
15/07/2025,16-03-27,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-03-27,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-03-27/2025-07-15_16-03-27_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,78.72
15/07/2025,16-31-41,145,6,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,4,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,2,CS,trained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,5,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,3,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,1,CS,untrained,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,5,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,3,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-41,145,1,CS,naive,/mnt/ethoscope_data/videos/145bb573497a4e15b0690206748a3af6/ETHOSCOPE_145/2025-07-15_16-31-41/2025-07-15_16-31-41_145bb573497a4e15b0690206748a3af6__1920x1088@25fps-28q_merged.mp4,90.9
15/07/2025,16-31-52,139,6,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,4,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,2,CS,trained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,5,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,3,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,1,CS,untrained,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-32-05,268,6,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,4,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,2,CS,untrained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-31-52,139,5,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,3,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-31-52,139,1,CS,naive,/mnt/ethoscope_data/videos/13924be2046d49f4a641cef2a5559852/ETHOSCOPE_139/2025-07-15_16-31-52/2025-07-15_16-31-52_13924be2046d49f4a641cef2a5559852__1920x1088@25fps-28q_merged.mp4,73.4
15/07/2025,16-32-05,268,6,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,4,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,2,CS,naive,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,5,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,3,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72
15/07/2025,16-32-05,268,1,CS,trained,/mnt/ethoscope_data/videos/268102f92f51486f995200c29d980477/ETHOSCOPE_268/2025-07-15_16-32-05/2025-07-15_16-32-05_268102f92f51486f995200c29d980477__1920x1088@25fps-28q_merged.mp4,43.72



@@ -1,39 +1,47 @@
# Processed Data
Large CSV files generated from the analysis pipeline. All files are gitignored (~370MB total) and can be regenerated.
CSVs derived from the tracking DBs (`/mnt/data/projects/cupido/tracked/`)
and the merged TSV (`../../all_video_info_merged.tsv`). All files are
gitignored and regenerable.
## Files and Regeneration
| File | Description | Generated By |
|------|-------------|--------------|
| `trained_roi_data.csv` | Raw tracking data for trained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
| `untrained_roi_data.csv` | Raw tracking data for untrained ROIs | `scripts/load_roi_data.py` or notebook step 1 |
| `trained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
| `untrained_distances.csv` | Pairwise distances (unaligned) | `scripts/calculate_distances.py` |
| `trained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
| `untrained_distances_aligned.csv` | Distances aligned to barrier opening | Notebook step 4 |
| `trained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
| `untrained_tracked.csv` | Identity-tracked fly positions | Notebook step 7 |
| `trained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
| `untrained_max_velocity.csv` | Max velocity over 10s windows | Notebook step 7 |
| `distances.csv` | Per-frame inter-fly distances for every (date, machine, ROI, session). Includes metadata columns to filter trained vs naïve, training phase, species, etc. | `scripts/calculate_distances.py` |
| `*_distances_aligned.csv` | (legacy, 2025-07-15 only) distances aligned to barrier opening | `notebooks/flies_analysis*.ipynb` |
| `*_tracked.csv` | (legacy) identity-tracked fly positions | `notebooks/flies_analysis_simple.ipynb` |
| `*_max_velocity.csv` | (legacy) max velocity over 10 s windows | `notebooks/flies_analysis_simple.ipynb` |
## To Regenerate All Data
## Loading the data
Run the full notebook `notebooks/flies_analysis_simple.ipynb` with:
```python
recalculate_distances = True
recalculate_tracking = True
import sys
sys.path.insert(0, "../scripts")
from load_roi_data import load_roi_data
data = load_roi_data() # full batch as one DataFrame
# Or filter the metadata first:
import pandas as pd
tsv = pd.read_csv("../../all_video_info_merged.tsv", sep="\t")
data = load_roi_data(tsv[tsv.species.str.contains("Melanogaster")])
```
**Warning**: Identity tracking and velocity calculations take significant time (~30+ minutes).
The returned DataFrame has columns:
`id, t, x, y, w, h, phi, is_inferred, has_interacted, session, ROI, date,
machine_name, species, male, training_date_time, testing_date_time,
training_length_hr, consolidation_length_hr, memory, age`.
## Column Reference
`session` is `"training"` or `"testing"`; `male` is `"trained"` or
`"naive"` (canonical — variants like `"naïve"` and `"niave"` are normalized
at the TSV-export step).
### Distance CSVs (`*_distances_aligned.csv`)
- `machine_name`: Ethoscope machine ID (string)
- `ROI`: ROI number (1-6)
- `aligned_time`: Time in ms relative to barrier opening (0 = opening)
- `distance`: Euclidean distance between flies in pixels
- `n_flies`: Number of flies detected at this time point
- `area_fly1`, `area_fly2`: Bounding box areas (w*h) in pixels^2
- `group`: "trained" or "untrained"
## Column Reference (`distances.csv`)
- `date`, `machine_name`, `ROI`, `session`: identifies one fly trajectory
- `t`: time in ms within that session
- `distance`: Euclidean distance between the two flies in pixels
- `n_flies`: number of fly detections at this frame (1 or 2)
- `area_fly1`, `area_fly2`: bounding-box areas (`w * h`) in pixels²
- `male`: `trained` or `naive` (carried from the xlsx; normalized)
- `species`, `memory`, `age`: experimental metadata
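
The normalization of the `male` column could be sketched as below. This is an
illustration rather than the export script's code: the function name is
hypothetical, and only the variants whose target is clear from this project
(`naïve`, `niave`, `untrained`) are mapped. The commit message also lists
`test` as a variant without saying what it resolves to, so the sketch rejects
it instead of guessing.

```python
import unicodedata

# Spellings that mean "naive"; "untrained" is the legacy label.
_NAIVE_VARIANTS = {"naive", "niave", "untrained"}


def normalize_group(label):
    """Map a raw label to 'trained' or 'naive'; raise on anything else."""
    # Decompose accented characters and drop the accents (naïve -> naive).
    decomposed = unicodedata.normalize("NFKD", label)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = unaccented.strip().lower()
    if key == "trained":
        return "trained"
    if key in _NAIVE_VARIANTS:
        return "naive"
    raise ValueError(f"unrecognized group label: {label!r}")


raw = ["trained", "naïve", " Niave ", "untrained", "naive"]
canonical = [normalize_group(value) for value in raw]
print(canonical)
```

Applying the function to its own output returns the same value, which is what
makes the step idempotent.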


@@ -28,7 +28,22 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "def load_roi_data():\n \"\"\"Load ROI data from SQLite databases and group by trained/untrained\"\"\"\n metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')\n metadata['machine_name'] = metadata['machine_name'].astype(str)\n \n trained_rois = metadata[metadata['group'] == 'trained']\n untrained_rois = metadata[metadata['group'] == 'untrained']\n \n db_files = list(DATA_RAW.glob('*_tracking.db'))\n \n trained_df = pd.DataFrame()\n untrained_df = pd.DataFrame()\n \n for db_file in db_files:\n print(f\"Processing {db_file.name}\")\n \n pattern = r'_([0-9a-f]{32})__'\n match = re.search(pattern, db_file.name)\n \n if not match:\n print(f\"Could not extract UUID from {db_file.name}\")\n continue\n \n uuid = match.group(1)\n metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]\n \n if metadata_matches.empty:\n print(f\"No metadata matches found for UUID {uuid}\")\n continue\n \n machine_id = metadata_matches.iloc[0]['machine_name']\n print(f\"Matched to machine ID: {machine_id}\")\n \n conn = sqlite3.connect(str(db_file))\n \n machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]\n machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]\n \n for _, row in machine_trained.iterrows():\n roi = row['ROI']\n try:\n roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n roi_data['machine_name'] = machine_id\n roi_data['ROI'] = roi\n roi_data['group'] = 'trained'\n trained_df = pd.concat([trained_df, roi_data], ignore_index=True)\n except Exception as e:\n print(f\"Error loading ROI_{roi}: {e}\")\n \n for _, row in machine_untrained.iterrows():\n roi = row['ROI']\n try:\n roi_data = pd.read_sql_query(f\"SELECT * FROM ROI_{roi}\", conn)\n roi_data['machine_name'] = machine_id\n roi_data['ROI'] = roi\n roi_data['group'] = 'untrained'\n untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)\n except Exception as e:\n print(f\"Error loading ROI_{roi}: 
{e}\")\n \n conn.close()\n \n return trained_df, untrained_df\n\ntrained_data, untrained_data = load_roi_data()\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\n\ntrained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)\nuntrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)\nprint(\"Data saved to CSV files\")"
"source": [
"# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n",
"# Reason: replaces the old data/raw + 2025_07_15_metadata_fixed.csv path with\n",
"# the TSV-based loader that covers the entire batch (2025-07-15 + 2024).\n",
"sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n",
"from load_roi_data import load_roi_data\n",
"\n",
"data = load_roi_data()\n",
"# Backwards-compat slices for the rest of the notebook.\n",
"trained_data = data[data['male'] == 'trained'].copy()\n",
"untrained_data = data[data['male'] == 'naive'].copy()\n",
"\n",
"print(f\"all data: {data.shape}\")\n",
"print(f\"trained: {trained_data.shape}\")\n",
"print(f\"naive: {untrained_data.shape}\")\n"
]
},
{
"cell_type": "markdown",


@ -28,7 +28,22 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Load the pre-processed data\ntrained_data = pd.read_csv(DATA_PROCESSED / 'trained_roi_data.csv')\nuntrained_data = pd.read_csv(DATA_PROCESSED / 'untrained_roi_data.csv')\n\nprint(f\"Trained data shape: {trained_data.shape}\")\nprint(f\"Untrained data shape: {untrained_data.shape}\")\nprint(f\"Trained data columns: {list(trained_data.columns)}\")\nprint(f\"Untrained data columns: {list(untrained_data.columns)}\")"
"source": [
"# Load tracking data via the unified loader (driven by all_video_info_merged.tsv).\n",
"# Reason: replaces reads of trained_roi_data.csv / untrained_roi_data.csv with\n",
"# the live loader so the notebook always sees the current batch.\n",
"sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))\n",
"from load_roi_data import load_roi_data\n",
"\n",
"data = load_roi_data()\n",
"trained_data = data[data['male'] == 'trained'].copy()\n",
"untrained_data = data[data['male'] == 'naive'].copy()\n",
"\n",
"print(f\"all data shape: {data.shape}\")\n",
"print(f\"Trained data: {trained_data.shape}\")\n",
"print(f\"Naive data: {untrained_data.shape}\")\n",
"print(f\"Columns: {list(trained_data.columns)}\")\n"
]
},
{
"cell_type": "markdown",


@ -0,0 +1,255 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 00 \u00b7 Welcome to the Cupido fly-tracking project\n",
"\n",
"Hi! You're about to start working on a project that studies how *Drosophila*\n",
"(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n",
"flies behave differently from na\u00efve ones in their later courtship.\n",
"\n",
"**You don't need any prior experience with Python or data science to follow\n",
"along.** This series of notebooks will walk you through everything, one\n",
"small step at a time.\n",
"\n",
"> **How to read these notebooks**: each notebook is split into \"cells\".\n",
"> Some cells are explanations (like this one), others are code that you\n",
"> can **run** by clicking on the cell and pressing `Shift + Enter`. Try it\n",
"> on the next cell.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# This is a code cell. Click on it and press Shift+Enter to run it.\n",
"print(\"Hello, fly world!\")\n",
"1 + 1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should have seen `Hello, fly world!` printed and the number `2`\n",
"appear underneath. If something else happened, ask Giorgio \u2014 that's a\n",
"sign the environment isn't set up right.\n",
"\n",
"If this is the very first time you're using JupyterLab, take 10 minutes\n",
"to read the [official \"Getting started with JupyterLab\"\n",
"guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).\n",
"The most important things to know are:\n",
"\n",
"- A notebook (`.ipynb` file) is a sequence of **cells**.\n",
"- Each cell is either **Markdown** (formatted text, like this) or **Code**\n",
" (Python that the computer runs).\n",
"- The **kernel** is the running Python process behind the notebook. It\n",
" remembers everything you've defined. If something gets weird, restart\n",
" the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n",
"- `Shift + Enter` runs a cell and moves to the next one.\n",
"- `Ctrl + Enter` runs a cell and stays put.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is the project about?\n",
"\n",
"Drosophila males court females with a stereotyped sequence (chasing,\n",
"wing-extension, tapping). When a male is rejected by a female (e.g.\n",
"because she's already mated), he **learns** to suppress his courtship \u2014\n",
"even toward new, receptive females, for a while. This is a textbook\n",
"example of *non-associative learning* in invertebrates ([review on\n",
"PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n",
"\n",
"The lab is interested in:\n",
"\n",
"- Does this learning **transfer across species**? (We have ~7 *Drosophila*\n",
" species recorded.)\n",
"- How long does the memory last? (training_length_hr,\n",
" consolidation_length_hr columns in the metadata.)\n",
"- Are there **individual differences** \u2014 do some males learn while others\n",
" don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n",
"\n",
"Your job, broadly, will be to **turn videos of flies into numbers and\n",
"plots that answer these questions.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How an experiment works (the bird's-eye view)\n",
"\n",
"1. **Training**: a male fly is placed with a non-receptive (mated) female.\n",
" He courts, gets rejected, eventually gives up.\n",
"2. *Wait* for some hours (the \"consolidation\" period \u2014 gives memory time\n",
" to form).\n",
"3. **Testing**: same male is placed with a fresh receptive female.\n",
" Does he court her vigorously, or has he learned to give up easily?\n",
"\n",
"Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n",
"6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n",
"contains one couple (a male and a female). A camera films the whole arena\n",
"from above. So one **video** gives us 6 simultaneous experiments.\n",
"\n",
"The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n",
"behavioural recording boxes built in this lab. Each ethoscope is a\n",
"machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What does the data look like?\n",
"\n",
"For each video, the **tracker** (a piece of software that runs after the\n",
"recording) finds the flies frame-by-frame and writes their positions to a\n",
"**SQLite database** (a single file, ending in `.db`). One DB per video.\n",
"Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n",
"one per sub-arena. Each row of an ROI table is **one fly detection at one\n",
"moment in time** with these columns:\n",
"\n",
"| column | meaning |\n",
"|---|---|\n",
"| `id` | row number (auto-incremented) |\n",
"| `t` | time in **milliseconds** since the video started |\n",
"| `x`, `y` | fly position in **pixels** (top-left corner of the image is 0,0) |\n",
"| `w`, `h` | width and height of the bounding box around the fly, in pixels |\n",
"| `phi` | orientation angle of the fly |\n",
"| `is_inferred` | 1 if the position was guessed (not directly seen), 0 otherwise |\n",
"| `has_interacted` | (legacy column, mostly unused) |\n",
"\n",
"If a single ROI has two flies that the tracker can see, you'll get **two\n",
"rows with the same `t`** \u2014 one for each fly. If only one fly is detected\n",
"(maybe they're on top of each other), you'll get one row.\n",
"\n",
"That's the heart of the data. Everything else (distances, velocities,\n",
"group comparisons) is computed from these (t, x, y) traces.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where everything lives\n",
"\n",
"Take a moment to memorize these locations \u2014 you'll come back to them often.\n",
"\n",
"| what | where |\n",
"|---|---|\n",
"| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n",
"| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n",
"| Source video files | `/mnt/ethoscope_data/videos/` |\n",
"| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n",
"| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n",
"| Your notebooks | `notebooks/getting_started/` (this folder) |\n",
"\n",
"Let's verify a couple of these from inside Python:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"targets = Path(\"/mnt/data/projects/cupido/targets\")\n",
"\n",
"n_dbs = len(list(tracked.glob(\"*_tracking.db\")))\n",
"n_jsons = len(list(targets.glob(\"*.json\")))\n",
"\n",
"print(f\"Tracking DBs available: {n_dbs}\")\n",
"print(f\"Target JSONs available: {n_jsons}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see roughly 113 tracking DBs and 130 target JSONs. If those\n",
"numbers are zero, the storage volume isn't mounted \u2014 ask Giorgio.\n",
"\n",
"> **Note**: the tracking DBs are read-only inside the JupyterLab\n",
"> container. You can read them but not modify or delete them. That's a\n",
"> deliberate safety measure \u2014 we don't want analysis code accidentally\n",
"> corrupting the source data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Glossary (refer back as needed)\n",
"\n",
"- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n",
" arena. There are 6 ROIs per video, numbered 1\u20136.\n",
"- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n",
" same ROI at the same time = two rows with the same `t`.\n",
"- **trained** \u2014 the male had a training session before testing.\n",
"- **naive** \u2014 the male is a control (no training).\n",
"- **training session** \u2014 the recording where the male meets the\n",
" non-receptive female (he gets rejected).\n",
"- **testing session** \u2014 the recording where the male meets a fresh\n",
" receptive female (we measure his courtship).\n",
"- **t (milliseconds)** \u2014 time within one session, starting at 0.\n",
"- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n",
" grows to the right, y grows **downward** (this is the image-coordinate\n",
" convention, opposite of math class).\n",
"- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n",
" `ETHOSCOPE_076`.\n",
"- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
" `Erecta`, `Willistoni`, or `CS`.\n",
"\n",
"If you bump into other terms in the code, ask. Don't guess \u2014 biology\n",
"codebases pick up jargon over the years.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next\n",
"\n",
"When you're ready, open these notebooks **in order**:\n",
"\n",
"1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n",
" read and manipulate tabular data.\n",
"2. `02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n",
" trajectory, see what the numbers actually look like.\n",
"3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n",
" comparing groups of flies.\n",
"\n",
"After those, the notebooks one level up (`flies_analysis.ipynb`,\n",
"`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n",
"previous student built \u2014 those will make sense once you've worked\n",
"through the tutorials.\n",
"\n",
"Don't try to power through all of them in one sitting. Run a few cells,\n",
"read the explanation, **change a number** to see what happens, **break\n",
"something on purpose** to see the error message. That's how you learn.\n"
]
}
]
}
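
The table layout described in the "What does the data look like?" cell can be tried without access to the real databases. The sketch below builds an in-memory SQLite table named `ROI_1` with the documented columns and a few invented rows, then counts detections per time point. The column types are an assumption (the notebook lists the column names but not their SQL types), and all values are made up.

```python
import sqlite3

# In-memory stand-in for one tracking DB; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ROI_1 ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "t INTEGER, x INTEGER, y INTEGER, w INTEGER, h INTEGER, "
    "phi INTEGER, is_inferred INTEGER, has_interacted INTEGER)"  # types assumed
)
rows = [
    # t (ms), x, y, w, h, phi, is_inferred, has_interacted
    (0, 100, 50, 12, 6, 10, 0, 0),   # two detections at t = 0
    (0, 180, 60, 11, 6, 95, 0, 0),
    (40, 102, 52, 20, 9, 12, 0, 0),  # a single detection at t = 40
    (80, 105, 55, 12, 6, 15, 0, 0),  # two detections again at t = 80
    (80, 170, 58, 11, 6, 90, 0, 0),
]
conn.executemany(
    "INSERT INTO ROI_1 (t, x, y, w, h, phi, is_inferred, has_interacted) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Count detections per time point.
detections_per_t = dict(
    conn.execute("SELECT t, COUNT(*) FROM ROI_1 GROUP BY t ORDER BY t").fetchall()
)
conn.close()

print(detections_per_t)
```

Time points with a count of 2 are the ones where both flies were detected separately; a count of 1 corresponds to the case the notebook mentions, where only one fly is detected (for example because the two overlap).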


@ -0,0 +1,500 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
"\n",
"If you've never programmed before, don't try to memorize the syntax.\n",
"Just run each cell, read what it does, and come back when you're stuck on\n",
"something specific. The cheat sheet at the end is the only thing worth\n",
"keeping handy.\n",
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
"- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Variables\n",
"\n",
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
"not equality. Read it as \"make `name` refer to `value`\".\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"x = 5\n",
"y = 3\n",
"total = x + y\n",
"print(total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
"answer. Try it.\n",
"\n",
"Variable names: lowercase letters, digits, and underscores. They can't\n",
"start with a digit. Convention is `snake_case`: `mean_distance`, not\n",
"`meanDistance` or `MeanDistance`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Strings and numbers\n",
"\n",
"A **string** is text in quotes. You can join strings with `+`. You can\n",
"turn a number into a string with `str()`, and vice-versa with `int()` /\n",
"`float()`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"species = \"Drosophila melanogaster\"\n",
"n_flies = 12\n",
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Lists\n",
"\n",
"A list is an ordered collection of things. Square brackets, items\n",
"separated by commas. You can mix types (but usually shouldn't).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0]) # first item \u2014 Python counts from 0!\n",
"print(machines[-1]) # last item\n",
"print(len(machines)) # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dictionaries\n",
"\n",
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
"pairs.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
"print(fly[\"species\"])\n",
"print(fly[\"age_days\"])\n",
"fly[\"alive\"] = False # add a new key\n",
"print(fly)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conditions: if / elif / else\n",
"\n",
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
"Combine with `and`, `or`, `not`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"\n",
"if distance_px < 50:\n",
" label = \"close\"\n",
"elif distance_px < 200:\n",
" label = \"medium\"\n",
"else:\n",
" label = \"far\"\n",
"\n",
"print(label)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Loops\n",
"\n",
"`for x in collection:` runs the indented block once per item.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"for m in machines:\n",
" print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
" print(f\"{i}: {m}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Functions\n",
"\n",
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
"sends a value back to whoever called it.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"def fly_age_in_weeks(days):\n",
" \"\"\"Return age in weeks given age in days.\"\"\"\n",
" return days / 7\n",
"\n",
"print(fly_age_in_weeks(14)) # 2.0\n",
"print(fly_age_in_weeks(5)) # 0.714\u2026\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Importing libraries\n",
"\n",
"A library is somebody else's code. We use `import` to pull it into our\n",
"notebook.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import math\n",
"print(math.sqrt(16)) # 4.0\n",
"print(math.pi)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number \u2014 it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** \u2014 a single column with a name.\n",
"- **`DataFrame`** \u2014 a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Read the project's metadata TSV (Tab-Separated Values).\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"\n",
"# How big is it?\n",
"print(f\"Rows: {len(df)}\")\n",
"print(f\"Columns: {df.shape[1]}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Looking at the table\n",
"\n",
"`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n",
"column names. `.dtypes` shows the type of each column.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"print(\"Column names:\")\n",
"for c in df.columns:\n",
" print(f\" {c}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Selecting columns\n",
"\n",
"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name has no spaces or weird characters.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[\"species\"].head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.species.value_counts() # how many rows per species\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Selecting multiple columns\n",
"\n",
"Pass a **list** of names inside the brackets:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filtering rows\n",
"\n",
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
"Pandas keeps the rows where it's `True`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained = df[df[\"male\"] == \"trained\"]\n",
"print(f\"trained rows: {len(trained)}\")\n",
"\n",
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Grouping and counting\n",
"\n",
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
"splits the table by the values in that column and computes something per\n",
"group.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How many ROIs per (species, training condition)?\n",
"df.groupby([\"species\", \"male\"]).size()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Quick plots\n",
"\n",
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# How many rows per machine?\n",
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
"plt.ylabel(\"rows\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Exercises\n",
"\n",
"Don't skip these. They're how you find out what you actually understood.\n",
"\n",
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
" (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 1 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 2 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 3 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 4 here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cheat sheet\n",
"\n",
"```python\n",
"import pandas as pd\n",
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n",
"df.head(); df.tail(); df.shape; df.columns # peek\n",
"df[\"col\"]; df[[\"a\", \"b\"]] # select\n",
"df[df[\"col\"] == \"value\"] # filter\n",
"df.groupby(\"col\").size() # count per group\n",
"df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n",
"df[\"col\"].value_counts() # quick counts\n",
"df[\"col\"].unique() # unique values\n",
"df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n",
"df.sort_values(\"col\", ascending=False) # sort\n",
"df.plot(...) # quick plot\n",
"```\n",
"\n",
"Keep this list open when reading other people's code. Most of pandas is\n",
"just combinations of these primitives. When you need more, the official\n",
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
"is excellent.\n"
]
}
]
}
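
Section 2 of the notebook mentions `str()`, `int()` and `float()`, but its code cell only exercises `str()`. A small sketch of the conversions in both directions, with invented values, might look like this:

```python
n_flies = 12
label = str(n_flies)          # number -> string
roi_number = int("3")         # string -> integer
distance_px = float("42.5")   # string -> floating-point number

# Joining a string and a number with + raises TypeError
# until the number is converted with str().
try:
    message = "ROI " + roi_number
except TypeError:
    message = "ROI " + str(roi_number)

print(label, roi_number, distance_px, message)
```

The `try` / `except` block is only there to show the error safely; in everyday code an f-string such as `f"ROI {roi_number}"` avoids the problem altogether.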


@ -0,0 +1,439 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 02 \u00b7 A first look at one tracking database\n",
"\n",
"In this notebook we open **one** of the SQLite databases that the tracker\n",
"produced and look at what's actually inside. By the end you'll be able to:\n",
"\n",
"- list the tables in a `.db` file\n",
"- read one ROI's tracking trace into a DataFrame\n",
"- plot a fly's path through the arena\n",
"- count how many flies are visible at each moment\n",
"- compute a simple distance between the two flies in a ROI\n",
"\n",
"If you're curious how SQLite works, the\n",
"[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n",
"worth reading. For our purposes, **SQLite is just a file that contains\n",
"several tables you can query like a DataFrame**.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We import the libraries we need. `sqlite3` is part of Python's standard\n",
"library \u2014 no install needed.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sqlite3\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find the databases\n",
"\n",
"The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n",
"\n",
"print(f\"Found {len(db_files)} tracking DBs.\")\n",
"print(\"\\nFirst 5 by name:\")\n",
"for db in db_files[:5]:\n",
" print(f\" {db.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filename encodes the date, time, machine UUID, video resolution, and\n",
"the suffix `_tracking.db`. For example:\n",
"\n",
"```\n",
"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
"\u2514\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u252c\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
" date time machine UUID video format\n",
"```\n",
"\n",
"Pick one to explore. Feel free to change the index.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"db_path = db_files[0]\n",
"print(\"Working with:\", db_path.name)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open the database\n",
"\n",
"We open it **read-only** as a safety measure. The `?mode=ro` flag is\n",
"SQLite's read-only switch.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What tables are inside?\n",
"\n",
"Every SQLite database has a system table called `sqlite_master` that\n",
"lists everything. We can query it like any other table.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tables = pd.read_sql_query(\n",
" \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n",
")\n",
"tables\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n",
"sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
"`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
"\n",
"## Read one ROI\n",
"\n",
"`pd.read_sql_query()` runs an SQL query against the connection and\n",
"returns a DataFrame. The query `SELECT * FROM ROI_1` means *\"give me all\n",
"columns and all rows from the table called ROI_1\"*.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n",
"print(f\"shape: {roi1.shape}\") # (rows, columns)\n",
"roi1.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding the columns\n",
"\n",
"Refer back to notebook `00_welcome` for the full column reference. Quick\n",
"recap of the important ones:\n",
"\n",
"- `t`: time in **milliseconds** since the video started.\n",
"- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
" **top-left** corner. y grows downward.\n",
"- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
" rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n",
" frames where the tracker merged two flies into one big detection.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Quick descriptive stats\n",
"roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum `t` should be 0 (start of the video). The maximum tells you\n",
"how long the recording was. Convert ms to minutes by dividing by 60000:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"duration_min = roi1[\"t\"].max() / 60_000\n",
"print(f\"Session length: {duration_min:.1f} minutes\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many flies per frame?\n",
"\n",
"If two flies are visible in this ROI, we get **two rows per `t`**. Let's\n",
"check.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"flies_per_frame = roi1.groupby(\"t\").size()\n",
"print(flies_per_frame.value_counts().sort_index())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
"had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
"overlapping or one is occluded \u2014 that's something we'll handle properly\n",
"in the next notebook.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot one fly's trajectory\n",
"\n",
"We'll plot the position over the first 5 minutes (300 000 ms). For\n",
"clarity we'll only look at frames where there were 2 flies and pick the\n",
"**first** of the two (sorted by `id`) as \"fly 1\" \u2014 this is a rough\n",
"heuristic; identity tracking is harder than it sounds.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Filter to the first 5 minutes\n",
"sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n",
"\n",
"# Pick \"fly 1\" by taking the first row at each time point\n",
"fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n",
"plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n",
"plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n",
"plt.gca().invert_yaxis() # because pixel y grows downward\n",
"plt.xlabel(\"x (pixels)\")\n",
"plt.ylabel(\"y (pixels)\")\n",
"plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n",
"plt.legend()\n",
"plt.axis(\"equal\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see a tangle of lines confined to a roughly rectangular ROI.\n",
"That tangle is the fly walking around its sub-arena.\n",
"\n",
"Notice we did `plt.gca().invert_yaxis()` \u2014 that's because in image\n",
"coordinates y grows downward, but humans expect plots where y grows\n",
"upward. Without it the plot would be vertically flipped.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot position over time\n",
"\n",
"A trajectory plot collapses time into \"shape on a page\". To see *when*\n",
"things happen we need time on the x-axis.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n",
"\n",
"axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
"axes[0].set_ylabel(\"x (px)\")\n",
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n",
"\n",
"axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
"axes[1].set_ylabel(\"y (px)\")\n",
"axes[1].set_xlabel(\"time (s)\")\n",
"axes[1].invert_yaxis()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bursts of variation = active fly. Long flat stretches = the fly is sitting\n",
"still. You'll come to recognize courtship vs idling by eye after a while.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distance between the two flies\n",
"\n",
"Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
"Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"two_fly_frames = roi1.groupby(\"t\").filter(lambda g: len(g) == 2)\n",
"two_fly_frames = two_fly_frames.sort_values([\"t\", \"id\"])\n",
"\n",
"# Pivot so each row is one timepoint with x1, y1, x2, y2\n",
"def pair_up(g):\n",
"    g = g.reset_index(drop=True)\n",
"    return pd.Series({\n",
"        \"x1\": g.loc[0, \"x\"], \"y1\": g.loc[0, \"y\"],\n",
"        \"x2\": g.loc[1, \"x\"], \"y2\": g.loc[1, \"y\"],\n",
"    })\n",
"\n",
"paired = two_fly_frames.groupby(\"t\").apply(pair_up).reset_index()\n",
"paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n",
"paired.head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"plt.figure(figsize=(12, 4))\n",
"plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n",
"plt.xlabel(\"time (s)\")\n",
"plt.ylabel(\"inter-fly distance (px)\")\n",
"plt.title(\"Distance between the two flies in ROI 1\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the kind of trace that drives the rest of the analysis: a male\n",
"courting a female stays close (small distance); a male giving up wanders\n",
"off (large distance). The shape of this curve is the behavioural readout.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Don't forget to close the connection\n",
"\n",
"If you opened a connection, close it when you're done. (Not strictly\n",
"necessary in a notebook \u2014 Python tidies up \u2014 but a good habit.)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn.close()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n",
" and re-run the trajectory plot. Is the arena bigger / smaller? Why\n",
" might that be? (Hint: look at the resolution part of the filename.)\n",
"2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
"3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
"4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
" for fly 1 \u2014 when does the bounding box get unusually large?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Exercise space\n"
]
}
]
}
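
The `groupby("t").apply(pair_up)` step in notebook 02 calls a Python function once per frame. A minimal vectorized sketch of the same pairing on a toy table, assuming the same `t`, `id`, `x`, `y` columns as `roi1` (the name `toy` and its values are made up for illustration and are not project data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for roi1: three time points; t=200 has only one detection.
toy = pd.DataFrame({
    "t":  [100, 100, 200, 300, 300],
    "id": [1, 2, 1, 2, 1],
    "x":  [0.0, 3.0, 5.0, 10.0, 10.0],
    "y":  [0.0, 4.0, 5.0, 2.0, 8.0],
})

# Keep only frames with exactly two detections, without a per-group lambda.
n_per_frame = toy.groupby("t")["id"].transform("size")
two = toy[n_per_frame == 2].sort_values(["t", "id"])

# After sorting, the first row of each frame is "fly 1" and the last is "fly 2".
grouped = two.groupby("t")
first = grouped[["x", "y"]].first()
last = grouped[["x", "y"]].last()

paired = pd.DataFrame({
    "distance_px": np.hypot(first["x"] - last["x"], first["y"] - last["y"]),
}).reset_index()
print(paired)
```

The result has one row per two-fly frame, like the `paired` table in the notebook, but the arithmetic runs on whole columns at once. As in the notebook, "fly 1" is just the lower `id` in each frame, not a tracked identity.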


@ -0,0 +1,398 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03 \u00b7 Your first real analysis: trained vs naive\n",
"\n",
"In notebook 02 we explored a single database. Now we'll work with **all\n",
"of them at once**, compute a simple per-fly metric, and ask the central\n",
"question of the project:\n",
"\n",
"> **Do trained males behave differently from na\u00efve males in the testing\n",
"> session?**\n",
"\n",
"By the end you'll have:\n",
"\n",
"- loaded every (fly, session) trace into one big DataFrame using the\n",
" project's helper function;\n",
"- reduced each trace to one number per fly (the *median inter-fly\n",
" distance*);\n",
"- compared the trained group against the na\u00efve group with a histogram\n",
" and a non-parametric statistical test;\n",
"- learnt enough to start asking your own questions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"\n",
"# Tell Python where to find the project's helper modules.\n",
"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
"\n",
"from load_roi_data import load_roi_data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading everything at once \u2014 but carefully\n",
"\n",
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
"(the full batch is ~200 million rows). Always start small.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Load the metadata TSV first \u2014 it's small and fast.\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"print(f\"metadata rows: {len(meta)}\")\n"
]
},
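{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n",
"with **just one species**, and keep **only the testing sessions** once\n",
"the data is loaded (each metadata row covers both sessions of a fly, so\n",
"the session filter has to happen after loading), because:\n",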
"\n",
"1. mixing species is a confound (different species behave differently);\n",
"2. the question is about behaviour after training, so the testing session\n",
" is the relevant one;\n",
"3. starting small means we can iterate quickly.\n",
"\n",
"You can come back later and broaden this filter.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n",
"sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n",
"\n",
"# We're loading every session for these flies, but the loader stamps each\n",
"# row with a 'session' column so we can filter to testing afterwards.\n",
"print(f\"selected metadata rows: {len(sub)}\")\n",
"print(sub[\"male\"].value_counts())\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# This will take a minute or two and use a chunk of RAM. Be patient.\n",
"data = load_roi_data(sub)\n",
"print(f\"loaded shape: {data.shape}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What did we get?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"data.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How big is each session, in tracking samples?\n",
"data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Restrict to the testing session\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"testing = data[data[\"session\"] == \"testing\"].copy()\n",
"print(f\"testing samples: {len(testing):,}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reduce each trace to one number\n",
"\n",
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"We can't compare distributions of millions of points across two groups\n",
"in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
"trace into a single summary number** \u2014 here, the median distance between\n",
"the two flies during testing.\n",
"\n",
"Why median rather than mean? Because tracker glitches (one fly\n",
"temporarily lost) can produce huge spikes that the median ignores.\n",
"[Why medians beat means in noisy data\n",
"(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 1 \u2014 per-frame distance.\n",
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
"\n",
"# For each (track, t), compute the distance between the two rows.\n",
"def distance_for_frame(g):\n",
"    g = g.sort_values(\"id\").reset_index(drop=True)\n",
"    return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n",
"\n",
"# This is the slow step. With ~3 M frames it takes a while.\n",
"per_frame = (\n",
"    two_fly\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n",
"    .apply(distance_for_frame)\n",
"    .reset_index(name=\"distance_px\")\n",
")\n",
"print(f\"per-frame distance rows: {len(per_frame):,}\")\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
"per_fly = (\n",
"    per_frame\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
"    .median()\n",
"    .reset_index(name=\"median_distance_px\")\n",
")\n",
"\n",
"# Each row now is \"one fly during testing\", with its median distance.\n",
"print(per_fly.shape)\n",
"per_fly.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sanity check: how many flies per group?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"per_fly[\"male\"].value_counts()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If one group is much smaller than the other, the statistical power of\n",
"the comparison is limited by the smaller group. Note the counts down.\n",
"\n",
"## Plot the distributions\n",
"\n",
"The first thing to do with two groups is to **look at them**. Don't trust\n",
"a p-value before you've seen the histogram.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(10, 5))\n",
"\n",
"bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n",
"\n",
"# Use a fresh name here so the metadata subset `sub` from earlier is not overwritten.\n",
"for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n",
"    vals = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n",
"    ax.hist(vals, bins=bins, alpha=0.6, label=f\"{label} (n={len(vals)})\", color=color)\n",
"\n",
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
"ax.set_ylabel(\"number of flies\")\n",
"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
"ax.legend()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What you might see:**\n",
"\n",
"- If the trained group's distribution is shifted to **higher** distances,\n",
" trained males are spending less time near the female (i.e. they\n",
" learned to give up).\n",
"- If the two distributions look identical, no learning effect was\n",
" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
" just that this particular summary didn't capture it.\n",
"- A **bimodal** trained distribution (two humps) would mean some males\n",
" learned and others didn't \u2014 the \"individual differences\" story in\n",
" `docs/bimodal_hypothesis.md`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add a stat test\n",
"\n",
"A formal comparison. Because group sizes are small and we don't know if\n",
"the data are normally distributed, the\n",
"[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n",
"is a safer default than the classic t-test.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n",
"naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n",
"\n",
"stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n",
"\n",
"print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n",
"print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n",
"print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How to read this**: the p-value is the probability of seeing a\n",
"difference at least this big *if there were really no difference*. By\n",
"convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n",
"But never trust a p-value without:\n",
"\n",
"1. eyeballing the histogram first (you did);\n",
"2. reporting the **effect size**, not just the p-value (e.g. the\n",
" difference of medians);\n",
"3. understanding that p-values\n",
" [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"- **Pick a different metric**: instead of median distance, try fraction\n",
" of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
" the maximum velocity per fly. (Velocity needs identity tracking, which\n",
" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
" compare. Does the effect generalize? Where is it strongest?\n",
"- **Look at the bimodality**: a kernel density plot\n",
" ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n",
" will show humps better than a histogram.\n",
"- **Time inside the session**: maybe the difference only shows up in the\n",
" first few minutes (right after the female is introduced). Slice\n",
" `per_frame` by `t` before aggregating.\n",
"- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n",
" testing the \"some flies learn, others don't\" hypothesis.\n",
"\n",
"When you write your own analysis, **save it as a new notebook** (don't\n",
"edit this one). Copy the setup cells, change the question, change the\n",
"plot. That's how analysis projects grow.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note on iteration speed\n",
"\n",
"The pipeline above is correct but **slow** because we apply a Python\n",
"function to every (track, t) group. If you find yourself re-running the\n",
"same expensive computation a lot, save the intermediate result to disk:\n",
"\n",
"```python\n",
"per_frame.to_parquet(\"per_frame_distance.parquet\")\n",
"# next time:\n",
"per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n",
"```\n",
"\n",
"`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
"environment doesn't have it.\n",
"\n",
"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
"correct answer first, optimize only if you find yourself waiting.\n"
]
}
]
}
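
Notebook 03 asks the reader to report an effect size next to the p-value. A minimal sketch, using made-up numbers rather than project data, of two common choices: the difference of medians and the rank-biserial correlation derived from the Mann-Whitney U statistic. The helper names `mann_whitney_u` and `rank_biserial` are hypothetical, and U is computed by brute force here so the example needs only the standard library.

```python
from statistics import median


def mann_whitney_u(a, b):
    """U statistic of sample a: pairs where a beats b, ties counting one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def rank_biserial(a, b):
    """Effect size in [-1, 1]; positive when values in a tend to be larger."""
    return 2.0 * mann_whitney_u(a, b) / (len(a) * len(b)) - 1.0


# Made-up per-fly median distances (px), four flies per group.
trained = [120.0, 150.0, 180.0, 210.0]
naive = [90.0, 100.0, 130.0, 160.0]

u = mann_whitney_u(trained, naive)
diff = median(trained) - median(naive)
r = rank_biserial(trained, naive)

print(f"U = {u:.1f}")
print(f"difference of medians = {diff:.1f} px")
print(f"rank-biserial r = {r:.3f}")
```

The pairwise loop is quadratic in the group sizes, which is fine for a few dozen flies per group. In the notebook itself the same effect size could probably be derived from the `stat` value returned by `stats.mannwhitneyu`, provided your SciPy version documents it as the U statistic of the first sample.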


@ -0,0 +1,15 @@
# Tutorial notebooks

Read these in order:

1. **`00_welcome.ipynb`** — what's the project, where the data lives,
how to use a Jupyter notebook.
2. **`01_python_pandas_basics.ipynb`** — minimum Python and pandas you
need to read project code.
3. **`02_explore_one_database.ipynb`** — open one tracking DB, plot a
trajectory, compute a single distance.
4. **`03_compare_trained_vs_naive.ipynb`** — first real analysis,
comparing groups.

After these, the notebooks one level up (`flies_analysis*.ipynb`) walk
through the full analysis pipeline that the previous student built.

requirements-tracking.txt

@ -0,0 +1,11 @@
# Extra dependencies needed only for the offline-tracking pipeline
# (build_video_inventory.py, pick_targets.py, auto_detect_targets.py,
# track_videos.py). Not needed for the existing analysis notebooks.
#
# install with: pip install -r requirements-tracking.txt
opencv-python>=4.8
openpyxl>=3.1
gitpython>=3.1
netifaces>=0.11
mysql-connector-python>=8.0
pyserial>=3.5


@ -0,0 +1,119 @
"""Try auto-detection of L-shape targets on each video and save JSON sidecars.

Useful for:
  - videos that DO have visible black-circle targets (saves manual clicks);
  - as a smoke test of the whole pipeline before running the picker.

Failure is silent: videos that fail auto-detection are simply not written
to disk, leaving them for the manual `pick_targets.py` tool.

Output JSON has the same shape as the manual picker's so `track_videos.py`
can consume either.
"""
from __future__ import annotations

import argparse
import datetime as dt
import json
import logging
import sys
from pathlib import Path

import cv2
import numpy as np
import pandas as pd

# ethoscope source tree
sys.path.insert(0, "/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope")

from config import INVENTORY_CSV, TARGETS_DIR  # noqa: E402
from ethoscope.roi_builders.target_roi_builder import TargetGridROIBuilder  # noqa: E402


def detect_one(video_path: Path, frame_idx: int) -> tuple[list[list[int]], int] | None:
    """Run ethoscope target detection on one frame; return (points, frame_idx) or None."""
    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        return None
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if n > 0 and frame_idx >= n:
        frame_idx = max(0, n - 1)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    if not ok or frame is None:
        return None

    # The detector expects a single-channel image (grey) like ethoscope cameras produce.
    if frame.ndim == 3:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    else:
        gray = frame

    # We don't actually need a fully-configured grid here; _find_target_coordinates
    # alone gives us the 3 reference points.
    builder = TargetGridROIBuilder(n_rows=2, n_cols=3)
    try:
        ref = builder._find_target_coordinates(gray)
    except Exception as e:
        logging.debug(f"detection failed for {video_path.name}: {e}")
        return None
    if ref is None:
        return None
    return [[int(p[0]), int(p[1])] for p in ref], frame_idx


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--frame", type=int, default=125)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--video", type=str, default=None,
                        help="run on a single video path (skips inventory)")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing JSON sidecars")
    args = parser.parse_args()

    TARGETS_DIR.mkdir(parents=True, exist_ok=True)

    if args.video:
        videos = [Path(args.video)]
    else:
        if not INVENTORY_CSV.exists():
            sys.exit("Inventory missing: run build_video_inventory.py first.")
        inv = pd.read_csv(INVENTORY_CSV)
        todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]]
        videos = [Path(p) for p in todo["mp4_path"].tolist()]
    if args.limit:
        videos = videos[: args.limit]

    n_ok = n_fail = n_skip = 0
    for v in videos:
        out = TARGETS_DIR / f"{v.stem}.json"
        if out.exists() and not args.overwrite:
            n_skip += 1
            continue
        result = detect_one(v, args.frame)
        if result is None:
            n_fail += 1
            print(f" fail: {v.name}")
            continue
        points, used_frame = result
        out.write_text(json.dumps({
            "video_path": str(v),
            "frame_index": int(used_frame),
            "reference_points": points,
            "order": ["top", "corner", "left"],
            "picked_at": dt.datetime.now().isoformat(timespec="seconds"),
            "method": "auto",
        }, indent=2))
        n_ok += 1
        print(f" ok: {v.name} -> {points}")

    print(f"\nDone. ok={n_ok} fail={n_fail} skipped(existing)={n_skip}")


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
    main()
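
The script above says the auto-detector and the manual picker write the same JSON shape. A minimal sketch of a reader that checks that shape before the file is used; `load_sidecar`, `REQUIRED_KEYS` and the checks themselves are hypothetical and not part of the repository, and the field names are taken from the `json.dumps` call above. The coordinates in the demo are made up.

```python
import json
import tempfile
from pathlib import Path

# Keys written by the auto-detector's json.dumps call.
REQUIRED_KEYS = {"video_path", "frame_index", "reference_points", "order", "method"}


def load_sidecar(path: Path) -> dict:
    """Read a target sidecar and verify it holds three [x, y] reference points."""
    data = json.loads(path.read_text())
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"{path.name}: missing keys {sorted(missing)}")
    points = data["reference_points"]
    if len(points) != 3 or any(len(p) != 2 for p in points):
        raise ValueError(f"{path.name}: expected 3 [x, y] points, got {points}")
    if len(data["order"]) != len(points):
        raise ValueError(f"{path.name}: 'order' does not match the points")
    return data


# Round-trip demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "example.json"
    sidecar.write_text(json.dumps({
        "video_path": "/tmp/example.mp4",
        "frame_index": 125,
        "reference_points": [[100, 50], [100, 400], [600, 400]],
        "order": ["top", "corner", "left"],
        "method": "auto",
    }))
    loaded = load_sidecar(sidecar)
    print(loaded["reference_points"])
```

Failing loudly on a malformed sidecar is a design choice for this sketch; the real consumer may prefer to skip such files.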


@ -0,0 +1,150 @
"""Build an inventory of videos available on disk and join with the metadata xlsx.

Scans /mnt/ethoscope_data/videos/<uuid>/<machine_name>/<date_time>/*.mp4
and produces a CSV mapping each (date, machine_name) row in
all_video_info_merged.xlsx to the corresponding merged.mp4 path on disk.

Output: data/metadata/video_inventory.csv with columns:
    machine_uuid, machine_name, session_date, session_time, mp4_path,
    in_xlsx (bool), already_tracked (bool)
"""
from __future__ import annotations

import re
from pathlib import Path

import pandas as pd

from config import DATA_RAW, INVENTORY_CSV, VIDEO_INFO_XLSX, VIDEOS_ROOT

SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$")


def scan_videos(videos_root: Path) -> pd.DataFrame:
    """Walk videos_root and return one row per merged.mp4 found.

    Args:
        videos_root: Root directory containing <uuid>/<machine_name>/<date_time>/.

    Returns:
        DataFrame with columns: machine_uuid, machine_name, session_date,
        session_time, session_datetime, mp4_path.
    """
    rows = []
    for uuid_dir in sorted(videos_root.iterdir()):
        if not uuid_dir.is_dir():
            continue
        for machine_dir in uuid_dir.iterdir():
            if not machine_dir.is_dir() or not machine_dir.name.startswith("ETHOSCOPE_"):
                continue
            for session_dir in machine_dir.iterdir():
                if not session_dir.is_dir():
                    continue
                m = SESSION_RE.match(session_dir.name)
                if not m:
                    continue
                date_str, time_str = m.group(1), m.group(2)
                # Prefer *_merged.mp4 if present
                merged = sorted(session_dir.glob("*_merged.mp4"))
                if not merged:
                    merged = sorted(session_dir.glob("*.mp4"))
                if not merged:
                    continue
                rows.append(
                    {
                        "machine_uuid": uuid_dir.name,
                        "machine_name": machine_dir.name,
                        "session_date": date_str,
                        "session_time": time_str,
                        "session_datetime": f"{date_str}_{time_str}",
                        "mp4_path": str(merged[0]),
                    }
                )
    return pd.DataFrame(rows)


def already_tracked_set(data_raw: Path) -> set[tuple[str, str]]:
    """Return the set of (date, time) sessions for which a tracking DB exists.

    DBs are named like:
        2025-07-15_16-03-10_<uuid>__1920x1088@25fps-28q_merged_tracking.db
    """
    out = set()
    for db in data_raw.glob("*_tracking.db"):
        m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name)
        if m:
            out.add((m.group(1), m.group(2)))
    return out


def main() -> None:
    print(f"Scanning {VIDEOS_ROOT} ...")
    videos_df = scan_videos(VIDEOS_ROOT)
    print(f" found {len(videos_df)} video sessions on disk")

    print(f"Loading metadata xlsx: {VIDEO_INFO_XLSX}")
    meta = pd.read_excel(VIDEO_INFO_XLSX)
    meta["session_date"] = meta["date"].dt.strftime("%Y-%m-%d")

    # The xlsx has one row per (date, machine, ROI); collapse to unique sessions
    meta_sessions = (
        meta[["session_date", "machine_name"]].drop_duplicates().reset_index(drop=True)
    )
    print(f" xlsx contains {len(meta_sessions)} unique (date, machine) sessions")

    # Mark which video sessions are referenced by the xlsx
    xlsx_keys = set(zip(meta_sessions["session_date"], meta_sessions["machine_name"]))
    videos_df["in_xlsx"] = videos_df.apply(
        lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1
    )

    # Mark which already have tracking DBs in data/raw/
    tracked = already_tracked_set(DATA_RAW)
    videos_df["already_tracked"] = videos_df.apply(
        lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1
    )

    INVENTORY_CSV.parent.mkdir(parents=True, exist_ok=True)
    videos_df.sort_values(["session_date", "machine_name", "session_time"]).to_csv(
        INVENTORY_CSV, index=False
    )

    # Coverage report
    in_xlsx = videos_df["in_xlsx"]
    needed = videos_df[in_xlsx & ~videos_df["already_tracked"]]
    n_xlsx_sessions = len(meta_sessions)
    n_with_video = videos_df[in_xlsx].drop_duplicates(
        ["session_date", "machine_name"]
    ).shape[0]

    # xlsx sessions that have no video on disk
    found_keys = set(
        zip(
            videos_df.loc[in_xlsx, "session_date"],
            videos_df.loc[in_xlsx, "machine_name"],
        )
    )
    missing = sorted(xlsx_keys - found_keys)

    print()
    print("=" * 70)
    print(f"Wrote inventory: {INVENTORY_CSV}")
    print(f" total video sessions on disk: {len(videos_df)}")
    print(f" xlsx unique sessions: {n_xlsx_sessions}")
    print(f" xlsx sessions with video: {n_with_video}")
    print(f" xlsx sessions missing video: {len(missing)}")
    print(f" already tracked (DB exists): {videos_df['already_tracked'].sum()}")
    print(f" TO TRACK (in_xlsx & ~tracked, video instances): {len(needed)}")
    if missing:
        print()
        print("xlsx sessions with NO matching video on disk:")
        for d, m in missing[:20]:
            print(f" {d} {m}")
        if len(missing) > 20:
            print(f" ... and {len(missing) - 20} more")


if __name__ == "__main__":
    main()
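
The `already_tracked` flag above comes from matching the `(date, time)` prefix of each session directory against the same prefix in the tracking-DB filenames. A minimal sketch of that matching on made-up names; the helper `session_key`, the name `DB_RE` and the `abc123` placeholder UUID are hypothetical, while the two patterns are copied from the script.

```python
import re

SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$")
DB_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_")


def session_key(name, pattern):
    """Return (date, time) if name matches pattern, else None."""
    m = pattern.match(name)
    return (m.group(1), m.group(2)) if m else None


session_dirs = ["2025-07-15_16-03-10", "2025-07-16_09-30-00", "notes"]
db_names = ["2025-07-15_16-03-10_abc123__1920x1088@25fps-28q_merged_tracking.db"]

# Keys of sessions that already have a tracking DB.
tracked = {session_key(n, DB_RE) for n in db_names} - {None}

for d in session_dirs:
    key = session_key(d, SESSION_RE)
    if key is None:
        print(f"{d}: skipped (not a session directory)")
    else:
        print(f"{d}: already_tracked={key in tracked}")
```

Note that the key ignores the machine name, so two machines that started recording in the same second would share a key. Whether that can happen in this dataset is not something the script states.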


@ -1,117 +1,99 @@
-import pandas as pd
+"""Compute per-frame inter-fly distances for every (date, machine, ROI, session).
+
+Reads tracking data via :func:`load_roi_data.load_roi_data` (which is driven
+by ``all_video_info_merged.tsv``) and produces one distances DataFrame
+spanning every fly/session in the batch. Group membership (``trained`` /
+``untrained``) is preserved from the ``male`` column.
+"""
 import numpy as np
+import pandas as pd
 from scipy.spatial.distance import euclidean
 from config import DATA_PROCESSED
+from load_roi_data import load_roi_data
-def calculate_fly_distances(trained_file=None, untrained_file=None):
-    """Calculate distances between flies at each time point.
+def calculate_fly_distances(data: pd.DataFrame | None = None) -> pd.DataFrame:
+    """Compute inter-fly distances over time for every fly/session.
-    For each time point:
-    - If two flies are detected: calculate Cartesian distance between them
-    - If one fly is detected: set distance to 0 if area > average area, otherwise NaN
+    For each time point inside one (date, machine, ROI, session) trajectory:
+    - 2+ flies detected: Euclidean distance between the first two by id
+    - 1 fly detected: distance = 0 if its bbox area exceeds the global
+      mean (likely a single blob containing both flies), else NaN
     Args:
-        trained_file (Path): Path to trained ROI data CSV.
-        untrained_file (Path): Path to untrained ROI data CSV.
+        data: optional pre-loaded DataFrame from :func:`load_roi_data`. If
+            None, the full batch is loaded.
     Returns:
-        tuple: (trained_distances, untrained_distances) DataFrames.
+        DataFrame with one row per (track, time) pair, including ``distance``,
+        ``n_flies``, ``area_fly1``, ``area_fly2``, plus the metadata columns
+        propagated from the source row (``date``, ``machine_name``, ``ROI``,
+        ``session``, ``male``, ``species``, ``memory``, ``age``).
     """
-    if trained_file is None:
-        trained_file = DATA_PROCESSED / 'trained_roi_data.csv'
-    if untrained_file is None:
-        untrained_file = DATA_PROCESSED / 'untrained_roi_data.csv'
+    if data is None:
+        data = load_roi_data()
+    if data.empty:
+        return pd.DataFrame()
-    trained_df = pd.read_csv(trained_file)
-    untrained_df = pd.read_csv(untrained_file)
-    trained_df['area'] = trained_df['w'] * trained_df['h']
-    untrained_df['area'] = untrained_df['w'] * untrained_df['h']
-    avg_area = np.mean([trained_df['area'].mean(), untrained_df['area'].mean()])
+    data = data.copy()
+    data["area"] = data["w"] * data["h"]
+    avg_area = data["area"].mean()
     print(f"Average area across all data: {avg_area:.2f}")
-    trained_distances = process_distance_data(trained_df, avg_area)
-    untrained_distances = process_distance_data(untrained_df, avg_area)
+    # Carry these onto every output row (constant within a track).
+    keep_meta = ["date", "machine_name", "ROI", "session", "male",
+                 "species", "memory", "age"]
-    return trained_distances, untrained_distances
-def process_distance_data(df, avg_area):
-    """Process a DataFrame to calculate distances between flies at each time point.
-    Args:
-        df (pd.DataFrame): Input tracking data.
-        avg_area (float): Average area threshold for single-fly detection.
-    Returns:
-        pd.DataFrame: Distance data with columns for machine, ROI, time, distance.
-    """
-    results = []
-    for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']):
-        for t, time_group in group.groupby('t'):
-            time_group = time_group.sort_values('id').reset_index(drop=True)
+    rows: list[dict] = []
+    track_keys = ["date", "machine_name", "ROI", "session"]
+    for track, track_df in data.groupby(track_keys, sort=False):
+        meta_row = {k: v for k, v in zip(track_keys, track)}
+        # Carry the rest of the metadata from any sample (constant per track).
+        sample = track_df.iloc[0]
+        for col in keep_meta:
+            if col not in meta_row:
+                meta_row[col] = sample[col]
+        for t, time_group in track_df.groupby("t", sort=False):
+            time_group = time_group.sort_values("id").reset_index(drop=True)
+            row = dict(meta_row)
+            row["t"] = t
             if len(time_group) >= 2:
-                fly1 = time_group.iloc[0]
-                fly2 = time_group.iloc[1]
-                distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])
+                f1, f2 = time_group.iloc[0], time_group.iloc[1]
+                row["distance"] = euclidean([f1["x"], f1["y"]], [f2["x"], f2["y"]])
+                row["n_flies"] = len(time_group)
+                row["area_fly1"] = f1["area"]
+                row["area_fly2"] = f2["area"]
+            else:
+                f = time_group.iloc[0]
+                row["distance"] = 0.0 if f["area"] > avg_area else np.nan
+                row["n_flies"] = 1
+                row["area_fly1"] = f["area"]
+                row["area_fly2"] = np.nan
+            rows.append(row)
-                results.append({
-                    'machine_name': machine_name,
-                    'ROI': roi,
-                    't': t,
-                    'distance': distance,
-                    'n_flies': len(time_group),
-                    'area_fly1': fly1['area'],
-                    'area_fly2': fly2['area']
-                })
-            elif len(time_group) == 1:
-                fly = time_group.iloc[0]
-                area = fly['area']
-                if area > avg_area:
-                    distance = 0.0
-                else:
-                    distance = np.nan
-                results.append({
-                    'machine_name': machine_name,
-                    'ROI': roi,
-                    't': t,
-                    'distance': distance,
-                    'n_flies': 1,
-                    'area_fly1': area,
-                    'area_fly2': np.nan
-                })
-    return pd.DataFrame(results)
+    return pd.DataFrame(rows)
-def main():
-    """Run distance calculations and save results."""
-    trained_distances, untrained_distances = calculate_fly_distances()
+def main() -> None:
+    distances = calculate_fly_distances()
-    print(f"Trained data distance summary:")
-    print(f" Shape: {trained_distances.shape}")
-    print(f" Distance stats:")
-    print(f" Count: {trained_distances['distance'].count()}")
-    print(f" Mean: {trained_distances['distance'].mean():.2f}")
-    print(f" Std: {trained_distances['distance'].std():.2f}")
+    print("\nDistance summary:")
+    print(f" Shape: {distances.shape}")
+    if not distances.empty:
+        print(f" Distance count: {distances['distance'].count()}")
+        print(f" Distance mean: {distances['distance'].mean():.2f}")
+        print(f" Distance std: {distances['distance'].std():.2f}")
+        male = distances["male"]
+        print(f" Trained tracks: {(male == 'trained').sum()}")
+        print(f" Naive tracks: {(male == 'naive').sum()}")
-    print(f"\nUntrained data distance summary:")
-    print(f" Shape: {untrained_distances.shape}")
-    print(f" Distance stats:")
-    print(f" Count: {untrained_distances['distance'].count()}")
-    print(f" Mean: {untrained_distances['distance'].mean():.2f}")
-    print(f" Std: {untrained_distances['distance'].std():.2f}")
-    trained_distances.to_csv(DATA_PROCESSED / 'trained_distances.csv', index=False)
-    untrained_distances.to_csv(DATA_PROCESSED / 'untrained_distances.csv', index=False)
-    print("\nDistance data saved")
+    DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
+    out = DATA_PROCESSED / "distances.csv"
+    distances.to_csv(out, index=False)
+    print(f"\nSaved {out}")
 if __name__ == "__main__":


@ -7,3 +7,16 @@ DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"
# Offline-tracking pipeline paths
VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos")
VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx"
INVENTORY_CSV = DATA_METADATA / "video_inventory.csv"
# Reason: kept on the local data volume alongside the tracking DBs (out of
# ownCloud sync). See TRACKING_OUTPUT_DIR comment below.
TARGETS_DIR = Path("/mnt/data/projects/cupido/targets")
# Reason: tracking DBs are large binary files that don't belong in
# ownCloud-synced storage (sync conflicts + bandwidth). They live on the
# local data volume instead. Regenerable from videos + target JSONs.
TRACKING_OUTPUT_DIR = Path("/mnt/data/projects/cupido/tracked")
LOGS_DIR = PROJECT_ROOT / "data" / "logs"


@ -0,0 +1,181 @@
"""Augment all_video_info_merged.xlsx with the input video + tracking DB paths.
Each xlsx row represents one fly (date, machine_name, ROI), observed across a
training session and a testing session. We resolve those two sessions to the
on-disk video files (via the inventory CSV) and to their tracking DBs (under
TRACKING_OUTPUT_DIR), then write the result as TSV.
Output columns added:
training_video_path, training_db_path,
testing_video_path, testing_db_path
Empty values mean either that no video matched (rare; implies a missing
inventory entry) or that no DB exists yet (e.g. the one video the
completeness gate rejected).
Usage:
python export_video_db_index.py
python export_video_db_index.py --out path/to/output.tsv
"""
from __future__ import annotations
import argparse
import re
from pathlib import Path
import pandas as pd
from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX
_TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE)
def parse_xlsx_time(value: str) -> tuple[str, int] | None:
"""Convert '20241021_11AM' / '20240918_1030AM' to (YYYY-MM-DD, minutes24).
Resolution is hour-only when no minutes are given (e.g. '11AM' → 11:00).
Returns minutes-from-midnight so we can do nearest-neighbor matching.
"""
if not isinstance(value, str):
return None
m = _TIME_RE.match(value.strip())
if not m:
return None
ymd, hh, mm, ampm = m.groups()
date = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:8]}"
hour = int(hh)
minute = int(mm) if mm else 0
if ampm.upper() == "PM" and hour != 12:
hour += 12
if ampm.upper() == "AM" and hour == 12:
hour = 0
return date, hour * 60 + minute
def build_session_index(inventory: pd.DataFrame) -> dict[tuple[str, str], list[dict]]:
"""Index inventory rows by (date, machine_name) → list of session dicts."""
idx: dict[tuple[str, str], list[dict]] = {}
for row in inventory.itertuples(index=False):
h, m, _s = (int(p) for p in str(row.session_time).split("-"))
key = (row.session_date, row.machine_name)
idx.setdefault(key, []).append({
"mp4_path": row.mp4_path,
"session_datetime": row.session_datetime,
"minutes": h * 60 + m,
})
return idx
def db_path_for_video(mp4_path: str) -> Path | None:
"""Tracker writes <video_stem>_tracking.db under TRACKING_OUTPUT_DIR."""
stem = Path(mp4_path).stem
db = TRACKING_OUTPUT_DIR / f"{stem}_tracking.db"
return db if db.exists() else None
_TIME_TOLERANCE_MIN = 90 # xlsx labels are approximate ("11AM" → 10:51 is fine)
def resolve_session(
machine_name: str,
when: str,
fallback_date: str | None,
index: dict[tuple[str, str], list[dict]],
) -> tuple[str, str]:
"""Look up the video + db whose start time is closest to `when`.
Match strategy:
1. Use the date embedded in `when` (training/testing can fall on a
different calendar day from the row's ``date`` column).
2. If no candidates exist for that date, fall back to ``fallback_date``
(the xlsx row's ``date`` column). Reason: the xlsx contains
date typos like '20240110_11AM' for an Oct 1 experiment.
Among candidates, pick the video whose start minute is closest to the
xlsx-claimed time, within ±_TIME_TOLERANCE_MIN.
"""
parsed = parse_xlsx_time(when)
if parsed is None:
return "", ""
date, target_min = parsed
candidates = index.get((date, machine_name), [])
if not candidates and fallback_date:
candidates = index.get((fallback_date, machine_name), [])
if not candidates:
return "", ""
def _gap(target: int, c: dict) -> int:
# Reason: xlsx times like '1230AM' are ambiguous (12 AM vs 12 PM).
# We try both the literal time AND a +12-hour shift, picking the
# interpretation that brings us closest to a real session.
return min(abs(c["minutes"] - target), abs(c["minutes"] - (target + 720) % 1440))
best = min(candidates, key=lambda c: _gap(target_min, c))
if _gap(target_min, best) > _TIME_TOLERANCE_MIN:
return "", ""
db = db_path_for_video(best["mp4_path"])
return best["mp4_path"], (str(db) if db else "")
# Variants of "naive" the xlsx has accumulated: 'naïve', 'niave', plus
# trailing whitespace. All collapse to a single canonical 'naive'.
_MALE_NAIVE_VARIANTS = {"naïve", "niave", "naive"}
def _normalize_metadata(df: pd.DataFrame) -> None:
"""Strip whitespace and canonicalize the ``male`` column in place."""
for col in df.select_dtypes(include=("object", "string")).columns:
df[col] = df[col].astype(str).str.strip()
df["male"] = df["male"].apply(
lambda v: "naive" if v.lower() in _MALE_NAIVE_VARIANTS else v
)
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--out",
type=Path,
default=VIDEO_INFO_XLSX.with_suffix(".tsv"),
help="output TSV path (default: alongside the xlsx)",
)
args = parser.parse_args()
inv = pd.read_csv(INVENTORY_CSV)
inv = inv[inv["in_xlsx"]].copy()
index = build_session_index(inv)
df = pd.read_excel(VIDEO_INFO_XLSX)
_normalize_metadata(df)
date_iso = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
train_videos, train_dbs, test_videos, test_dbs = [], [], [], []
for fallback, row in zip(date_iso, df.itertuples(index=False)):
tv, td = resolve_session(row.machine_name, row.training_date_time, fallback, index)
sv, sd = resolve_session(row.machine_name, row.testing_date_time, fallback, index)
train_videos.append(tv)
train_dbs.append(td)
test_videos.append(sv)
test_dbs.append(sd)
df["training_video_path"] = train_videos
df["training_db_path"] = train_dbs
df["testing_video_path"] = test_videos
df["testing_db_path"] = test_dbs
df.to_csv(args.out, sep="\t", index=False)
n_rows = len(df)
n_train_video = sum(bool(v) for v in train_videos)
n_train_db = sum(bool(v) for v in train_dbs)
n_test_video = sum(bool(v) for v in test_videos)
n_test_db = sum(bool(v) for v in test_dbs)
print(f"wrote {args.out} ({n_rows} rows)")
print(f" training: {n_train_video} with video, {n_train_db} with DB")
print(f" testing: {n_test_video} with video, {n_test_db} with DB")
if __name__ == "__main__":
main()
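The matching rule described in `resolve_session` can be tried in isolation. The following is a minimal sketch, not the script itself: `parse_label`, `gap` and `closest` are hypothetical names, sessions are reduced to plain start minutes, and the 90-minute tolerance is copied from `_TIME_TOLERANCE_MIN`. It shows how the +12-hour retry lets a label such as '1230AM' still match a session that actually started at 12:30 PM.

```python
from __future__ import annotations

import re

# Same shape as the script's pattern: YYYYMMDD_H[MM]AM/PM
TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE)
TOLERANCE_MIN = 90  # assumption: mirrors _TIME_TOLERANCE_MIN


def parse_label(value: str) -> tuple[str, int] | None:
    """Turn '20241021_11AM' into ('2024-10-21', minutes from midnight)."""
    m = TIME_RE.match(value.strip())
    if not m:
        return None
    ymd, hh, mm, ampm = m.groups()
    hour = int(hh) % 12  # 12AM and 12PM both start from 0 here
    if ampm.upper() == "PM":
        hour += 12
    return f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:]}", hour * 60 + int(mm or 0)


def gap(session_min: int, target_min: int) -> int:
    """Distance in minutes, also trying the label shifted by 12 hours."""
    literal = abs(session_min - target_min)
    shifted = abs(session_min - (target_min + 720) % 1440)
    return min(literal, shifted)


def closest(sessions: list[int], label: str) -> int | None:
    """Pick the session start minute nearest to the label, within tolerance."""
    parsed = parse_label(label)
    if parsed is None or not sessions:
        return None
    _, target = parsed
    best = min(sessions, key=lambda s: gap(s, target))
    return best if gap(best, target) <= TOLERANCE_MIN else None
```

One consequence of the 12-hour retry worth keeping in mind: a machine with sessions at both 10:50 and 22:50 on the same day would tie for an '11AM' label, and `min` then keeps whichever candidate comes first in the list.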


@ -1,90 +1,113 @@
import pandas as pd
"""Load ROI tracking data from all sessions into one DataFrame.
Drives off the merged TSV (one row per ROI/fly across training + testing
phases). For each TSV row, opens the corresponding tracking DB and pulls
the matching ROI table, then attaches the experimental metadata.
The TSV is the single source of truth for what data exists and how it
maps to flies and conditions.
"""
import sqlite3
import re
from pathlib import Path
from config import DATA_RAW, DATA_METADATA, DATA_PROCESSED
import pandas as pd
from config import VIDEO_INFO_XLSX
def load_roi_data():
"""Load ROI data from SQLite databases and group by trained/untrained.
# Metadata columns to copy onto every tracking sample. These are the xlsx
# fields that describe the experimental condition behind each fly/ROI.
# Reason: the ROI column is uppercase ("ROI") for backwards compatibility
# with the existing analysis pipeline (calculate_distances.py, notebooks).
_META_COLS = (
"date",
"machine_name",
"species",
"male",
"training_date_time",
"testing_date_time",
"training_length_hr",
"consolidation_length_hr",
"memory",
"age",
)
def _open_ro(db_path: str, cache: dict) -> sqlite3.Connection | None:
"""Cached read-only sqlite connection. Returns None on failure."""
if not isinstance(db_path, str) or not db_path:
return None
if db_path not in cache:
try:
cache[db_path] = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
except sqlite3.Error as e:
print(f"failed to open {Path(db_path).name}: {e}")
cache[db_path] = None
return cache[db_path]
def load_roi_data(meta: pd.DataFrame | None = None) -> pd.DataFrame:
"""Load ROI tracking data joined with experimental metadata.
For each row in ``meta``, reads the matching ROI table from both the
training DB and the testing DB (whichever exist), and stamps every
sample with the row's metadata plus a ``session`` column
(``"training"`` or ``"testing"``). Rows with empty DB paths (unusable
videos, or videos that didn't pass the completeness gate) are skipped.
Args:
meta: optional DataFrame with the same schema as
``all_video_info_merged.tsv``. Pass a filtered slice to load a
subset (e.g. ``meta[meta.species == 'Melanogaster/CS']``).
Defaults to the full TSV.
Returns:
tuple: (trained_df, untrained_df) DataFrames with tracking data.
DataFrame with columns ``id, t, x, y, w, h, phi, is_inferred,
has_interacted, session, ROI, <metadata>``, one row per tracking
sample. Empty if nothing could be loaded.
"""
metadata = pd.read_csv(DATA_METADATA / '2025_07_15_metadata_fixed.csv')
metadata['machine_name'] = metadata['machine_name'].astype(str)
if meta is None:
meta = pd.read_csv(VIDEO_INFO_XLSX.with_suffix(".tsv"), sep="\t")
trained_rois = metadata[metadata['group'] == 'trained']
untrained_rois = metadata[metadata['group'] == 'untrained']
db_cache: dict = {}
chunks: list[pd.DataFrame] = []
db_files = list(DATA_RAW.glob('*_tracking.db'))
trained_df = pd.DataFrame()
untrained_df = pd.DataFrame()
for db_file in db_files:
print(f"Processing {db_file.name}")
pattern = r'_([0-9a-f]{32})__'
match = re.search(pattern, db_file.name)
if not match:
print(f"Could not extract UUID from {db_file.name}")
continue
uuid = match.group(1)
metadata_matches = metadata[metadata['path'].str.contains(uuid, na=False)]
if metadata_matches.empty:
print(f"No metadata matches found for UUID {uuid} from {db_file.name}")
continue
machine_id = metadata_matches.iloc[0]['machine_name']
print(f"Matched to machine ID: {machine_id}")
conn = sqlite3.connect(str(db_file))
machine_trained = trained_rois[trained_rois['machine_name'] == machine_id]
machine_untrained = untrained_rois[untrained_rois['machine_name'] == machine_id]
for _, row in machine_trained.iterrows():
roi = row['ROI']
for row in meta.itertuples(index=False):
for session in ("training", "testing"):
conn = _open_ro(getattr(row, f"{session}_db_path"), db_cache)
if conn is None:
continue
try:
query = f"SELECT * FROM ROI_{roi}"
roi_data = pd.read_sql_query(query, conn)
roi_data['machine_name'] = machine_id
roi_data['ROI'] = roi
roi_data['group'] = 'trained'
trained_df = pd.concat([trained_df, roi_data], ignore_index=True)
df = pd.read_sql_query(
f"SELECT * FROM ROI_{int(row.roi)}", conn
)
except Exception as e:
print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
# Reason: a DB may be missing a ROI table if tracking was
# partial — skip rather than abort the whole batch.
print(f" ROI_{row.roi} from {session} DB: {e}")
continue
df["session"] = session
df["ROI"] = int(row.roi)
for col in _META_COLS:
df[col] = getattr(row, col)
chunks.append(df)
for _, row in machine_untrained.iterrows():
roi = row['ROI']
try:
query = f"SELECT * FROM ROI_{roi}"
roi_data = pd.read_sql_query(query, conn)
roi_data['machine_name'] = machine_id
roi_data['ROI'] = roi
roi_data['group'] = 'untrained'
untrained_df = pd.concat([untrained_df, roi_data], ignore_index=True)
except Exception as e:
print(f"Error loading ROI_{roi} from {db_file.name}: {e}")
for conn in db_cache.values():
if conn is not None:
conn.close()
conn.close()
return trained_df, untrained_df
return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
if __name__ == "__main__":
trained_data, untrained_data = load_roi_data()
print(f"Trained data shape: {trained_data.shape}")
print(f"Untrained data shape: {untrained_data.shape}")
if not trained_data.empty:
print("Trained data columns:", trained_data.columns.tolist())
if not untrained_data.empty:
print("Untrained data columns:", untrained_data.columns.tolist())
trained_data.to_csv(DATA_PROCESSED / 'trained_roi_data.csv', index=False)
untrained_data.to_csv(DATA_PROCESSED / 'untrained_roi_data.csv', index=False)
print("Data saved to trained_roi_data.csv and untrained_roi_data.csv")
data = load_roi_data()
print(f"shape: {data.shape}")
if not data.empty:
print(f"columns: {list(data.columns)}")
print(f"sessions: {data['session'].value_counts().to_dict()}")
print(f"unique machines: {data['machine_name'].nunique()}")
print(
f"unique flies (date,machine,roi): "
f"{data.groupby(['date','machine_name','ROI']).ngroups}"
)
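The read-only access pattern used by `_open_ro` and the per-ROI query can be exercised without pandas or a real tracking DB. This is a sketch under assumptions: `read_roi_rows` is a hypothetical helper, and the throwaway table below uses a made-up four-column schema rather than the tracker's real one. It shows that a missing `ROI_<n>` table can be skipped instead of aborting the whole load.

```python
import sqlite3
import tempfile
from pathlib import Path


def read_roi_rows(db_path: Path, roi: int) -> list[tuple]:
    """Return all rows of table ROI_<roi>, or [] if the table is missing."""
    # mode=ro makes SQLite refuse writes and refuse to create a missing file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # int() keeps the interpolated table name purely numeric.
        return conn.execute(f"SELECT * FROM ROI_{int(roi)}").fetchall()
    except sqlite3.OperationalError:
        return []  # e.g. "no such table" after partial tracking
    finally:
        conn.close()


# Build a throwaway DB (hypothetical schema: id, t, x, y) to try the helper.
tmp_dir = Path(tempfile.mkdtemp())
demo_db = tmp_dir / "demo_tracking.db"
writer = sqlite3.connect(demo_db)
writer.execute("CREATE TABLE ROI_1 (id INTEGER, t INTEGER, x INTEGER, y INTEGER)")
writer.executemany(
    "INSERT INTO ROI_1 VALUES (?, ?, ?, ?)",
    [(1, 0, 10, 20), (2, 40, 11, 21)],
)
writer.commit()
writer.close()

rows_present = read_roi_rows(demo_db, 1)  # table exists
rows_missing = read_roi_rows(demo_db, 2)  # table was never written
```

Catching `sqlite3.OperationalError` specifically, rather than a bare `Exception`, keeps unrelated failures visible while still tolerating partially tracked DBs.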

176
scripts/monitor_tracking.py Normal file

@ -0,0 +1,176 @@
"""Live progress + ETA for the offline tracker batch.
Counts ground-truth (DBs on disk) rather than parsing log lines, so it works
whether the batch is running fresh or was resumed after a crash. Errors are
parsed out of any *.log files in data/logs/.
Usage:
python monitor_tracking.py # one snapshot, exit
python monitor_tracking.py --watch # refresh every 10 s
python monitor_tracking.py --watch 30 # refresh every 30 s
"""
from __future__ import annotations
import argparse
import json
import re
import time
from datetime import datetime, timedelta
from pathlib import Path
from config import LOGS_DIR, TARGETS_DIR, TRACKING_OUTPUT_DIR
def count_target_jsons() -> tuple[int, int, list[str]]:
"""Return (n_pickable, n_unusable, unusable_video_stems)."""
pickable = 0
unusable_stems: list[str] = []
for j in TARGETS_DIR.glob("*.json"):
try:
d = json.loads(j.read_text())
except Exception:
continue
if d.get("unusable"):
unusable_stems.append(j.stem)
elif d.get("reference_points"):
pickable += 1
return pickable, len(unusable_stems), unusable_stems
def count_tracked_dbs() -> tuple[int, datetime | None, str | None]:
"""Return (n_dbs, mtime_of_newest, name_of_newest)."""
dbs = list(TRACKING_OUTPUT_DIR.glob("*_tracking.db"))
if not dbs:
return 0, None, None
newest = max(dbs, key=lambda p: p.stat().st_mtime)
return len(dbs), datetime.fromtimestamp(newest.stat().st_mtime), newest.stem
def parse_recent_errors(log_dir: Path, tail_lines: int = 5000) -> list[str]:
"""Scan the most recent *.log file for lines reporting errors."""
if not log_dir.exists():
return []
logs = sorted(log_dir.glob("*.log"), key=lambda p: p.stat().st_mtime)
if not logs:
return []
latest = logs[-1]
try:
with latest.open() as f:
tail = f.readlines()[-tail_lines:]
except Exception:
return []
out = []
for line in tail:
if re.search(r":\s*error\b", line) or " error: " in line.lower():
out.append(line.rstrip())
return out
def db_completion_history() -> list[float]:
"""Return mtimes of all tracking DBs, sorted ascending. Used for rate."""
return sorted(p.stat().st_mtime for p in TRACKING_OUTPUT_DIR.glob("*_tracking.db"))
def fmt_duration(seconds: float) -> str:
if seconds < 60:
return f"{int(seconds)} s"
if seconds < 3600:
return f"{int(seconds // 60)} min"
h = int(seconds // 3600)
m = int((seconds % 3600) // 60)
return f"{h} h {m} min"
def snapshot() -> str:
pickable, unusable, _ = count_target_jsons()
tracked, last_mtime, last_name = count_tracked_dbs()
history = db_completion_history()
errors = parse_recent_errors(LOGS_DIR)
lines = [f"tracking progress @ {datetime.now():%Y-%m-%d %H:%M:%S}"]
lines.append(f" pickable JSONs: {pickable}")
lines.append(f" unusable JSONs: {unusable} (skipped by tracker)")
pct = (tracked / pickable * 100) if pickable else 0
lines.append(
f" DBs on disk: {tracked} / {pickable} ({pct:.0f}%)"
)
lines.append(f" errors in log: {len(errors)}")
# Rate from completions in the last 6 h — robust to gaps from killed /
# restarted runs, while wide enough to span multiple parallel-worker
# completion bursts. Reason: with 8 workers all started together on
# multi-hour videos, completions arrive in tight bursts every ~video-
# length apart; a 30-min window catches one burst and overestimates by
# ~10×. 6 h spans at least one full burst cycle for typical videos.
now_ts = time.time()
window_secs = 6 * 3600
recent = [t for t in history if t >= now_ts - window_secs]
if len(recent) >= 2:
# Reason: with N parallel workers, completions arrive in clumps
# (all workers finish near-simultaneously). Dividing N by the *burst*
# span gives nonsense rates. Use the full window as the denominator
# once the batch has been running long enough to fill it; otherwise
# use elapsed-since-first-DB. Detection: if every DB on disk also
# falls inside the window, the batch is younger than the window.
if len(recent) == len(history):
elapsed = max(1.0, now_ts - history[0])
else:
elapsed = float(window_secs)
if elapsed > 0:
rate_per_hour = len(recent) / elapsed * 3600
lines.append(
f" rate (last {len(recent)} in {int(window_secs/3600)} h):"
f" {rate_per_hour:.1f} videos/hour"
)
remaining = max(0, pickable - tracked)
if rate_per_hour > 0 and remaining > 0:
eta_sec = remaining * 3600 / rate_per_hour
eta_at = datetime.now() + timedelta(seconds=eta_sec)
lines.append(
f" ETA remaining: {fmt_duration(eta_sec)} "
f"(done by {eta_at:%H:%M %a})"
)
else:
lines.append(" rate: (warming up — check again in a few min)")
if last_mtime is not None and last_name is not None:
ago = (datetime.now() - last_mtime).total_seconds()
lines.append(
f" most recent DB: {last_name[:60]}... ({fmt_duration(ago)} ago)"
)
if errors:
lines.append("")
lines.append(f" recent errors ({min(5, len(errors))} of {len(errors)}):")
for e in errors[-5:]:
lines.append(f" {e[:120]}")
return "\n".join(lines)
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--watch", nargs="?", type=int, const=10, default=None,
help="refresh every N seconds (default 10 if flag given without value)",
)
args = parser.parse_args()
if args.watch is None:
print(snapshot())
return
try:
while True:
# Clear screen and reprint
print("\033[2J\033[H", end="")
print(snapshot())
print(f"\n(refreshing every {args.watch}s — Ctrl-C to exit)")
time.sleep(args.watch)
except KeyboardInterrupt:
print()
if __name__ == "__main__":
main()
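The rate and ETA arithmetic inside `snapshot()` is easier to check when pulled out as pure functions. The sketch below assumes that is an acceptable simplification: `completion_rate` and `eta_seconds` are hypothetical names, and file mtimes are passed in as plain floats instead of being read from disk.

```python
from __future__ import annotations


def completion_rate(
    mtimes: list[float], now: float, window_s: float = 6 * 3600
) -> float | None:
    """Completions per hour, or None while fewer than 2 fall in the window."""
    history = sorted(mtimes)
    recent = [t for t in history if t >= now - window_s]
    if len(recent) < 2:
        return None
    if len(recent) == len(history):
        # Batch is younger than the window: divide by time since the first DB.
        elapsed = max(1.0, now - history[0])
    else:
        # Batch is older than the window: use the full window as denominator.
        elapsed = float(window_s)
    return len(recent) / elapsed * 3600


def eta_seconds(remaining: int, rate_per_hour: float) -> float:
    """Seconds needed to finish `remaining` videos at the given hourly rate."""
    return remaining * 3600 / rate_per_hour
```

With completions at 30000, 30600 and 31200 s, two older ones outside the window, and `now = 32400`, the rate works out to 3 completions over 6 h, i.e. 0.5 videos/hour. Note that for a young batch the denominator starts at the first completion, so a reading taken shortly after the first burst of parallel workers will still overestimate.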

467
scripts/pick_targets.py Normal file
View file

@ -0,0 +1,467 @@
"""Interactive target picker for offline tracking (matplotlib/Tk GUI).
Loops through videos that need tracking and lets the user click 3 reference
points per video in L-shape order:
1) TOP target (above the corner)
2) CORNER target (the right-angle vertex)
3) LEFT target (to the left of the corner)
These three points are the same reference layout used by ethoscope's
`TargetGridROIBuilder`: dst_points = [(0, -1), (0, 0), (-1, 0)] in unit
coordinates. Saving them as a JSON sidecar lets the offline tracker build the
6-ROI HD mating arena grid without needing auto-target detection.
Output JSON sidecar: TARGETS_DIR/<video_basename>.json
{
"video_path": "/mnt/.../*.mp4",
"frame_index": <int>,
"reference_points": [[x0, y0], [x1, y1], [x2, y2]],
"order": ["top", "corner", "left"],
"picked_at": "<isoformat>"
}
Keys (in the picker window):
LEFT-CLICK add a point (top → corner → left)
r reset clicks for current video
d skip this video for THIS run only (no JSON written)
u mark this video unusable (FOV wrong etc.); skipped forever
. / , advance / rewind by 25 frames (≈ 1 s @ 25 fps)
] / [ advance / rewind by 5% of the video (~3 min in a 1 h video)
# jump to the middle of the video
enter save the 3 points and move on
q / ESC quit picker
After the 3rd click, the 6 ROI rectangles are drawn over the frame so you
can sanity-check the geometry before pressing ENTER.
With --redo, if a JSON sidecar exists, its points are pre-loaded so you can
nudge them rather than restart from scratch.
Why matplotlib instead of cv2.imshow:
OpenCV's bundled GUI uses Qt, which needs XKeyboard + a fonts directory and
is fragile over SSH X11-forwarding. matplotlib's TkAgg backend uses pure
Tk/X11 and works out of the box on any DISPLAY (and gives free pan/zoom
via the toolbar, useful for clicking small targets precisely).
"""
from __future__ import annotations
import argparse
import datetime as dt
import json
import os
import sys
from pathlib import Path
# Force TkAgg BEFORE importing matplotlib. We override even if MPLBACKEND is
# already set, because the script is unusable with a non-interactive backend.
os.environ["MPLBACKEND"] = "TkAgg"
import cv2 # noqa: E402
import matplotlib # noqa: E402
import matplotlib.pyplot as plt # noqa: E402
import numpy as np # noqa: E402
import pandas as pd # noqa: E402
# matplotlib.backend_bases exposes the cursor identifiers under different
# names depending on version: `Cursors` enum on 3.5+, lowercase `cursors`
# instance on older releases. Both have the same integer attributes.
try:
from matplotlib.backend_bases import Cursors as _Cursors # 3.5+
except ImportError:
try:
from matplotlib.backend_bases import cursors as _Cursors # older
except ImportError:
_Cursors = None
# Verify we ended up on an interactive backend; fail loudly (with a concrete
# explanation) if not. matplotlib silently falls back to 'agg' when its
# requested backend can't load, which is hard to debug without help.
_backend = matplotlib.get_backend()
if _backend.lower() in ("agg", "headless", "template", "pdf", "svg", "ps"):
diag = []
try:
import tkinter as _tk
try:
_tk.Tk().destroy()
diag.append("tkinter import + Tk() instantiation: OK")
except Exception as e:
diag.append(f"tkinter imported but Tk() failed: {e!r}")
except Exception as e:
diag.append(f"tkinter import FAILED: {e!r}")
diag.append(" → on Manjaro/Arch, run: sudo pacman -S tk")
print(
f"ERROR: matplotlib loaded the non-interactive backend {_backend!r}.\n"
f" Expected 'TkAgg'. Diagnostic info:\n"
f" DISPLAY = {os.environ.get('DISPLAY')!r}\n"
f" MPLBACKEND = {os.environ.get('MPLBACKEND')!r}\n"
f" matplotlib ver = {matplotlib.__version__}\n"
+ "\n".join(f" {d}" for d in diag),
file=sys.stderr,
)
sys.exit(2)
from config import INVENTORY_CSV, TARGETS_DIR # noqa: E402
from tracking_geometry import compute_roi_polygons # noqa: E402
# Strip default matplotlib keybindings that would conflict with ours.
for k in ("keymap.home", "keymap.save", "keymap.quit", "keymap.fullscreen",
"keymap.pan", "keymap.zoom", "keymap.back", "keymap.forward"):
try:
plt.rcParams[k] = []
except KeyError:
pass
CLICK_LABELS = ("TOP", "CORNER", "LEFT")
CLICK_COLORS = ("red", "lime", "deepskyblue")
def grab_frame(
video_path: Path, frame_idx: int
) -> tuple[np.ndarray, int, int] | None:
"""Return (RGB frame, actual_frame_idx, n_frames) from the video, or None.
Clamps frame_idx to [0, n_frames-1] so callers can step blindly.
"""
cap = cv2.VideoCapture(str(video_path))
if not cap.isOpened():
return None
n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
if n > 0:
frame_idx = max(0, min(frame_idx, n - 1))
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
ok, frame = cap.read()
cap.release()
if not ok or frame is None:
return None
return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), frame_idx, n
def pick_one(
video_path: Path,
frame_idx: int,
status_prefix: str,
initial_points: list[tuple[float, float]] | None = None,
) -> dict | None:
"""Show the picker UI for a single video; return the result dict or None."""
grabbed = grab_frame(video_path, frame_idx)
if grabbed is None:
print(f" ! cannot read {video_path}")
return None
frame, frame_idx, n_frames = grabbed
# Big-step size for ] / [ : 5% of total length, ~3 min in a 1h video.
big_step = max(1, int(round(0.05 * n_frames))) if n_frames > 0 else 250
fig, ax = plt.subplots(figsize=(14, 8))
try:
fig.canvas.manager.set_window_title("pick targets")
except Exception:
pass
# Use a crosshair cursor over the axes so it's obvious where the click
# will land. matplotlib's toolbar resets the cursor to POINTER (arrow) on
# every mouse-move when no tool is active, so we intercept set_cursor:
# whenever it asks for POINTER, we substitute SELECT_REGION (crosshair).
# Tool modes (zoom/pan) keep their native cursors.
if _Cursors is not None:
_orig_set_cursor = fig.canvas.set_cursor
def _set_cursor_with_crosshair(cursor):
if cursor == _Cursors.POINTER:
cursor = _Cursors.SELECT_REGION
return _orig_set_cursor(cursor)
fig.canvas.set_cursor = _set_cursor_with_crosshair
try:
fig.canvas.set_cursor(_Cursors.SELECT_REGION)
except Exception:
pass
else:
# Last-ditch: just set the Tk widget's cursor once and hope the
# toolbar doesn't immediately overwrite it.
try:
fig.canvas.get_tk_widget().config(cursor="tcross")
except Exception:
pass
img_artist = ax.imshow(frame)
ax.set_axis_off()
fig.tight_layout()
state = {
"points": list(initial_points) if initial_points else [],
"action": None, # 'save' | 'skip' | 'quit' | 'unusable'
"frame": frame,
"frame_idx": frame_idx,
"drawn": [], # artists drawn on top of the image
}
def update_title():
nb = len(state["points"])
nxt = (
f"click {CLICK_LABELS[nb]}"
if nb < 3
else "ENTER=save | r=reset d=skip u=unusable q=quit | . , [ ] # = step frame"
)
ax.set_title(
f'{status_prefix} frame {state["frame_idx"]} | {nxt}',
fontsize=10,
)
def redraw_points():
for a in state["drawn"]:
try:
a.remove()
except Exception:
pass
state["drawn"].clear()
for i, (x, y) in enumerate(state["points"]):
color = CLICK_COLORS[i]
label = CLICK_LABELS[i]
(cross,) = ax.plot(x, y, marker="+", color=color, markersize=22, mew=2)
(ring,) = ax.plot(
x, y, marker="o", color=color, markersize=22,
fillstyle="none", mew=2,
)
txt = ax.text(
x + 14, y - 14, label,
color=color, fontsize=10, weight="bold",
)
state["drawn"].extend([cross, ring, txt])
if len(state["points"]) >= 2:
(line1,) = ax.plot(
[state["points"][0][0], state["points"][1][0]],
[state["points"][0][1], state["points"][1][1]],
color="white", linewidth=0.7, alpha=0.6,
)
state["drawn"].append(line1)
if len(state["points"]) == 3:
(line2,) = ax.plot(
[state["points"][1][0], state["points"][2][0]],
[state["points"][1][1], state["points"][2][1]],
color="white", linewidth=0.7, alpha=0.6,
)
state["drawn"].append(line2)
# ROI overlay — draw the 6 computed rectangles on top of the frame
try:
polys = compute_roi_polygons(state["points"])
except Exception as e:
polys = []
print(f" (ROI preview failed: {e})")
for j, poly in enumerate(polys):
# Close the polygon by repeating the first point
xs = list(poly[:, 0]) + [poly[0, 0]]
ys = list(poly[:, 1]) + [poly[0, 1]]
(line,) = ax.plot(
xs, ys, color="yellow", linewidth=1.5, alpha=0.9,
)
state["drawn"].append(line)
cx = float(np.mean(poly[:, 0]))
cy = float(np.mean(poly[:, 1]))
lbl = ax.text(
cx, cy, str(j + 1),
color="yellow", fontsize=14, weight="bold",
ha="center", va="center",
)
state["drawn"].append(lbl)
update_title()
fig.canvas.draw_idle()
def reload_frame(new_idx: int):
grabbed = grab_frame(video_path, new_idx)
if grabbed is None:
return
new_frame, new_idx, _ = grabbed
state["frame"] = new_frame
state["frame_idx"] = new_idx
img_artist.set_data(new_frame)
# Keep clicked targets + ROI overlay in place across frame-stepping —
# press 'r' to clear them explicitly.
redraw_points()
def on_click(event):
if event.inaxes is not ax:
return
if event.button != 1: # left click only
return
if event.xdata is None or event.ydata is None:
return
# Skip clicks fired while the toolbar's pan/zoom is active.
toolbar = getattr(fig.canvas, "toolbar", None)
if toolbar is not None and getattr(toolbar, "mode", ""):
return
x, y = float(event.xdata), float(event.ydata)
if len(state["points"]) < 3:
state["points"].append((x, y))
else:
# 3 points already there — replace the nearest one. Lets the user
# nudge pre-loaded targets in --redo mode, or correct a bad click.
dists = [(x - px) ** 2 + (y - py) ** 2 for px, py in state["points"]]
i_nearest = min(range(3), key=dists.__getitem__)
state["points"][i_nearest] = (x, y)
redraw_points()
def on_key(event):
k = event.key or ""
if k in ("escape", "q"):
state["action"] = "quit"
plt.close(fig)
elif k == "r":
state["points"].clear()
redraw_points()
elif k == "d":
state["action"] = "skip"
plt.close(fig)
elif k == "u":
state["action"] = "unusable"
plt.close(fig)
elif k == "enter":
if len(state["points"]) == 3:
state["action"] = "save"
plt.close(fig)
elif k == ".":
reload_frame(state["frame_idx"] + 25)
elif k == ",":
reload_frame(state["frame_idx"] - 25)
elif k == "]":
reload_frame(state["frame_idx"] + big_step)
elif k == "[":
reload_frame(state["frame_idx"] - big_step)
elif k == "#":
if n_frames > 0:
reload_frame(n_frames // 2)
fig.canvas.mpl_connect("button_press_event", on_click)
fig.canvas.mpl_connect("key_press_event", on_key)
update_title()
plt.show() # blocks until the figure is closed
if state["action"] == "save":
return {
"action": "save",
"frame_idx": state["frame_idx"],
"points": state["points"],
}
if state["action"] == "unusable":
return {"action": "unusable", "frame_idx": state["frame_idx"]}
if state["action"] in ("skip", "quit"):
return {"action": state["action"]}
# Window closed via the WM "X" button — treat as quit so the loop stops
return {"action": "quit"}
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--redo", action="store_true",
help="re-pick videos that already have JSON sidecars",
)
parser.add_argument(
"--frame", type=int, default=125,
help="default frame index to display (default 125 ≈ 5 s @ 25 fps)",
)
parser.add_argument(
"--limit", type=int, default=None,
help="only process the first N videos",
)
args = parser.parse_args()
if not INVENTORY_CSV.exists():
sys.exit(
f"Inventory not found at {INVENTORY_CSV}. "
"Run build_video_inventory.py first."
)
inv = pd.read_csv(INVENTORY_CSV)
todo = inv[inv["in_xlsx"] & ~inv["already_tracked"]].copy()
todo = todo.sort_values(
["session_date", "machine_name", "session_time"]
).reset_index(drop=True)
TARGETS_DIR.mkdir(parents=True, exist_ok=True)
def sidecar_for(mp4_path: str) -> Path:
return TARGETS_DIR / (Path(mp4_path).stem + ".json")
if not args.redo:
todo = todo[
~todo["mp4_path"].apply(lambda p: sidecar_for(p).exists())
].reset_index(drop=True)
if args.limit:
todo = todo.head(args.limit)
n = len(todo)
if n == 0:
print("Nothing to pick. All eligible videos already have target JSONs.")
return
print(
f"Picking targets for {n} videos. "
"Window keys: ENTER=save r=reset d=skip u=unusable q=quit "
".,[]=step frame | pan/zoom via toolbar"
)
saved = skipped = unusable = 0
for i, row in todo.iterrows():
mp4 = Path(row["mp4_path"])
prefix = f"[{i + 1}/{n}] {row['machine_name']} {row['session_datetime']}"
print(f"\n{prefix}")
# If --redo and a JSON sidecar exists, pre-load its points (only for
# regular saves — unusable sidecars are left as-is and shown empty).
initial_points = None
existing = sidecar_for(row["mp4_path"])
if args.redo and existing.exists():
try:
prev = json.loads(existing.read_text())
if not prev.get("unusable") and prev.get("reference_points"):
initial_points = [tuple(p) for p in prev["reference_points"]]
print(f" pre-loaded {len(initial_points)} previous point(s)")
except Exception as e:
print(f" ! could not read previous sidecar: {e}")
result = pick_one(mp4, args.frame, prefix, initial_points=initial_points)
if result is None or result.get("action") == "quit":
print(" quitting picker.")
break
if result["action"] == "skip":
skipped += 1
print(" skipped (no JSON written, will be re-asked next run).")
continue
if result["action"] == "unusable":
try:
reason = input(" reason for marking unusable (Enter to skip): ").strip()
except EOFError:
reason = ""
payload = {
"video_path": str(mp4),
"unusable": True,
"reason": reason,
"marked_at": dt.datetime.now().isoformat(timespec="seconds"),
}
out_path = sidecar_for(row["mp4_path"])
out_path.write_text(json.dumps(payload, indent=2))
unusable += 1
print(f" marked unusable → {out_path.name}")
continue
if result["action"] == "save":
payload = {
"video_path": str(mp4),
"frame_index": int(result["frame_idx"]),
"reference_points": [list(map(int, p)) for p in result["points"]],
"order": ["top", "corner", "left"],
"picked_at": dt.datetime.now().isoformat(timespec="seconds"),
}
out_path = sidecar_for(row["mp4_path"])
out_path.write_text(json.dumps(payload, indent=2))
saved += 1
print(f" saved → {out_path.name}")
remaining = n - saved - skipped - unusable
print(
f"\nDone. saved={saved} unusable={unusable} "
f"skipped(this run)={skipped} remaining={remaining}"
)
if __name__ == "__main__":
main()
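The click handling in `on_click` (append until three points exist, then replace the nearest one) can be tested without a GUI. This is a sketch only: `add_or_nudge` is a hypothetical pure function that returns a new list, whereas the picker mutates `state["points"]` in place.

```python
from __future__ import annotations


def add_or_nudge(
    points: list[tuple[float, float]], x: float, y: float
) -> list[tuple[float, float]]:
    """Append a click until there are 3 points, then replace the nearest one."""
    updated = list(points)
    if len(updated) < 3:
        updated.append((x, y))
        return updated
    # Squared distances are enough for picking the minimum.
    dists = [(x - px) ** 2 + (y - py) ** 2 for px, py in updated]
    nearest = min(range(3), key=dists.__getitem__)
    updated[nearest] = (x, y)
    return updated


# Three clicks in TOP, CORNER, LEFT order, then a fourth click near CORNER.
pts: list[tuple[float, float]] = []
for click in [(100.0, 50.0), (100.0, 200.0), (20.0, 200.0)]:
    pts = add_or_nudge(pts, *click)
nudged = add_or_nudge(pts, 103.0, 198.0)
```

Because a nudge replaces a point in its existing slot, the TOP/CORNER/LEFT order recorded in the sidecar's `order` field stays valid after corrections.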

283
scripts/track_videos.py Normal file

@ -0,0 +1,283 @@
"""Headless offline tracker.
Reads target JSONs produced by `pick_targets.py`, builds the 6 ROIs of the
HD mating arena from the L-shape reference points, runs ethoscope's
`MultiFlyTracker` against the merged.mp4 file via `MovieVirtualCamera`, and
writes a SQLite DB to `TRACKING_OUTPUT_DIR/<video_basename>_tracking.db`.
Idempotent: skips videos whose tracking DB already exists (unless --redo).
Usage:
python track_videos.py # process all videos with target JSON
python track_videos.py --redo # re-track even if DB exists
python track_videos.py --jobs 4 # run up to 4 videos in parallel
python track_videos.py --max-duration 1800 # cap each video at 30 min (sec)
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import sys
import traceback
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import numpy as np
# Import ethoscope from the local source tree (no pip install).
ETHOSCOPE_SRC = Path("/home/gg/Code/ethoscope_project/ethoscope/src/ethoscope")
sys.path.insert(0, str(ETHOSCOPE_SRC))
from config import TARGETS_DIR, TRACKING_OUTPUT_DIR # noqa: E402
from tracking_geometry import HD_FG_DATA, compute_roi_polygons # noqa: E402
def build_rois_from_targets(reference_points):
"""Wrap the shared geometry into ethoscope `ROI` objects."""
from ethoscope.core.roi import ROI
polys = compute_roi_polygons(reference_points)
return [ROI(poly.reshape((1, 4, 2)), idx=i + 1) for i, poly in enumerate(polys)]
def track_one(json_path: Path, output_dir: Path, max_duration: float | None,
redo: bool) -> tuple[str, str]:
"""Track a single video. Returns (status, message). Run in subprocess.
Statuses: "ok", "skip", "error".
"""
# Re-import inside subprocess so each worker has its own ethoscope state.
import sys as _sys
_sys.path.insert(0, str(ETHOSCOPE_SRC))
import cv2
from ethoscope.core.monitor import Monitor
from ethoscope.hardware.input.cameras import MovieVirtualCamera
from ethoscope.io.sqlite import SQLiteResultWriter
from ethoscope.trackers.multi_fly_tracker import MultiFlyTracker
import time as _time
class BGRMovieCamera(MovieVirtualCamera):
"""MovieVirtualCamera that keeps BGR frames AND retries on transient
read failures.
Two reasons for the override:
1. MultiFlyTracker calls cv2.cvtColor(img, COLOR_BGR2GRAY) without
checking whether img is already grayscale, so we must feed it
3-channel input.
2. cv2.VideoCapture.read() can return False on transient I/O hiccups
(NFS contention when 8 workers pull big mp4s in parallel) without
the file actually being at EOF. A naive "False -> StopIteration"
handling makes the tracker silently exit mid-video and write a
short, lying DB. We retry a few times and only treat persistent
failures within the *interior* of the video as real EOF.
"""
_retry_count = 5
_retry_backoff_s = 0.25
_eof_safety_frames = 50 # near end-of-file, treat False as legitimate
def _next_image(self):
for attempt in range(self._retry_count):
ret, frame = self.capture.read()
if ret and frame is not None:
return frame # BGR, untouched
# If we're near the genuine end of the file, accept it.
if (
self._has_end_of_file
and self._frame_idx >= self._total_n_frames - self._eof_safety_frames
):
return None
# Otherwise, this is a suspected transient hiccup — back off
# and try again. The capture is still open; cv2 will pick up
# the next decoded frame.
_time.sleep(self._retry_backoff_s)
return None # truly persistent failure
payload = json.loads(json_path.read_text())
if payload.get("unusable"):
reason = payload.get("reason") or "no reason given"
return "skip", f"marked unusable: {reason}"
video_path = Path(payload["video_path"])
if not video_path.exists():
return "error", f"video missing: {video_path}"
out_db = output_dir / f"{video_path.stem}_tracking.db"
if out_db.exists() and not redo:
return "skip", f"DB exists: {out_db.name}"
if out_db.exists():
out_db.unlink()
rois = build_rois_from_targets(payload["reference_points"])
cam_kwargs = {"use_wall_clock": False}
if max_duration is not None:
cam_kwargs["max_duration"] = max_duration
cam = BGRMovieCamera(str(video_path), **cam_kwargs)
metadata = {
"machine_id": payload.get("machine_uuid", "unknown"),
"machine_name": payload.get("machine_name", "unknown"),
"date_time": int(payload.get("session_epoch", 0)),
"frame_width": cam.width,
"frame_height": cam.height,
"version": "offline-tracker-1",
"experimental_info": "{}",
"selected_options": json.dumps({
"tracker": "MultiFlyTracker",
"template": "HD_Mating_Arena_6_ROIS",
"fg_data": HD_FG_DATA,
"maxN": 2,
}),
"hardware_info": "{}",
"reference_points": str([list(map(int, p)) for p in payload["reference_points"]]),
"backup_filename": out_db.name,
"result_writer_type": "SQLite3",
"sqlite_source_path": str(out_db),
}
tracker_data = {
"maxN": 2,
"visualise": False,
"fg_data": HD_FG_DATA,
"adaptive_threshold": True,
"min_fg_threshold": 10,
"max_fg_threshold": 50,
}
db_credentials = {"name": str(out_db)}
rw = SQLiteResultWriter(
db_credentials, rois, metadata=metadata,
make_dam_like_table=False, take_frame_shots=False, erase_old_db=True,
)
monit = Monitor(
cam, MultiFlyTracker, rois,
reference_points=payload["reference_points"],
data=tracker_data,
)
try:
with rw as result_writer:
monit.run(result_writer=result_writer, drawer=None, verbose=False)
except Exception:
return "error", traceback.format_exc(limit=5)
finally:
try:
cam._close()
except Exception:
pass
if not out_db.exists():
return "error", "tracking finished but DB was not created"
# Post-tracking sanity check: did we cover most of the source video?
# If not (cv2 retry exhausted, codec corruption, etc.), reject the DB so
# it doesn't get cached as "done" — better an explicit failure than a
# silent partial write.
# NOTE: 25.0 is a hard-coded frame rate; this assumes every source video
# was recorded at 25 fps.
expected_ms = (cam._total_n_frames / 25.0) * 1000.0
if max_duration is not None:
expected_ms = min(expected_ms, max_duration * 1000.0)
completeness_threshold = 0.90 # require ≥ 90 % of expected duration
# Use MAX(t) across all ROIs — a single ROI can run dry early if its fly
# stops moving, so the latest detection anywhere in the arena is the
# better signal of how far the iterator actually got.
import sqlite3 as _sqlite3
try:
_con = _sqlite3.connect(f"file:{out_db}?mode=ro", uri=True)
t_max = 0
for _i in range(1, 7):
_v = _con.execute(f"SELECT MAX(t) FROM ROI_{_i}").fetchone()[0]
if _v and _v > t_max:
t_max = _v
_con.close()
except Exception:
t_max = 0
if expected_ms > 0 and t_max < expected_ms * completeness_threshold:
out_db.unlink()
for sidecar in (str(out_db) + "-wal", str(out_db) + "-shm"):
Path(sidecar).unlink(missing_ok=True)
ratio = t_max / expected_ms if expected_ms else 0
return (
"error",
f"short output: t_max={t_max} ms vs expected {int(expected_ms)} ms "
f"({ratio*100:.0f}%); DB removed",
)
return "ok", str(out_db)
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--redo", action="store_true", help="re-track even if DB exists")
parser.add_argument("--jobs", type=int, default=1, help="parallel workers")
parser.add_argument(
"--max-duration", type=float, default=None,
help="cap each video at this many seconds (default: full video)",
)
parser.add_argument("--limit", type=int, default=None, help="process only first N")
parser.add_argument("--video", type=str, default=None,
help="track a single video (mp4 path); requires its target JSON")
args = parser.parse_args()
TRACKING_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
if args.video:
stem = Path(args.video).stem
json_path = TARGETS_DIR / f"{stem}.json"
if not json_path.exists():
sys.exit(f"No target JSON for {args.video}: expected {json_path}")
jsons = [json_path]
else:
jsons = sorted(TARGETS_DIR.glob("*.json"))
if args.limit:
jsons = jsons[: args.limit]
if not jsons:
print("No target JSONs found. Run pick_targets.py first.")
return
print(f"Tracking {len(jsons)} videos (jobs={args.jobs}, redo={args.redo}).")
n_ok = n_skip = n_err = 0
if args.jobs <= 1:
for jp in jsons:
print(f"{jp.name}", flush=True)
status, msg = track_one(jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo)
print(f" {status}: {msg.splitlines()[-1] if msg else ''}", flush=True)
n_ok += status == "ok"
n_skip += status == "skip"
n_err += status == "error"
else:
with ProcessPoolExecutor(max_workers=args.jobs) as ex:
futs = {
ex.submit(track_one, jp, TRACKING_OUTPUT_DIR, args.max_duration, args.redo): jp
for jp in jsons
}
for fut in as_completed(futs):
jp = futs[fut]
try:
status, msg = fut.result()
except Exception as e:
status, msg = "error", f"future raised: {e}"
print(f" {jp.name}: {status}: {msg.splitlines()[-1] if msg else ''}",
flush=True)
n_ok += status == "ok"
n_skip += status == "skip"
n_err += status == "error"
print(f"\nDone. ok={n_ok} skipped={n_skip} errors={n_err}")
sys.exit(0 if n_err == 0 else 1)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
main()


@ -0,0 +1,71 @@
"""Shared HD-mating-arena ROI geometry, used by both pick_targets.py
(for live overlay) and track_videos.py (for actual tracking).
Pure numpy + cv2; no ethoscope dependency.
"""
from __future__ import annotations
import itertools
import cv2
import numpy as np
# Layout from
# ethoscope/.../roi_builders/roi_templates/builtin/HD_Mating_Arena_6_ROIS.json
HD_MATING_ARENA = {
"n_rows": 2,
"n_cols": 3,
"top_margin": -0.21,
"bottom_margin": -0.13,
"left_margin": 0.05,
"right_margin": 0.05,
"horizontal_fill": 0.85,
"vertical_fill": 1.3,
}
HD_FG_DATA = {
"sample_size": 400,
"normal_limits": [800, 2000],
"tolerance": 0.8,
}
def compute_roi_polygons(reference_points, layout=HD_MATING_ARENA):
"""Map 3 L-shape reference points to 6 ROI polygons, in the order ROI 1..6.
Reference points must be ordered:
[TOP, CORNER, LEFT]
matching ethoscope's dst_points = [(0, -1), (0, 0), (-1, 0)].
Returns:
list[np.ndarray] # 6 arrays, each shape (4, 2), int32, in image coords
"""
ref = np.asarray(reference_points, dtype=np.float32)
if ref.shape != (3, 2):
raise ValueError(f"reference_points must be 3x2, got shape {ref.shape}")
dst_points = np.array([(0, -1), (0, 0), (-1, 0)], dtype=np.float32)
wrap_mat = cv2.getAffineTransform(dst_points, ref)
n_col = layout["n_cols"]
n_row = layout["n_rows"]
tm, bm = layout["top_margin"], layout["bottom_margin"]
lm, rm = layout["left_margin"], layout["right_margin"]
hf, vf = layout["horizontal_fill"], layout["vertical_fill"]
y_positions = (np.arange(n_row) * 2.0 + 1) * (1 - tm - bm) / (2 * n_row) + tm
x_positions = (np.arange(n_col) * 2.0 + 1) * (1 - lm - rm) / (2 * n_col) + lm
centres = [np.array([x, y]) for x, y in itertools.product(x_positions, y_positions)]
sign_mat = np.array([[-1, -1], [+1, -1], [+1, +1], [-1, +1]])
xy_size = np.array([hf / float(n_col), vf / float(n_row)]) / 2.0
rectangles = [sign_mat * xy_size + c for c in centres]
shift = np.dot(wrap_mat, [1, 1, 0]) - ref[1]
polys = []
for r in rectangles:
r3 = np.append(r, np.zeros((4, 1)), axis=1)
mapped = np.dot(wrap_mat, r3.T).T - shift
polys.append(mapped.astype(np.int32))
return polys


@ -51,6 +51,90 @@ See `docs/bimodal_hypothesis.md` for detailed methodology.
- [ ] Consider converting pixel distances to physical units (need calibration)
- [ ] The second notebook (`flies_analysis.ipynb`) re-runs from DB extraction - consider deprecating
## Phase: Offline Tracking of 2024 Video Backlog (added 2026-04-27)
### Recap
Tracked so far: 5 sessions, all from 2025-07-15 (machines 076/145/268). The DBs in
`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template
`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).
The metadata file `../all_video_info_merged.xlsx` indexes a different set of
experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines,
63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked
sessions are in this xlsx — these are fresh recordings to track.**
Inventory: see `data/metadata/video_inventory.csv` (built by
`scripts/build_video_inventory.py`).
- 1163 video sessions on disk under `/mnt/ethoscope_data/videos/`
- 63/63 xlsx (date, machine) sessions have video on disk
- 129 video instances need tracking (some (date, machine) have 2-4 recordings/day)
### Plan
The HD-mating-arena videos have no auto-detectable targets — the user must
manually click 3 reference points (L-shape: top, corner, left) per video. Once
all targets are picked, tracking can run in the background.
- [x] **Step 1 — Inventory**: `scripts/build_video_inventory.py`
→ `data/metadata/video_inventory.csv`. 63 (date,machine) sessions match
the xlsx, all videos found, 129 video instances need tracking.
- [x] **Step 2 — Manual target picker**: `scripts/pick_targets.py`. Loops over
videos with `in_xlsx & ~already_tracked & no JSON yet`; per video, shows
a representative frame, captures 3 clicks (top, corner, left), saves
`data/targets/<video_basename>.json`. Skips videos already done.
- [x] **Step 3 — Background tracker**: `scripts/track_videos.py`. Reads target
JSONs, builds 6 ROIs from the HD-mating-arena geometry, runs
`MovieVirtualCamera` + `MultiFlyTracker` + `SQLiteResultWriter`, writes
`data/tracked/<basename>_tracking.db`. Idempotent. Smoke-tested
end-to-end: 90s of video → ~3000 rows/ROI, areas in 800-2000 band.
- [x] **Step 4 — Tracking deps**: `requirements-tracking.txt`.
### Still TODO
- [ ] User to run `pick_targets.py` (interactive — needs DISPLAY) on the 129
pending videos.
- [ ] Run `track_videos.py --jobs 4` against the resulting JSONs.
- [ ] (Optional) `auto_detect_targets.py` exists as a fallback for videos that
DO have visible targets (saves clicks). Confirmed not useful on the
2025-07-15 batch — these arenas don't have black target dots — but worth
trying on 2024 batches before falling back to manual.
- [ ] Decide what to do with the 4 (date, machine) sessions that have 3-4
recordings/day instead of 2 (e.g. ETHOSCOPE_086 on 2024-09-17 has 4).
One of them is at lower resolution (1280x960) — likely an aborted take.
### Open questions / risks
- Some (date, machine) combos have 3-4 recordings (e.g. ETHOSCOPE_086 on
2024-09-17). Need to figure out which is the real "test" video vs aborted
takes — possibly use video duration or filename pattern.
- One mismatched-resolution file: `1280x960@25fps-20q` instead of
`1920x1088@25fps-28q` — flag for inspection.
- The original `ConstrainedMultiFlyTracker` is no longer in the ethoscope repo;
`MultiFlyTracker` is its likely successor. Validate output schema matches
what the existing analysis pipeline expects (`load_roi_data.py`, etc.).
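
One way to approach the schema validation mentioned above is a quick column check on each ROI table. This is a minimal sketch, not part of the pipeline: the helper name `missing_columns` and the default `required` columns (`t`, `x`, `y`) are assumptions, and the real list should come from whatever `load_roi_data.py` actually reads. The `ROI_1`..`ROI_6` table names follow the ones queried in `track_videos.py`.

```python
import sqlite3


def missing_columns(con, required=("t", "x", "y"), n_rois=6):
    """Return {table_name: [missing column names]} for ROI_1..ROI_<n_rois>.

    `required` is an assumed minimal column set, not the pipeline's real one.
    An absent table is reported as missing every required column.
    """
    problems = {}
    for i in range(1, n_rois + 1):
        table = f"ROI_{i}"
        # PRAGMA table_info yields one row per column; index 1 is the name.
        # It yields no rows at all when the table does not exist.
        cols = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}
        missing = [c for c in required if c not in cols]
        if missing:
            problems[table] = missing
    return problems
```

An empty dict means every ROI table exposes the required columns. Running it against one legacy DB from `data/raw/` and one freshly tracked DB would show whether the two trackers agree on those columns.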
## Discovered During Work
(Add new items here as they come up during analysis)
### Barrier-opening annotation for the 2024 batch (added 2026-04-30)
The current `flies_analysis*.ipynb` aligns trajectories to a barrier-opening
event sourced from `data/metadata/2025_07_15_barrier_opening.csv`. That file
covers only the 5 machines in the 2025-07-15 experiment. The 2024 batch
(`/mnt/data/projects/cupido/tracked/`, 113 DBs) has no equivalent annotation
yet, so all post-alignment cells silently exclude that data.
- [ ] Build a small picker that lets the user scrub through each tracking
DB / video and mark the barrier-opening frame, writing a row to a new
`data/metadata/barrier_opening_2024.csv` (or extend the existing
file with a date column).
- [ ] Once the 2024 entries exist, update `align_to_opening_time` so it
pulls from a unified `barrier_opening` table keyed by
`(date, machine_name)` rather than `machine_name` alone.
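
The composite key proposed above could look roughly like the sketch below. It is illustrative only: the column names `date`, `machine_name` and `opening_s`, the function name `load_barrier_openings`, and the rows in `example` are all placeholders rather than real annotations.

```python
import csv
import io


def load_barrier_openings(csv_text: str) -> dict:
    """Index barrier-opening times by (date, machine_name).

    Column names are hypothetical; adapt them to the real CSV header.
    """
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"].strip(), row["machine_name"].strip())
        if key in table:
            # A repeated key would make the alignment ambiguous.
            raise ValueError(f"duplicate barrier-opening entry for {key}")
        table[key] = float(row["opening_s"])
    return table


# Dummy rows: the same machine name on two dates stays distinguishable.
example = """date,machine_name,opening_s
2025-07-15,MACHINE_A,312.0
2024-09-17,MACHINE_A,298.5
"""
openings = load_barrier_openings(example)
```

Keying on the pair keeps a machine that was reused across experiments from picking up the opening time of the wrong session, which is the failure mode of keying on `machine_name` alone.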
### Metadata vocabulary normalization (done 2026-04-30)
The xlsx had inconsistent labels for control flies (`'naïve'`, `'niave'`,
`'untrained'` plus trailing whitespace). All sources now use a single
canonical `'naive'`. Normalization happens in
`scripts/export_video_db_index.py` so re-running it from the xlsx always
produces a clean TSV. The 2025-07-15 legacy CSV
(`data/metadata/2025_07_15_metadata_fixed.csv`) was edited in place from
`'untrained'` to `'naive'`.
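
For reference, the kind of normalization described above can be sketched as follows. This is not the code in `scripts/export_video_db_index.py`; the function name `normalize_condition` and the alias set are assumptions, and lower-casing every other label is an extra choice made here for illustration.

```python
import unicodedata

# Hypothetical alias set, built from the variants listed above.
_NAIVE_ALIASES = {"naive", "niave", "untrained"}


def normalize_condition(label: str) -> str:
    """Map the known control-fly spellings to the canonical 'naive'."""
    # Decompose accented characters, then drop the combining marks,
    # so that "naïve" becomes "naive".
    decomposed = unicodedata.normalize("NFKD", label)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = unaccented.strip().lower()
    return "naive" if cleaned in _NAIVE_ALIASES else cleaned
```

Applying a function like this once at export time, rather than in each notebook, is what keeps the TSV clean on every re-run from the xlsx.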