Add per-row include flag to TSV; expand flies_analysis_simple narrative

- export_video_db_index.py now writes a boolean `include` column (default True). Flip it to False to drop a noisy/unusable row from analysis without deleting it.
- load_roi_data filters on `include` automatically (back-compat: missing column = load everything).
- flies_analysis_simple.ipynb section headers now explain *why* each step exists (barrier alignment, body-area baseline, merged-blob heuristic, Hungarian identity tracking) rather than just naming the step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 09:09:59 +01:00 · 2026-05-01 09:09:59 +01:00 · 9f3ee24a23
commit 9f3ee24a23
parent 723d1f3682
4 changed files with 25 additions and 29 deletions
--- a/notebooks/flies_analysis_simple.ipynb
+++ b/notebooks/flies_analysis_simple.ipynb
@@ -3,11 +3,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "# Flies Behavior Analysis Pipeline\n",
-    "\n",
-    "This notebook analyzes the behavior of trained vs untrained flies based on their distance patterns."
-   ]
+   "source": "# Trained vs naïve flies — distance & velocity pipeline\n\nThis notebook is the canonical pipeline the previous student built to ask\n\"do trained males behave differently from naïve males?\" using two\ncomplementary metrics:\n\n1. **Inter-fly distance over time** — does the male stay close to the\n   female (courting) or drift away (giving up)?\n2. **Per-fly maximum velocity** — when each fly is identified across\n   frames, how fast does it move?\n\nIt runs in roughly this order:\n\n| step | what it does |\n|---|---|\n| Load   | pull every (fly, session) trace via `load_roi_data` |\n| Align  | shift each track so `t = 0` is the moment the barrier opened |\n| Area   | compute a baseline body-area to spot frames where the tracker merged two flies into one blob |\n| Distance | per-frame Euclidean distance, with the merged-blob heuristic |\n| Track   | re-identify \"fly 1\" vs \"fly 2\" frame-to-frame (Hungarian assignment) |\n| Velocity | per-fly velocity from those tracked identities |\n| Plot   | trained-vs-naïve mean curves with smoothing |\n| Stats  | t-test + Cohen's d, pre- and post-barrier-opening |\n\n**Important caveat.** Step 2 (Align) needs the **barrier-opening time\nper machine**, which we currently only have for the 2025-07-15 batch\n(`data/metadata/2025_07_15_barrier_opening.csv`). Running this notebook\non the full dataset will silently drop every machine that doesn't\nappear in that file. Annotating the 2024 batch is on the todo list —\nsee `tasks/todo.md`.\n\nThe notebook caches expensive intermediate results to\n`data/processed/*.csv` so re-running is cheap. Set\n`recalculate_distances = True` (or `recalculate_tracking = True`) to\nforce a fresh computation.\n"
  },
  {
   "cell_type": "code",
@@ -19,9 +15,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 1. Load existing CSV data"
-   ]
+   "source": "## 1. Load the tracking data\n\n`load_roi_data` opens every tracking DB referenced by the merged TSV\nand returns one big DataFrame stamped with experimental metadata\n(species, male/naïve, age, …). The TSV has a boolean `include` column\n(default `True`) — set it to `False` for any row you want to drop\n(e.g. videos that turned out to be too noisy). The loader respects\nthat flag automatically; nothing else needs to change here.\n\nIf you only want a subset, pre-filter `meta` before passing it in\n(e.g. `load_roi_data(meta[meta.species == 'Melanogaster/CS'])`).\n"
  },
  {
   "cell_type": "code",
@@ -33,9 +27,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 2. Align data using barrier opening time as time 0"
-   ]
+   "source": "## 2. Align tracks to the barrier-opening time\n\nEvery video starts at its own arbitrary moment. What matters is the\n**barrier opening** — the experimenter physically lifts a divider, the\nsexes meet, and the courtship clock starts. We define `aligned_time = 0`\nas that moment, so curves from different machines can be averaged.\n\nPer-machine opening times are hand-annotated in\n`2025_07_15_barrier_opening.csv` (one row per ethoscope machine,\n`opening_time` in seconds from the start of the video). Any machine\nnot in that file is silently dropped from the aligned data — its rows\nget `aligned_time = NaN`.\n"
  },
  {
   "cell_type": "code",
@@ -88,9 +80,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 3. Calculate median area size in rows where two flies are being tracked"
-   ]
+   "source": "## 3. Body-area baseline (used as a \"two flies merged\" detector)\n\nAt any given time the tracker may detect one fly or two flies in a\nROI. The interesting case is \"we expected two but only see one\" —\nwhich usually means the two flies are touching/overlapping and the\ntracker fused them into a single, **larger** bounding box.\n\nTo distinguish \"real single-fly frames\" (one fly hidden, true distance\nunknown) from \"merged-blob frames\" (flies effectively touching, distance\n≈ 0), we need a baseline: how big is one fly's bounding box on average?\nWe estimate that from frames where we *do* see two flies — there the\nboxes are guaranteed to be one-fly-each. The median of those areas is\nour reference value.\n"
  },
  {
   "cell_type": "code",
@@ -137,9 +127,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 4. Calculate distances taking into account area size"
-   ]
+   "source": "## 4. Per-frame inter-fly distance\n\nFor each `(machine, ROI, aligned_time)`:\n\n- **2 detections** → Euclidean distance between them (the obvious case).\n- **1 detection, large box** (area > 1.5× the median) → treat as\n  distance 0. The flies are touching, the tracker fused them.\n- **1 detection, normal box** → `NaN`. One fly is genuinely lost\n  (occluded, off-frame); we don't know the distance and shouldn't pretend.\n\nThis is computed once per group (trained/naïve) and cached to CSV. Set\n`recalculate_distances = True` to force a re-computation after changing\nany of the inputs.\n"
  },
  {
   "cell_type": "code",
@@ -151,9 +139,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 5. Plot averaged lines of trained vs untrained for the entire experiment"
-   ]
+   "source": "## 5. Trained vs naïve — average distance over the whole session\n\nFor each `aligned_time`, average the inter-fly distance across all\n(machine, ROI) tracks in the group, then smooth with a rolling mean\n(50-point window) to dampen frame-to-frame noise.\n\nA clearly-shifted curve = the groups behave differently. The vertical\ndashed line marks `t = 0` (barrier opening).\n"
  },
  {
   "cell_type": "code",
@@ -165,9 +151,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 6. Same plot but ending at time +300 seconds"
-   ]
+   "source": "## 6. Same plot, zoomed to the first 5 minutes after opening\n\nThe interesting behavioural signal is concentrated in the moments\n**right after** the barrier lifts (the male's first reaction to the\nfemale). Re-plot with `xlim` cropped to ±150 s around opening and\nending at +300 s.\n"
  },
  {
   "cell_type": "code",
@@ -179,9 +163,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## 7. Track fly identities and calculate meaningful velocity"
-   ]
+   "source": "## 7. Identity tracking → per-fly velocity\n\nInter-fly distance is *symmetric* — it doesn't matter which detection\nis \"fly 1\" vs \"fly 2\". **Velocity is not.** For velocity to mean\nanything we need to follow the same fly across consecutive frames.\n\nThe tracker only labels detections as \"id 1, id 2\" within a single\nframe; those ids can swap between consecutive frames. To stitch them\ntogether we use the **Hungarian algorithm** (`scipy.optimize.linear_sum_assignment`):\nat each `t → t+1` step, pair detections so the total displacement is\nminimised. That's the standard light-touch approach for short, simple\nmulti-object tracking — see the\n[Wikipedia entry](https://en.wikipedia.org/wiki/Hungarian_algorithm)\nfor the maths.\n\nOnce identities are stable across time, velocity is just `Δposition / Δt`\nper fly. We then compute the **maximum velocity within a 10-second\nsliding window** — that's a coarse \"is this fly active right now?\" signal.\n"
  },
  {
   "cell_type": "code",
@@ -339,9 +321,7 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": [
-    "## Summary Statistics"
-   ]
+   "source": "## Summary statistics\n\nIndependent t-test on **post-opening** distances, plus\n[Cohen's d](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) for\neffect size. P-values from large samples can be tiny even when the\neffect is small — always read p-value and Cohen's d together.\n"
  },
  {
   "cell_type": "code",