Consolidates everything bulky (tracking DBs, targets, metadata spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than walking up the filesystem with PROJECT_ROOT.parent — easier for students with no Python background to follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
379 lines
No EOL
13 KiB
Text
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03 · Your first real analysis: trained vs naive\n",
"\n",
"In notebook 02 we explored a single database. Now we'll work with **all\n",
"of them at once**, compute a simple per-fly metric, and ask the central\n",
"question of the project:\n",
"\n",
"> **Do trained males behave differently from naïve males in the testing\n",
"> session?**\n",
"\n",
"By the end you'll have:\n",
"\n",
"- loaded every (fly, session) trace into one big DataFrame using the\n",
"  project's helper function;\n",
"- reduced each trace to one number per fly (the *median inter-fly\n",
"  distance*);\n",
"- compared the trained group against the naïve group with a histogram\n",
"  and a non-parametric statistical test;\n",
"- learnt enough to start asking your own questions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n#   - DATA_DIR  : where the project's data files live (read-only data volume)\n#   - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading everything at once, but carefully\n",
"\n",
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
"(the full batch is ~200 million rows). Always start small.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": "# Load the metadata TSV first: it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n",
"with **just one species**, and after loading keep **only the testing\n",
"session**, because:\n",
"\n",
"1. mixing species is a confound (different species behave differently);\n",
"2. the question is about behaviour after training, so the testing session\n",
"   is the relevant one;\n",
"3. starting small means we can iterate quickly.\n",
"\n",
"You can come back later and broaden this filter.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Pick one species. 'Melanogaster/CS' has the most rows (127), so it is a good default.\n",
"sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n",
"\n",
"# We load every session for these flies; the loader stamps each row with a\n",
"# 'session' column, so we can filter to the testing session afterwards.\n",
"print(f\"selected metadata rows: {len(sub)}\")\n",
"print(sub[\"male\"].value_counts())\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# This will take a minute or two and use a chunk of RAM. Be patient.\n",
"data = load_roi_data(sub)\n",
"print(f\"loaded shape: {data.shape}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What did we get?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"data.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How big is each session, in tracking samples?\n",
"data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Restrict to the testing session\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"testing = data[data[\"session\"] == \"testing\"].copy()\n",
"print(f\"testing samples: {len(testing):,}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reduce each trace to one number\n",
"\n",
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"We can't compare distributions of millions of points across two groups\n",
"in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
"trace into a single summary number**: here, the median distance between\n",
"the two flies during testing.\n",
"\n",
"Why median rather than mean? Because tracker glitches (one fly\n",
"temporarily lost) can produce huge spikes that barely move the median\n",
"but can drag the mean a long way.\n",
"[Why medians beat means in noisy data\n",
"(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n"
]
},
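{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of why the median is the safer summary here, using\n",
"made-up distances (illustrative numbers, not project data). One glitch\n",
"frame drags the mean far from the typical value, while the median barely\n",
"moves:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical per-frame distances in px; the appended value mimics a tracker glitch.\n",
"clean = np.array([10.0, 11.0, 9.0, 10.0, 12.0])\n",
"glitchy = np.append(clean, 900.0)\n",
"\n",
"print(np.mean(clean), np.median(clean))      # 10.4 10.0\n",
"print(np.mean(glitchy), np.median(glitchy))  # the mean jumps, the median stays near 10\n",
"```\n"
]
},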
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 1: per-frame distance.\n",
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
"\n",
"# For each (track, t), compute the distance between the two rows.\n",
"def distance_for_frame(g):\n",
"    g = g.sort_values(\"id\").reset_index(drop=True)\n",
"    return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n",
"\n",
"# This is the slow step. With ~3 M frames it takes a while.\n",
"per_frame = (\n",
"    two_fly\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n",
"    .apply(distance_for_frame)\n",
"    .reset_index(name=\"distance_px\")\n",
")\n",
"print(f\"per-frame distance rows: {len(per_frame):,}\")\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 2: one number per (date, machine_name, ROI).\n",
"per_fly = (\n",
"    per_frame\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
"    .median()\n",
"    .reset_index(name=\"median_distance_px\")\n",
")\n",
"\n",
"# Each row is now \"one fly during testing\", with its median distance.\n",
"print(per_fly.shape)\n",
"per_fly.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sanity check: how many flies per group?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"per_fly[\"male\"].value_counts()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the group sizes are very unequal, the power of your statistical\n",
"comparison is limited by the smaller group. Note them down.\n",
"\n",
"## Plot the distributions\n",
"\n",
"The first thing to do with two groups is to **look at them**. Don't trust\n",
"a p-value before you've seen the histogram.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(10, 5))\n",
"\n",
"bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n",
"\n",
"# Use a fresh name (vals) so we don't overwrite the metadata subset `sub`.\n",
"for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n",
"    vals = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n",
"    ax.hist(vals, bins=bins, alpha=0.6, label=f\"{label} (n={len(vals)})\", color=color)\n",
"\n",
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
"ax.set_ylabel(\"number of flies\")\n",
"ax.set_title(\"Trained vs naïve (Melanogaster/CS, testing session)\")\n",
"ax.legend()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What you might see:**\n",
"\n",
"- If the trained group's distribution is shifted to **higher** distances,\n",
"  trained males typically stay farther from the female (consistent with\n",
"  having learned to give up).\n",
"- If the two distributions look identical, no learning effect was\n",
"  measurable with this metric. That doesn't mean there's no effect,\n",
"  just that this particular summary didn't capture it.\n",
"- A **bimodal** trained distribution (two humps) would suggest that some\n",
"  males learned and others didn't: the \"individual differences\" story in\n",
"  `docs/bimodal_hypothesis.md`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add a stat test\n",
"\n",
"A formal comparison. Because group sizes are small and we don't know if\n",
"the data are normally distributed, the\n",
"[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n",
"is a safer default than the classic t-test.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n",
"naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n",
"\n",
"stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n",
"\n",
"print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n",
"print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n",
"print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How to read this**: the p-value is the probability of seeing a\n",
"difference at least this big *if there were really no difference*. By\n",
"convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n",
"But never trust a p-value without:\n",
"\n",
"1. eyeballing the histogram first (you did);\n",
"2. reporting the **effect size**, not just the p-value (e.g. the\n",
"   difference of medians);\n",
"3. understanding that p-values\n",
"   [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n"
]
},
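{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to report an effect size next to the p-value, sketched on\n",
"synthetic numbers: the two arrays are randomly generated stand-ins for\n",
"`trained_vals` and `naive_vals`, not project data. The difference of\n",
"medians is in pixels; U divided by the product of the group sizes\n",
"estimates the probability that a random trained value exceeds a random\n",
"naive one (0.5 means no difference). This assumes SciPy 1.7 or newer,\n",
"where `mannwhitneyu` reports U for the first sample.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(0)\n",
"# Synthetic stand-ins for trained_vals / naive_vals (px).\n",
"trained_demo = rng.normal(loc=120.0, scale=30.0, size=40)\n",
"naive_demo = rng.normal(loc=100.0, scale=30.0, size=35)\n",
"\n",
"res = stats.mannwhitneyu(trained_demo, naive_demo, alternative='two-sided')\n",
"\n",
"# Effect size 1: difference of medians, in the metric's own units.\n",
"median_diff = np.median(trained_demo) - np.median(naive_demo)\n",
"\n",
"# Effect size 2: U / (n1 * n2), the estimated probability that a random\n",
"# trained value is larger than a random naive value (ties count as half).\n",
"prob_superiority = res.statistic / (len(trained_demo) * len(naive_demo))\n",
"\n",
"print(f'difference of medians: {median_diff:.1f} px')\n",
"print(f'P(trained > naive): {prob_superiority:.2f}  p-value: {res.pvalue:.4f}')\n",
"```\n"
]
},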
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"- **Pick a different metric**: instead of median distance, try fraction\n",
"  of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
"  the maximum velocity per fly. (Velocity needs identity tracking, which\n",
"  is harder — see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
"  compare. Does the effect generalize? Where is it strongest?\n",
"- **Look at the bimodality**: a kernel density plot\n",
"  ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n",
"  will show humps better than a histogram.\n",
"- **Time inside the session**: maybe the difference only shows up in the\n",
"  first few minutes (right after the female is introduced). Slice\n",
"  `per_frame` by `t` before aggregating.\n",
"- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n",
"  testing the \"some flies learn, others don't\" hypothesis.\n",
"\n",
"When you write your own analysis, **save it as a new notebook** (don't\n",
"edit this one). Copy the setup cells, change the question, change the\n",
"plot. That's how analysis projects grow.\n"
]
},
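{
"cell_type": "markdown",
"metadata": {},
"source": [
"The close-proximity metric from the first bullet could be sketched as\n",
"follows. The small table is a hypothetical stand-in with the same columns\n",
"as `per_frame` (its dates, machine names and distances are invented), and\n",
"the 50 px threshold is the arbitrary value suggested above, not a\n",
"calibrated one.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for per_frame: two ROIs, four frames each.\n",
"demo = pd.DataFrame({\n",
"    'date': ['2024-01-01'] * 8,\n",
"    'machine_name': ['M1'] * 8,\n",
"    'ROI': [1, 1, 1, 1, 2, 2, 2, 2],\n",
"    'male': ['trained'] * 4 + ['naive'] * 4,\n",
"    'distance_px': [30.0, 80.0, 120.0, 40.0, 20.0, 25.0, 60.0, 45.0],\n",
"})\n",
"\n",
"THRESHOLD_PX = 50  # arbitrary illustrative cut-off\n",
"\n",
"# The mean of a boolean column is the fraction of True values.\n",
"close_fraction = (\n",
"    demo.assign(is_close=demo['distance_px'] < THRESHOLD_PX)\n",
"    .groupby(['date', 'machine_name', 'ROI', 'male'])['is_close']\n",
"    .mean()\n",
"    .reset_index(name='fraction_close')\n",
")\n",
"print(close_fraction)  # ROI 1 -> 0.50, ROI 2 -> 0.75\n",
"```\n",
"\n",
"The resulting one-row-per-fly table can be fed to the same histogram and\n",
"Mann-Whitney cells in place of `median_distance_px`.\n"
]
},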
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note on iteration speed\n",
"\n",
"The pipeline above is correct but **slow** because we apply a Python\n",
"function to every (track, t) group. If you find yourself re-running the\n",
"same expensive computation a lot, save the intermediate result to disk:\n",
"\n",
"```python\n",
"per_frame.to_parquet(\"per_frame_distance.parquet\")\n",
"# next time:\n",
"per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n",
"```\n",
"\n",
"Parquet is a fast columnar file format. Run `pip install pyarrow` if your\n",
"environment doesn't have it.\n",
"\n",
"There are also vectorized ways to compute these distances that avoid\n",
"`groupby().apply()` and can be orders of magnitude faster. Don't worry\n",
"about that yet: get a correct answer first, and optimize only if you find\n",
"yourself waiting.\n"
]
}
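,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For when you do get tired of waiting, here is one possible vectorized\n",
"route, sketched on a toy table rather than the real data. It assumes the\n",
"same `id`, `x`, `y` columns and exactly two rows per frame, as in\n",
"`two_fly`; with the real data the grouping keys would be\n",
"`date`, `machine_name`, `ROI`, `t` and `male`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for two_fly: exactly two rows (two ids) per frame.\n",
"toy = pd.DataFrame({\n",
"    'ROI': [1, 1, 1, 1],\n",
"    't': [0, 0, 1, 1],\n",
"    'id': [1, 2, 1, 2],\n",
"    'x': [0.0, 3.0, 10.0, 10.0],\n",
"    'y': [0.0, 4.0, 0.0, 5.0],\n",
"})\n",
"\n",
"keys = ['ROI', 't']\n",
"grouped = toy.sort_values(keys + ['id']).groupby(keys)[['x', 'y']]\n",
"first, last = grouped.first(), grouped.last()  # one row per frame each\n",
"\n",
"# np.hypot runs on whole columns at once instead of once per group.\n",
"dist = np.hypot(first['x'] - last['x'], first['y'] - last['y'])\n",
"per_frame_demo = dist.reset_index(name='distance_px')\n",
"print(per_frame_demo)  # both frames have distance 5.0\n",
"```\n"
]
}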
]
}