{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03 · Your first real analysis: trained vs naive\n", "\n", "In notebook 02 we explored a single database. Now we'll work with **all\n", "of them at once**, compute a simple per-fly metric, and ask the central\n", "question of the project:\n", "\n", "> **Do trained males behave differently from naïve males in the testing\n", "> session?**\n", "\n", "By the end you'll have:\n", "\n", "- loaded every (fly, session) trace into one big DataFrame using the\n", " project's helper function;\n", "- reduced each trace to one number per fly (the *median inter-fly\n", " distance*);\n", "- compared the trained group against the naïve group with a histogram\n", " and a non-parametric statistical test;\n", "- learnt enough to start asking your own questions.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n# - DATA_DIR : where the project's data files live (read-only data volume)\n# - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading everything at once — but carefully\n", "\n", "`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n", "and returns one big DataFrame. **It can be slow and memory-hungry**\n", "(the full batch is ~200 million rows). Always start small.\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": "# Load the metadata TSV first — it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n", "with **just one species and just the testing sessions**, because:\n", "\n", "1. mixing species is a confound (different species behave differently);\n", "2. the question is about behaviour after training, so the testing session\n", " is the relevant one;\n", "3. starting small means we can iterate quickly.\n", "\n", "You can come back later and broaden this filter.\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n", "sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n", "\n", "# We're loading every session for these flies, but the loader stamps each\n", "# row with a 'session' column so we can filter to testing afterwards.\n", "print(f\"selected metadata rows: {len(sub)}\")\n", "print(sub[\"male\"].value_counts())\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# This will take a minute or two and use a chunk of RAM. Be patient.\n", "data = load_roi_data(sub)\n", "print(f\"loaded shape: {data.shape}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What did we get?\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "data.head(3)\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# How big is each session, in tracking samples?\n", "data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Restrict to the testing session\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "testing = data[data[\"session\"] == \"testing\"].copy()\n", "print(f\"testing samples: {len(testing):,}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reduce each trace to one number\n", "\n", "Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n", "We can't compare distributions of millions of points across two groups\n", "in any meaningful way. So we **collapse each (date, machine_name, ROI)\n", "trace into a single summary number** — here, the median distance between\n", "the two flies during testing.\n", "\n", "Why median rather than mean? Because tracker glitches (one fly\n", "temporarily lost) can produce huge spikes that the median ignores.\n", "[Why medians beat means in noisy data\n", "(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Step 1 — per-frame distance.\n", "# Take only frames with exactly 2 flies (so we have a real distance).\n", "two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n", "\n", "# For each (track, t), compute the distance between the two rows.\n", "def distance_for_frame(g):\n", " g = g.sort_values(\"id\").reset_index(drop=True)\n", " return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n", "\n", "# This is the slow step. With ~3 M frames it takes a while.\n", "per_frame = (\n", " two_fly\n", " .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n", " .apply(distance_for_frame)\n", " .reset_index(name=\"distance_px\")\n", ")\n", "print(f\"per-frame distance rows: {len(per_frame):,}\")\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Step 2 — one number per (date, machine_name, ROI).\n", "per_fly = (\n", " per_frame\n", " .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n", " .median()\n", " .reset_index(name=\"median_distance_px\")\n", ")\n", "\n", "# Each row now is \"one fly during testing\", with its median distance.\n", "print(per_fly.shape)\n", "per_fly.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sanity check: how many flies per group?\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "per_fly[\"male\"].value_counts()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the numbers are very different, your statistical comparison will be\n", "underpowered for one side. Note them down.\n", "\n", "## Plot the distributions\n", "\n", "The first thing to do with two groups is to **look at them**. Don't trust\n", "a p-value before you've seen the histogram.\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 5))\n", "\n", "bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n", "\n", "for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n", " sub = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n", " ax.hist(sub, bins=bins, alpha=0.6, label=f\"{label} (n={len(sub)})\", color=color)\n", "\n", "ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n", "ax.set_ylabel(\"number of flies\")\n", "ax.set_title(\"Trained vs naïve — Melanogaster/CS — testing session\")\n", "ax.legend()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What you might see:**\n", "\n", "- If the trained group's distribution is shifted to **higher** distances,\n", " trained males are spending less time near the female (i.e. they\n", " learned to give up).\n", "- If the two distributions look identical, no learning effect was\n", " measurable with this metric — but that doesn't mean there's no effect,\n", " just that this particular summary didn't capture it.\n", "- A **bimodal** trained distribution (two humps) would mean some males\n", " learned and others didn't — the \"individual differences\" story in\n", " `docs/bimodal_hypothesis.md`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add a stat test\n", "\n", "A formal comparison. Because group sizes are small and we don't know if\n", "the data are normally distributed, the\n", "[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n", "is a safer default than the classic t-test.\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n", "naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n", "\n", "stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n", "\n", "print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n", "print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n", "print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How to read this**: the p-value is the probability of seeing a\n", "difference at least this big *if there were really no difference*. By\n", "convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n", "But never trust a p-value without:\n", "\n", "1. eyeballing the histogram first (you did);\n", "2. reporting the **effect size**, not just the p-value (e.g. the\n", " difference of medians);\n", "3. understanding that p-values\n", " [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "\n", "- **Pick a different metric**: instead of median distance, try fraction\n", " of time the flies were within 50 px (a \"close-proximity\" metric), or\n", " the maximum velocity per fly. (Velocity needs identity tracking, which\n", " is harder — see `flies_analysis_simple.ipynb` cell 16 for an example.)\n", "- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n", " compare. Does the effect generalize? Where is it strongest?\n", "- **Look at the bimodality**: a kernel density plot\n", " ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n", " will show humps better than a histogram.\n", "- **Time inside the session**: maybe the difference only shows up in the\n", " first few minutes (right after the female is introduced). Slice\n", " `per_frame` by `t` before aggregating.\n", "- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n", " testing the \"some flies learn, others don't\" hypothesis.\n", "\n", "When you write your own analysis, **save it as a new notebook** (don't\n", "edit this one). Copy the setup cells, change the question, change the\n", "plot. That's how analysis projects grow.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A note on iteration speed\n", "\n", "The pipeline above is correct but **slow** because we apply a Python\n", "function to every (track, t) group. If you find yourself re-running the\n", "same expensive computation a lot, save the intermediate result to disk:\n", "\n", "```python\n", "per_frame.to_parquet(\"per_frame_distance.parquet\")\n", "# next time:\n", "per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n", "```\n", "\n", "`parquet` is a fast columnar format. `pip install pyarrow` if your\n", "environment doesn't have it.\n", "\n", "There are also vectorized ways to compute these distances ~100× faster\n", "that avoid `groupby().apply()`. Don't worry about that yet — get a\n", "correct answer first, optimize only if you find yourself waiting.\n" ] } ] }