Move metadata xlsx/TSV to /mnt/data/projects/cupido/

Consolidates everything bulky (tracking DBs, targets, metadata
spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced
repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than
walking up the filesystem with PROJECT_ROOT.parent — easier for students
with no Python background to follow.
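
A minimal sketch of the two idioms, using the paths that appear in this
commit's notebook diff. It only builds Path objects; nothing is read from
disk, so it runs even where the data volume is not mounted.

```python
from pathlib import Path

# Old idiom: derive the project root by walking up from the current
# working directory. The result changes with where the notebook is started.
PROJECT_ROOT = Path("..").resolve().parent

# New idiom: name the data location explicitly and build paths from it.
DATA_DIR = Path("/mnt/data/projects/cupido")
tsv_path = DATA_DIR / "all_video_info_merged.tsv"

print("walked-up root:", PROJECT_ROOT)
print("metadata TSV  :", tsv_path)
```

The explicit form always yields the same path, whichever directory the
notebook is launched from.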

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Giorgio Gilestro 2026-05-01 08:47:15 +01:00
parent ec56e51bf9
commit f176224150
8 changed files with 102 additions and 160 deletions


@ -16,13 +16,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03 \u00b7 Your first real analysis: trained vs naive\n",
"# 03 · Your first real analysis: trained vs naive\n",
"\n",
"In notebook 02 we explored a single database. Now we'll work with **all\n",
"of them at once**, compute a simple per-fly metric, and ask the central\n",
"question of the project:\n",
"\n",
"> **Do trained males behave differently from na\u00efve males in the testing\n",
"> **Do trained males behave differently from naïve males in the testing\n",
"> session?**\n",
"\n",
"By the end you'll have:\n",
@ -31,7 +31,7 @@
" project's helper function;\n",
"- reduced each trace to one number per fly (the *median inter-fly\n",
" distance*);\n",
"- compared the trained group against the na\u00efve group with a histogram\n",
"- compared the trained group against the naïve group with a histogram\n",
" and a non-parametric statistical test;\n",
"- learnt enough to start asking your own questions.\n"
]
@ -48,27 +48,13 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"\n",
"# Tell Python where to find the project's helper modules.\n",
"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
"\n",
"from load_roi_data import load_roi_data\n"
]
"source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n# - DATA_DIR : where the project's data files live (read-only data volume)\n# - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading everything at once \u2014 but carefully\n",
"## Loading everything at once — but carefully\n",
"\n",
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
@ -80,12 +66,7 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Load the metadata TSV first \u2014 it's small and fast.\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"print(f\"metadata rows: {len(meta)}\")\n"
]
"source": "# Load the metadata TSV first — it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n"
},
{
"cell_type": "markdown",
@ -180,7 +161,7 @@
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"We can't compare distributions of millions of points across two groups\n",
"in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
"trace into a single summary number** \u2014 here, the median distance between\n",
"trace into a single summary number** here, the median distance between\n",
"the two flies during testing.\n",
"\n",
"Why median rather than mean? Because tracker glitches (one fly\n",
@ -195,7 +176,7 @@
"execution_count": null,
"outputs": [],
"source": [
"# Step 1 \u2014 per-frame distance.\n",
"# Step 1 per-frame distance.\n",
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
"\n",
@ -220,7 +201,7 @@
"execution_count": null,
"outputs": [],
"source": [
"# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
"# Step 2 one number per (date, machine_name, ROI).\n",
"per_fly = (\n",
" per_frame\n",
" .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
@ -278,7 +259,7 @@
"\n",
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
"ax.set_ylabel(\"number of flies\")\n",
"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
"ax.set_title(\"Trained vs naïve — Melanogaster/CS — testing session\")\n",
"ax.legend()\n",
"plt.show()\n"
]
@ -293,10 +274,10 @@
" trained males are spending less time near the female (i.e. they\n",
" learned to give up).\n",
"- If the two distributions look identical, no learning effect was\n",
" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
" measurable with this metric but that doesn't mean there's no effect,\n",
" just that this particular summary didn't capture it.\n",
"- A **bimodal** trained distribution (two humps) would mean some males\n",
" learned and others didn't \u2014 the \"individual differences\" story in\n",
" learned and others didn't the \"individual differences\" story in\n",
" `docs/bimodal_hypothesis.md`.\n"
]
},
@ -353,7 +334,7 @@
"- **Pick a different metric**: instead of median distance, try fraction\n",
" of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
" the maximum velocity per fly. (Velocity needs identity tracking, which\n",
" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
" is harder see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
" compare. Does the effect generalize? Where is it strongest?\n",
"- **Look at the bimodality**: a kernel density plot\n",
@ -389,10 +370,10 @@
"`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
"environment doesn't have it.\n",
"\n",
"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
"There are also vectorized ways to compute these distances ~100× faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet get a\n",
"correct answer first, optimize only if you find yourself waiting.\n"
]
}
]
}
}