Move metadata xlsx/TSV to /mnt/data/projects/cupido/

Consolidates everything bulky (tracking DBs, targets, metadata spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than walking up the filesystem with PROJECT_ROOT.parent, which is easier for students with no Python background to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 08:47:15 +01:00 · 2026-05-01 08:47:15 +01:00 · f176224150
commit f176224150
parent ec56e51bf9
8 changed files with 102 additions and 160 deletions
--- a/notebooks/getting_started/03_compare_trained_vs_naive.ipynb
+++ b/notebooks/getting_started/03_compare_trained_vs_naive.ipynb
@@ -16,13 +16,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# 03 \u00b7 Your first real analysis: trained vs naive\n",
+    "# 03 · Your first real analysis: trained vs naive\n",
    "\n",
    "In notebook 02 we explored a single database. Now we'll work with **all\n",
    "of them at once**, compute a simple per-fly metric, and ask the central\n",
    "question of the project:\n",
    "\n",
-    "> **Do trained males behave differently from na\u00efve males in the testing\n",
+    "> **Do trained males behave differently from naïve males in the testing\n",
    "> session?**\n",
    "\n",
    "By the end you'll have:\n",
@@ -31,7 +31,7 @@
    "  project's helper function;\n",
    "- reduced each trace to one number per fly (the *median inter-fly\n",
    "  distance*);\n",
-    "- compared the trained group against the na\u00efve group with a histogram\n",
+    "- compared the trained group against the naïve group with a histogram\n",
    "  and a non-parametric statistical test;\n",
    "- learnt enough to start asking your own questions.\n"
   ]
@@ -48,27 +48,13 @@
   "metadata": {},
   "execution_count": null,
   "outputs": [],
-   "source": [
-    "import sys\n",
-    "from pathlib import Path\n",
-    "\n",
-    "import numpy as np\n",
-    "import pandas as pd\n",
-    "import matplotlib.pyplot as plt\n",
-    "from scipy import stats\n",
-    "\n",
-    "# Tell Python where to find the project's helper modules.\n",
-    "PROJECT_ROOT = Path(\"..\").resolve().parent  # this notebook is in notebooks/getting_started/\n",
-    "sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
-    "\n",
-    "from load_roi_data import load_roi_data\n"
-   ]
+   "source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n#   - DATA_DIR  : where the project's data files live (read-only data volume)\n#   - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR  = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Loading everything at once \u2014 but carefully\n",
+    "## Loading everything at once — but carefully\n",
    "\n",
    "`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
    "and returns one big DataFrame. **It can be slow and memory-hungry**\n",
@@ -80,12 +66,7 @@
   "metadata": {},
   "execution_count": null,
   "outputs": [],
-   "source": [
-    "# Load the metadata TSV first \u2014 it's small and fast.\n",
-    "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
-    "meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
-    "print(f\"metadata rows: {len(meta)}\")\n"
-   ]
+   "source": "# Load the metadata TSV first — it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n"
  },
  {
   "cell_type": "markdown",
@@ -180,7 +161,7 @@
    "Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
    "We can't compare distributions of millions of points across two groups\n",
    "in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
-    "trace into a single summary number** \u2014 here, the median distance between\n",
+    "trace into a single summary number** — here, the median distance between\n",
    "the two flies during testing.\n",
    "\n",
    "Why median rather than mean? Because tracker glitches (one fly\n",
@@ -195,7 +176,7 @@
   "execution_count": null,
   "outputs": [],
   "source": [
-    "# Step 1 \u2014 per-frame distance.\n",
+    "# Step 1 — per-frame distance.\n",
    "# Take only frames with exactly 2 flies (so we have a real distance).\n",
    "two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
    "\n",
@@ -220,7 +201,7 @@
   "execution_count": null,
   "outputs": [],
   "source": [
-    "# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
+    "# Step 2 — one number per (date, machine_name, ROI).\n",
    "per_fly = (\n",
    "    per_frame\n",
    "    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
@@ -278,7 +259,7 @@
    "\n",
    "ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
    "ax.set_ylabel(\"number of flies\")\n",
-    "ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
+    "ax.set_title(\"Trained vs naïve — Melanogaster/CS — testing session\")\n",
    "ax.legend()\n",
    "plt.show()\n"
   ]
@@ -293,10 +274,10 @@
    "  trained males are spending less time near the female (i.e. they\n",
    "  learned to give up).\n",
    "- If the two distributions look identical, no learning effect was\n",
-    "  measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
+    "  measurable with this metric — but that doesn't mean there's no effect,\n",
    "  just that this particular summary didn't capture it.\n",
    "- A **bimodal** trained distribution (two humps) would mean some males\n",
-    "  learned and others didn't \u2014 the \"individual differences\" story in\n",
+    "  learned and others didn't — the \"individual differences\" story in\n",
    "  `docs/bimodal_hypothesis.md`.\n"
   ]
  },
@@ -353,7 +334,7 @@
    "- **Pick a different metric**: instead of median distance, try fraction\n",
    "  of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
    "  the maximum velocity per fly. (Velocity needs identity tracking, which\n",
-    "  is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
+    "  is harder — see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
    "- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
    "  compare. Does the effect generalize? Where is it strongest?\n",
    "- **Look at the bimodality**: a kernel density plot\n",
@@ -389,10 +370,10 @@
    "`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
    "environment doesn't have it.\n",
    "\n",
-    "There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
-    "that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
+    "There are also vectorized ways to compute these distances ~100× faster\n",
+    "that avoid `groupby().apply()`. Don't worry about that yet — get a\n",
    "correct answer first, optimize only if you find yourself waiting.\n"
   ]
  }
 ]
-}
+}
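
Note on the notebook's closing remark about vectorized distance computation: the following is a minimal sketch of one way to avoid `groupby().filter(lambda ...)` and `groupby().apply()`, not the project's own implementation. It assumes the tracking DataFrame carries `x` and `y` position columns with no missing values, alongside the `date`, `machine_name`, `ROI` and `t` keys used in the notebook. The function name `per_frame_distance` is hypothetical.

```python
import numpy as np
import pandas as pd


def per_frame_distance(df, keys=("date", "machine_name", "ROI", "t")):
    """Distance between the two detections of every frame that has exactly two.

    Hypothetical helper: assumes `x` and `y` columns hold positions in pixels
    and contain no NaN values.
    """
    keys = list(keys)

    # Count detections per frame with a built-in aggregation instead of
    # calling a Python lambda once per group.
    n_detections = df.groupby(keys)["x"].transform("size")
    two_fly = df[n_detections == 2]

    # Each remaining frame has exactly two rows, so "first" and "last"
    # pick out the two flies' coordinates.
    pairs = (
        two_fly.groupby(keys)
        .agg(x0=("x", "first"), x1=("x", "last"),
             y0=("y", "first"), y1=("y", "last"))
        .reset_index()
    )

    # Euclidean distance computed on whole columns at once.
    pairs["distance_px"] = np.hypot(pairs["x1"] - pairs["x0"],
                                    pairs["y1"] - pairs["y0"])
    return pairs[keys + ["distance_px"]]


# Tiny synthetic example (made-up values, not project data).
demo = pd.DataFrame({
    "date": ["d"] * 5,
    "machine_name": ["m"] * 5,
    "ROI": [1] * 5,
    "t": [0, 0, 1, 2, 2],
    "x": [0.0, 3.0, 5.0, 1.0, 1.0],
    "y": [0.0, 4.0, 5.0, 1.0, 2.0],
})
out = per_frame_distance(demo)
print(out)
```

The frame at `t == 1` has a single detection and is dropped; the other two frames each yield one distance. Whether this is actually faster on the real data, and by how much, would need to be measured.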