Move metadata xlsx/TSV to /mnt/data/projects/cupido/
Consolidates everything bulky (tracking DBs, targets, metadata spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than walking up the filesystem with PROJECT_ROOT.parent, which is easier for students with no Python background to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent ec56e51bf9
commit f176224150
8 changed files with 102 additions and 160 deletions
@@ -16,13 +16,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# 03 \u00b7 Your first real analysis: trained vs naive\n",
+"# 03 · Your first real analysis: trained vs naive\n",
 "\n",
 "In notebook 02 we explored a single database. Now we'll work with **all\n",
 "of them at once**, compute a simple per-fly metric, and ask the central\n",
 "question of the project:\n",
 "\n",
-"> **Do trained males behave differently from na\u00efve males in the testing\n",
+"> **Do trained males behave differently from naïve males in the testing\n",
 "> session?**\n",
 "\n",
 "By the end you'll have:\n",
@@ -31,7 +31,7 @@
 " project's helper function;\n",
 "- reduced each trace to one number per fly (the *median inter-fly\n",
 " distance*);\n",
-"- compared the trained group against the na\u00efve group with a histogram\n",
+"- compared the trained group against the naïve group with a histogram\n",
 " and a non-parametric statistical test;\n",
 "- learnt enough to start asking your own questions.\n"
 ]
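The hunk above mentions a non-parametric statistical test, but this diff does not show which test the notebook runs. As one common choice, a Mann-Whitney U test could be sketched as follows. The numbers are invented for illustration and are not project data.

```python
from scipy import stats

# Invented median inter-fly distances (px), one number per fly.
naive = [31.0, 35.0, 28.0, 40.0, 33.0, 37.0, 30.0, 36.0]
trained = [52.0, 61.0, 47.0, 58.0, 66.0, 49.0, 55.0, 63.0]

# Two-sided test: are the two samples drawn from the same distribution?
result = stats.mannwhitneyu(trained, naive, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.4g}")
```

The test compares ranks rather than raw values, so it does not assume that distances are normally distributed.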
@@ -48,27 +48,13 @@
 "metadata": {},
 "execution_count": null,
 "outputs": [],
-"source": [
-"import sys\n",
-"from pathlib import Path\n",
-"\n",
-"import numpy as np\n",
-"import pandas as pd\n",
-"import matplotlib.pyplot as plt\n",
-"from scipy import stats\n",
-"\n",
-"# Tell Python where to find the project's helper modules.\n",
-"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
-"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
-"\n",
-"from load_roi_data import load_roi_data\n"
-]
+"source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n# - DATA_DIR : where the project's data files live (read-only data volume)\n# - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Loading everything at once \u2014 but carefully\n",
+"## Loading everything at once — but carefully\n",
 "\n",
 "`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
 "and returns one big DataFrame. **It can be slow and memory-hungry**\n",
@@ -80,12 +66,7 @@
 "metadata": {},
 "execution_count": null,
 "outputs": [],
-"source": [
-"# Load the metadata TSV first \u2014 it's small and fast.\n",
-"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
-"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
-"print(f\"metadata rows: {len(meta)}\")\n"
-]
+"source": "# Load the metadata TSV first — it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n"
 },
 {
 "cell_type": "markdown",
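The two hunks above replace the `PROJECT_ROOT.parent` walk with explicit `Path` constants. A minimal sketch of that idiom follows. A temporary directory stands in for `/mnt/data/projects/cupido` so the example runs anywhere, and `require_file` is a hypothetical helper (it is not part of the repository), added only to show a readable failure when the data volume is not mounted.

```python
import tempfile
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return `path` unchanged, or raise a readable error if it is missing.

    Hypothetical helper for illustration; it is not in the project's scripts/.
    """
    if not path.is_file():
        raise FileNotFoundError(f"expected data file not found: {path}")
    return path


# Stand-in for DATA_DIR = Path("/mnt/data/projects/cupido"); the notebook
# itself uses the real, fixed path.
with tempfile.TemporaryDirectory() as tmp:
    DATA_DIR = Path(tmp)

    # The "/" operator joins path components on any operating system.
    tsv_path = DATA_DIR / "all_video_info_merged.tsv"
    tsv_path.write_text("date\tmachine_name\tROI\n")

    found = require_file(tsv_path)  # passes: the file exists
    try:
        require_file(DATA_DIR / "missing.tsv")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

Defining the data location once at the top of the notebook means every later path is derived from `DATA_DIR`, so a student has a single line to change if the volume is mounted elsewhere.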
@@ -180,7 +161,7 @@
 "Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
 "We can't compare distributions of millions of points across two groups\n",
 "in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
-"trace into a single summary number** \u2014 here, the median distance between\n",
+"trace into a single summary number** — here, the median distance between\n",
 "the two flies during testing.\n",
 "\n",
 "Why median rather than mean? Because tracker glitches (one fly\n",
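A small numeric illustration of why the median is the safer summary when a trace contains tracker glitches. The distances below are invented for the example and are not taken from the project's data.

```python
from statistics import mean, median

# Invented per-frame distances (px): a steady trace around 40 px ...
clean = [38.0, 40.0, 41.0, 39.0, 42.0, 40.0, 41.0]
# ... and the same trace with two glitch frames where the tracker jumped.
glitchy = clean + [900.0, 1200.0]

clean_mean, clean_median = mean(clean), median(clean)
glitchy_mean, glitchy_median = mean(glitchy), median(glitchy)

print(f"clean:   mean={clean_mean:.1f}  median={clean_median:.1f}")
print(f"glitchy: mean={glitchy_mean:.1f}  median={glitchy_median:.1f}")
```

Two bad frames shift the mean by more than 200 px, while the median moves by only 1 px.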
@@ -195,7 +176,7 @@
 "execution_count": null,
 "outputs": [],
 "source": [
-"# Step 1 \u2014 per-frame distance.\n",
+"# Step 1 — per-frame distance.\n",
 "# Take only frames with exactly 2 flies (so we have a real distance).\n",
 "two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
 "\n",
@@ -220,7 +201,7 @@
 "execution_count": null,
 "outputs": [],
 "source": [
-"# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
+"# Step 2 — one number per (date, machine_name, ROI).\n",
 "per_fly = (\n",
 " per_frame\n",
 " .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
@@ -278,7 +259,7 @@
 "\n",
 "ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
 "ax.set_ylabel(\"number of flies\")\n",
-"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
+"ax.set_title(\"Trained vs naïve — Melanogaster/CS — testing session\")\n",
 "ax.legend()\n",
 "plt.show()\n"
 ]
@@ -293,10 +274,10 @@
 " trained males are spending less time near the female (i.e. they\n",
 " learned to give up).\n",
 "- If the two distributions look identical, no learning effect was\n",
-" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
+" measurable with this metric — but that doesn't mean there's no effect,\n",
 " just that this particular summary didn't capture it.\n",
 "- A **bimodal** trained distribution (two humps) would mean some males\n",
-" learned and others didn't \u2014 the \"individual differences\" story in\n",
+" learned and others didn't — the \"individual differences\" story in\n",
 " `docs/bimodal_hypothesis.md`.\n"
 ]
 },
@@ -353,7 +334,7 @@
 "- **Pick a different metric**: instead of median distance, try fraction\n",
 " of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
 " the maximum velocity per fly. (Velocity needs identity tracking, which\n",
-" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
+" is harder — see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
 "- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
 " compare. Does the effect generalize? Where is it strongest?\n",
 "- **Look at the bimodality**: a kernel density plot\n",
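The "close-proximity" metric suggested in the hunk above could be sketched as follows. This assumes a `per_frame` table with the columns used in the Step 2 hunk (`date`, `machine_name`, `ROI`, `male`, `distance_px`). The rows and the `male` labels below are invented stand-ins, not project data.

```python
import pandas as pd

# Invented stand-in for the notebook's `per_frame` table: one row per frame,
# with the distance between the two flies already computed.
per_frame = pd.DataFrame({
    "date": ["2024-01-01"] * 8,
    "machine_name": ["M1"] * 8,
    "ROI": [1, 1, 1, 1, 2, 2, 2, 2],
    "male": ["trained"] * 4 + ["naive"] * 4,
    "distance_px": [20.0, 80.0, 120.0, 45.0, 10.0, 30.0, 49.0, 200.0],
})

CLOSE_PX = 50  # threshold suggested in the notebook text

# A boolean column averages to the fraction of True values, so the mean of
# "distance < threshold" is the fraction of frames spent in close proximity.
close_fraction = (
    per_frame
    .assign(is_close=per_frame["distance_px"] < CLOSE_PX)
    .groupby(["date", "machine_name", "ROI", "male"])["is_close"]
    .mean()
    .rename("close_fraction")
    .reset_index()
)
print(close_fraction)
```

The fraction of frames equals the fraction of time only if frames are evenly spaced, which is an assumption worth checking against the tracker's timestamps.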
@@ -389,10 +370,10 @@
 "`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
 "environment doesn't have it.\n",
 "\n",
-"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
-"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
+"There are also vectorized ways to compute these distances ~100× faster\n",
+"that avoid `groupby().apply()`. Don't worry about that yet — get a\n",
 "correct answer first, optimize only if you find yourself waiting.\n"
 ]
 }
 ]
 }
 }
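One possible vectorized route to the per-frame distance, avoiding both `groupby().apply()` and the Python-level `lambda` in `filter`, is sketched below. It assumes a `testing` table with the columns seen in the Step 1 hunk (`date`, `machine_name`, `ROI`, `t`) plus `x` and `y`. The rows are invented, and the speed-up is not measured here.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the notebook's `testing` table: one row per detected
# fly per frame.
testing = pd.DataFrame({
    "date": ["2024-01-01"] * 5,
    "machine_name": ["M1"] * 5,
    "ROI": [1] * 5,
    "t": [0, 0, 1, 1, 2],
    "x": [0.0, 3.0, 0.0, 6.0, 1.0],
    "y": [0.0, 4.0, 0.0, 8.0, 1.0],
})

keys = ["date", "machine_name", "ROI", "t"]

# Count detections per frame without a Python-level lambda, then keep only
# frames with exactly two flies.
n_flies = testing.groupby(keys)["x"].transform("size")
two_fly = testing[n_flies == 2]

# With exactly two rows per frame, first() and last() are the two flies.
frames = two_fly.groupby(keys)
dx = frames["x"].first() - frames["x"].last()
dy = frames["y"].first() - frames["y"].last()

# Euclidean distance per frame, back in a flat table.
per_frame = np.hypot(dx, dy).rename("distance_px").reset_index()
print(per_frame)
```

Note that `first()` and `last()` skip missing values by default, so this sketch assumes `x` and `y` contain no NaN in the two-fly frames.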