Move metadata xlsx/TSV to /mnt/data/projects/cupido/

Consolidates everything bulky (tracking DBs, targets, metadata
spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced
repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than
walking up the filesystem with PROJECT_ROOT.parent — easier for students
with no Python background to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Giorgio Gilestro 2026-05-01 08:47:15 +01:00
parent ec56e51bf9
commit f176224150
8 changed files with 102 additions and 160 deletions
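
The path idiom described in the commit message can be sketched as below, with the old and new constructions side by side. This is only an illustration: `repo` mirrors the repository location that appears in the notebooks, and `PurePosixPath` is used (an assumption of this sketch, not of the commit) so the snippet runs without the data volume being mounted.

```python
from pathlib import PurePosixPath

# Old idiom: find the metadata file by walking up from the repository.
repo = PurePosixPath("/home/gg/ownCloud/Work/Projects/coding/cupido/tracking")
old_tsv = repo.parent / "all_video_info_merged.tsv"

# New idiom: one visible root, with every data path built from it.
DATA_DIR = PurePosixPath("/mnt/data/projects/cupido")
new_tsv = DATA_DIR / "all_video_info_merged.tsv"
tracked_dir = DATA_DIR / "tracked"

print(old_tsv)
print(new_tsv)
print(tracked_dir)
```

In the notebooks themselves a plain `Path` is used, which additionally allows calls such as `.exists()` and `.glob()`.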

View file

@ -1,8 +1,8 @@
# Processed Data
CSVs derived from the tracking DBs (`/mnt/data/projects/cupido/tracked/`)
and the merged TSV (`../../all_video_info_merged.tsv`). All files are
gitignored and regenerable.
and the merged TSV (`/mnt/data/projects/cupido/all_video_info_merged.tsv`).
All files are gitignored and regenerable.
## Files and Regeneration
@ -23,7 +23,7 @@ from load_roi_data import load_roi_data
data = load_roi_data() # full batch as one DataFrame
# Or filter the metadata first:
import pandas as pd
tsv = pd.read_csv("../../all_video_info_merged.tsv", sep="\t")
tsv = pd.read_csv("/mnt/data/projects/cupido/all_video_info_merged.tsv", sep="\t")
data = load_roi_data(tsv[tsv.species.str.contains("Melanogaster")])
```
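
One caveat with the metadata filter above, sketched on a made-up table (the rows here are hypothetical, not taken from the real TSV): if the `species` column contains empty cells, `str.contains` can yield missing values in the mask, and depending on the pandas version that mask may be rejected by boolean indexing. Passing `na=False` treats an empty cell as "no match".

```python
import pandas as pd

# Hypothetical stand-in for the metadata table; the real TSV has more columns.
tsv = pd.DataFrame(
    {
        "species": ["Melanogaster/CS", "Sechellia", None, "CS"],
        "male": ["trained", "naive", "trained", "naive"],
    }
)

# na=False makes a missing species count as "no match" in the mask.
mask = tsv["species"].str.contains("Melanogaster", na=False)
subset = tsv[mask]
print(len(subset))
```

Note that a row labelled only `CS` is not matched by `"Melanogaster"`. Whether such rows should be pooled with `Melanogaster/CS` is a project decision that this sketch does not settle.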

View file

@ -16,11 +16,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 00 \u00b7 Welcome to the Cupido fly-tracking project\n",
"# 00 · Welcome to the Cupido fly-tracking project\n",
"\n",
"Hi! You're about to start working on a project that studies how *Drosophila*\n",
"(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n",
"flies behave differently from na\u00efve ones in their later courtship.\n",
"(fruit flies) form **memories of mating experiences** and whether trained\n",
"flies behave differently from naïve ones in their later courtship.\n",
"\n",
"**You don't need any prior experience with Python or data science to follow\n",
"along.** This series of notebooks will walk you through everything, one\n",
@ -48,7 +48,7 @@
"metadata": {},
"source": [
"You should have seen `Hello, fly world!` printed and the number `2`\n",
"appear underneath. If something else happened, ask Giorgio \u2014 that's a\n",
"appear underneath. If something else happened, ask Giorgio that's a\n",
"sign the environment isn't set up right.\n",
"\n",
"If this is the very first time you're using JupyterLab, take 10 minutes\n",
@ -61,7 +61,7 @@
" (Python that the computer runs).\n",
"- The **kernel** is the running Python process behind the notebook. It\n",
" remembers everything you've defined. If something gets weird, restart\n",
" the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n",
" the kernel: top menu → *Kernel* → *Restart Kernel…*.\n",
"- `Shift + Enter` runs a cell and moves to the next one.\n",
"- `Ctrl + Enter` runs a cell and stays put.\n"
]
@ -74,7 +74,7 @@
"\n",
"Drosophila males court females with a stereotyped sequence (chasing,\n",
"wing-extension, tapping). When a male is rejected by a female (e.g.\n",
"because she's already mated), he **learns** to suppress his courtship \u2014\n",
"because she's already mated), he **learns** to suppress his courtship \n",
"even toward new, receptive females, for a while. This is a textbook\n",
"example of *non-associative learning* in invertebrates ([review on\n",
"PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n",
@ -85,7 +85,7 @@
" species recorded.)\n",
"- How long does the memory last? (training_length_hr,\n",
" consolidation_length_hr columns in the metadata.)\n",
"- Are there **individual differences** \u2014 do some males learn while others\n",
"- Are there **individual differences** do some males learn while others\n",
" don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n",
"\n",
"Your job, broadly, will be to **turn videos of flies into numbers and\n",
@ -100,17 +100,17 @@
"\n",
"1. **Training**: a male fly is placed with a non-receptive (mated) female.\n",
" He courts, gets rejected, eventually gives up.\n",
"2. *Wait* for some hours (the \"consolidation\" period \u2014 gives memory time\n",
"2. *Wait* for some hours (the \"consolidation\" period gives memory time\n",
" to form).\n",
"3. **Testing**: same male is placed with a fresh receptive female.\n",
" Does he court her vigorously, or has he learned to give up easily?\n",
"\n",
"Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n",
"Each experiment runs in an **HD mating arena** a small chamber with\n",
"6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n",
"contains one couple (a male and a female). A camera films the whole arena\n",
"from above. So one **video** gives us 6 simultaneous experiments.\n",
"\n",
"The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n",
"The setup uses [Ethoscopes](https://www.ethoscope.com/) open-source\n",
"behavioural recording boxes built in this lab. Each ethoscope is a\n",
"machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n"
]
@ -124,7 +124,7 @@
"For each video, the **tracker** (a piece of software that runs after the\n",
"recording) finds the flies frame-by-frame and writes their positions to a\n",
"**SQLite database** (a single file, ending in `.db`). One DB per video.\n",
"Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n",
"Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, …, `ROI_6` —\n",
"one per sub-arena. Each row of an ROI table is **one fly detection at one\n",
"moment in time** with these columns:\n",
"\n",
@ -139,7 +139,7 @@
"| `has_interacted` | (legacy column, mostly unused) |\n",
"\n",
"If a single ROI has two flies that the tracker can see, you'll get **two\n",
"rows with the same `t`** \u2014 one for each fly. If only one fly is detected\n",
"rows with the same `t`** one for each fly. If only one fly is detected\n",
"(maybe they're on top of each other), you'll get one row.\n",
"\n",
"That's the heart of the data. Everything else (distances, velocities,\n",
@ -149,51 +149,25 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where everything lives\n",
"\n",
"Take a moment to memorize these locations \u2014 you'll come back to them often.\n",
"\n",
"| what | where |\n",
"|---|---|\n",
"| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n",
"| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n",
"| Source video files | `/mnt/ethoscope_data/videos/` |\n",
"| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n",
"| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n",
"| Your notebooks | `notebooks/getting_started/` (this folder) |\n",
"\n",
"Let's verify a couple of these from inside Python:\n"
]
"source": "## Where everything lives\n\nTake a moment to memorize these locations — you'll come back to them often.\n\n| what | where |\n|---|---|\n| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n| The metadata table (xlsx + TSV) | `/mnt/data/projects/cupido/all_video_info_merged.tsv` |\n| Source video files | `/mnt/ethoscope_data/videos/` |\n| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n| Your notebooks | `notebooks/getting_started/` (this folder) |\n\nNotice the pattern: **everything bulky or regenerable lives under\n`/mnt/data/projects/cupido/`**. The repository itself only stores code,\ndocumentation, and small metadata files. We'll refer to that data\ndirectory as `DATA_DIR` from here on.\n\nLet's verify a couple of these from inside Python:\n"
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"targets = Path(\"/mnt/data/projects/cupido/targets\")\n",
"\n",
"n_dbs = len(list(tracked.glob(\"*_tracking.db\")))\n",
"n_jsons = len(list(targets.glob(\"*.json\")))\n",
"\n",
"print(f\"Tracking DBs available: {n_dbs}\")\n",
"print(f\"Target JSONs available: {n_jsons}\")\n"
]
"source": "from pathlib import Path\n\n# Single root for all the bulky / regenerable project data.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\n\ntracked_dir = DATA_DIR / \"tracked\"\ntargets_dir = DATA_DIR / \"targets\"\nmetadata_tsv = DATA_DIR / \"all_video_info_merged.tsv\"\n\nprint(f\"Tracking DBs available: {len(list(tracked_dir.glob('*_tracking.db')))}\")\nprint(f\"Target JSONs available: {len(list(targets_dir.glob('*.json')))}\")\nprint(f\"Metadata TSV exists: {metadata_tsv.exists()}\")\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see roughly 113 tracking DBs and 130 target JSONs. If those\n",
"numbers are zero, the storage volume isn't mounted \u2014 ask Giorgio.\n",
"numbers are zero, the storage volume isn't mounted — ask Giorgio.\n",
"\n",
"> **Note**: the tracking DBs are read-only inside the JupyterLab\n",
"> container. You can read them but not modify or delete them. That's a\n",
"> deliberate safety measure \u2014 we don't want analysis code accidentally\n",
"> deliberate safety measure — we don't want analysis code accidentally\n",
"> corrupting the source data.\n"
]
},
@ -203,26 +177,26 @@
"source": [
"## Glossary (refer back as needed)\n",
"\n",
"- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n",
" arena. There are 6 ROIs per video, numbered 1\u20136.\n",
"- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n",
"- **ROI** *region of interest*. One sub-arena inside the HD mating\n",
" arena. There are 6 ROIs per video, numbered 1–6.\n",
"- **fly** one detection in a single (t, ROI) cell. Two flies in the\n",
" same ROI at the same time = two rows with the same `t`.\n",
"- **trained** \u2014 the male had a training session before testing.\n",
"- **naive** \u2014 the male is a control (no training).\n",
"- **training session** \u2014 the recording where the male meets the\n",
"- **trained** the male had a training session before testing.\n",
"- **naive** the male is a control (no training).\n",
"- **training session** the recording where the male meets the\n",
" non-receptive female (he gets rejected).\n",
"- **testing session** \u2014 the recording where the male meets a fresh\n",
"- **testing session** the recording where the male meets a fresh\n",
" receptive female (we measure his courtship).\n",
"- **t (milliseconds)** \u2014 time within one session, starting at 0.\n",
"- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n",
"- **t (milliseconds)** time within one session, starting at 0.\n",
"- **(x, y) pixels** fly position in the image. Top-left is (0, 0); x\n",
" grows to the right, y grows **downward** (this is the image-coordinate\n",
" convention, opposite of math class).\n",
"- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n",
"- **machine_name** which ethoscope recorded the video, e.g.\n",
" `ETHOSCOPE_076`.\n",
"- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
"- **species** `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
" `Erecta`, `Willistoni`, or `CS`.\n",
"\n",
"If you bump into other terms in the code, ask. Don't guess \u2014 biology\n",
"If you bump into other terms in the code, ask. Don't guess biology\n",
"codebases pick up jargon over the years.\n"
]
},
@ -234,16 +208,16 @@
"\n",
"When you're ready, open these notebooks **in order**:\n",
"\n",
"1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n",
"1. `01_python_pandas_basics.ipynb` just enough Python and pandas to\n",
" read and manipulate tabular data.\n",
"2. `02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n",
"2. `02_explore_one_database.ipynb` open one tracking DB, plot a fly's\n",
" trajectory, see what the numbers actually look like.\n",
"3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n",
"3. `03_compare_trained_vs_naive.ipynb` your first real analysis,\n",
" comparing groups of flies.\n",
"\n",
"After those, the notebooks one level up (`flies_analysis.ipynb`,\n",
"`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n",
"previous student built \u2014 those will make sense once you've worked\n",
"previous student built those will make sense once you've worked\n",
"through the tutorials.\n",
"\n",
"Don't try to power through all of them in one sitting. Run a few cells,\n",

View file

@ -16,7 +16,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
"# 01 · Python and pandas — just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
@ -28,10 +28,10 @@
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
"- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
"- 🦘 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- 🐍 [Official Python tutorial — chapters 3–5](https://docs.python.org/3/tutorial/introduction.html)\n",
"- 🐼 [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- 📚 [Python for Data Analysis (the book)](https://wesmckinney.com/book/) — free online\n"
]
},
{
@ -90,7 +90,7 @@
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
"# A nicer way to build strings f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
@ -111,7 +111,7 @@
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0]) # first item \u2014 Python counts from 0!\n",
"print(machines[0]) # first item Python counts from 0!\n",
"print(machines[-1]) # last item\n",
"print(len(machines)) # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
@ -212,7 +212,7 @@
" return days / 7\n",
"\n",
"print(fly_age_in_weeks(14)) # 2.0\n",
"print(fly_age_in_weeks(5)) # 0.714\u2026\n"
"print(fly_age_in_weeks(5)) # 0.714…\n"
]
},
{
@ -242,12 +242,12 @@
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number \u2014 it's a **table** with rows and\n",
"Real data is rarely a single number it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** \u2014 a single column with a name.\n",
"- **`DataFrame`** \u2014 a whole table.\n",
"- **`Series`** a single column with a name.\n",
"- **`DataFrame`** a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
@ -257,17 +257,7 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Read the project's metadata TSV (Tab-Separated Values).\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"\n",
"# How big is it?\n",
"print(f\"Rows: {len(df)}\")\n",
"print(f\"Columns: {df.shape[1]}\")\n"
]
"source": "import pandas as pd\nfrom pathlib import Path\n\n# All the project's bulky data lives under /mnt/data/projects/cupido/.\n# This pattern — define one DATA_DIR variable, then build sub-paths from\n# it — is much easier to read (and to update) than hard-coding long\n# strings everywhere.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\n\n# Read the project's metadata TSV (Tab-Separated Values).\ndf = pd.read_csv(tsv_path, sep=\"\\t\")\n\n# How big is it?\nprint(f\"Rows: {len(df)}\")\nprint(f\"Columns: {df.shape[1]}\")\n"
},
{
"cell_type": "markdown",
@ -368,7 +358,7 @@
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n",
"# Combine conditions with & (and) | (or) and wrap each part in parentheses.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]

View file

@ -16,7 +16,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 02 \u00b7 A first look at one tracking database\n",
"# 02 · A first look at one tracking database\n",
"\n",
"In this notebook we open **one** of the SQLite databases that the tracker\n",
"produced and look at what's actually inside. By the end you'll be able to:\n",
@ -40,7 +40,7 @@
"## Setup\n",
"\n",
"We import the libraries we need. `sqlite3` is part of Python's standard\n",
"library \u2014 no install needed.\n"
"library no install needed.\n"
]
},
{
@ -71,15 +71,7 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n",
"\n",
"print(f\"Found {len(db_files)} tracking DBs.\")\n",
"print(\"\\nFirst 5 by name:\")\n",
"for db in db_files[:5]:\n",
" print(f\" {db.name}\")\n"
]
"source": "# Single root for all the project's data. Build sub-paths from it.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\ntracked_dir = DATA_DIR / \"tracked\"\n\ndb_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n\nprint(f\"Found {len(db_files)} tracking DBs.\")\nprint(\"\\nFirst 5 by name:\")\nfor db in db_files[:5]:\n print(f\" {db.name}\")\n"
},
{
"cell_type": "markdown",
@ -90,7 +82,7 @@
"\n",
"```\n",
"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
"\u2514\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u252c\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
"└────┬─────┘└──┬──┘ └────────────────┬───────────────┘└──────┬───────┘\n",
" date time machine UUID video format\n",
"```\n",
"\n",
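
The filename layout shown above can be taken apart with plain string methods. This is a minimal sketch that assumes every tracking DB follows the same pattern as this one example; the variable names are illustrative.

```python
from datetime import datetime

name = "2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db"

# Date and time are the first two underscore-separated fields.
date_part, time_part, rest = name.split("_", 2)
# A double underscore separates the machine UUID from the video format.
machine_uuid, video_format = rest.split("__", 1)

recorded_at = datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H-%M-%S")
print(recorded_at)
print(machine_uuid)
print(video_format)
```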
@ -152,7 +144,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n",
"You should see tables like `ROI_1`, `ROI_2`, …, `ROI_6` (one per\n",
"sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
"`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
"\n",
@ -187,7 +179,7 @@
"- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
" **top-left** corner. y grows downward.\n",
"- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
" rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n",
" rough proxy for \"how big does this blob look\" useful for spotting\n",
" frames where the tracker merged two flies into one big detection.\n"
]
},
@ -245,7 +237,7 @@
"source": [
"The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
"had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
"overlapping or one is occluded \u2014 that's something we'll handle properly\n",
"overlapping or one is occluded that's something we'll handle properly\n",
"in the next notebook.\n"
]
},
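
The per-frame count described above can be reproduced on a toy table. The sketch below builds an in-memory SQLite table named `ROI_1`. Its columns (`id`, `t`, `x`, `y`, `w`, `h`) are an assumption based on the names the notebooks mention, and all values are invented.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for one tracking DB; the values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ROI_1 (id INTEGER, t INTEGER, x INTEGER, y INTEGER, w INTEGER, h INTEGER)"
)
rows = [
    (1, 0, 100, 50, 12, 8),
    (2, 0, 140, 80, 11, 9),    # same t as the row above: two flies detected
    (3, 40, 102, 51, 20, 15),  # only one row at t=40: a single detection
    (4, 80, 105, 53, 12, 8),
    (5, 80, 138, 79, 11, 9),
]
conn.executemany("INSERT INTO ROI_1 VALUES (?, ?, ?, ?, ?, ?)", rows)

# Number of detections at each time point.
per_t = dict(conn.execute("SELECT t, COUNT(*) FROM ROI_1 GROUP BY t ORDER BY t"))
# How many time points had one detection, and how many had two.
histogram = Counter(per_t.values())
conn.close()

print(per_t)
print(dict(histogram))
```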
@ -257,7 +249,7 @@
"\n",
"We'll plot the position over the first 5 minutes (300 000 ms). For\n",
"clarity we'll only look at frames where there were 2 flies and pick the\n",
"**first** of the two (sorted by `id`) as \"fly 1\" \u2014 this is a rough\n",
"**first** of the two (sorted by `id`) as \"fly 1\" this is a rough\n",
"heuristic; identity tracking is harder than it sounds.\n"
]
},
@ -280,7 +272,7 @@
"plt.gca().invert_yaxis() # because pixel y grows downward\n",
"plt.xlabel(\"x (pixels)\")\n",
"plt.ylabel(\"y (pixels)\")\n",
"plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n",
"plt.title(f\"Fly 1 trajectory — first 5 min — {db_path.name[:30]}…\")\n",
"plt.legend()\n",
"plt.axis(\"equal\")\n",
"plt.show()\n"
@ -293,7 +285,7 @@
"You should see a tangle of lines confined to a roughly rectangular ROI.\n",
"That tangle is the fly walking around its sub-arena.\n",
"\n",
"Notice we did `plt.gca().invert_yaxis()` \u2014 that's because in image\n",
"Notice we did `plt.gca().invert_yaxis()` that's because in image\n",
"coordinates y grows downward, but humans expect plots where y grows\n",
"upward. Without it the plot would be vertically flipped.\n"
]
@ -318,7 +310,7 @@
"\n",
"axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
"axes[0].set_ylabel(\"x (px)\")\n",
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n",
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}…\")\n",
"\n",
"axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
"axes[1].set_ylabel(\"y (px)\")\n",
@ -344,7 +336,7 @@
"## Distance between the two flies\n",
"\n",
"Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
"Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n"
"Euclidean distance between them: `sqrt((x1-x2)² + (y1-y2)²)`.\n"
]
},
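
The distance formula above can be written directly with the standard library. The helper name `inter_fly_distance` is hypothetical and the coordinates are made up.

```python
import math


def inter_fly_distance(fly_a, fly_b):
    """Euclidean distance in pixels between two (x, y) detections."""
    (x1, y1), (x2, y2) = fly_a, fly_b
    return math.hypot(x1 - x2, y1 - y2)


print(inter_fly_distance((100, 50), (103, 54)))  # 5.0
```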
{
@ -399,7 +391,7 @@
"## Don't forget to close the connection\n",
"\n",
"If you opened a connection, close it when you're done. (Not strictly\n",
"necessary in a notebook \u2014 Python tidies up \u2014 but a good habit.)\n"
"necessary in a notebook — Python tidies up — but a good habit.)\n"
]
},
{
@ -423,7 +415,7 @@
"2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
"3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
"4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
" for fly 1 \u2014 when does the bounding box get unusually large?\n"
" for fly 1 when does the bounding box get unusually large?\n"
]
},
{

View file

@ -16,13 +16,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03 \u00b7 Your first real analysis: trained vs naive\n",
"# 03 · Your first real analysis: trained vs naive\n",
"\n",
"In notebook 02 we explored a single database. Now we'll work with **all\n",
"of them at once**, compute a simple per-fly metric, and ask the central\n",
"question of the project:\n",
"\n",
"> **Do trained males behave differently from na\u00efve males in the testing\n",
"> **Do trained males behave differently from naïve males in the testing\n",
"> session?**\n",
"\n",
"By the end you'll have:\n",
@ -31,7 +31,7 @@
" project's helper function;\n",
"- reduced each trace to one number per fly (the *median inter-fly\n",
" distance*);\n",
"- compared the trained group against the na\u00efve group with a histogram\n",
"- compared the trained group against the naïve group with a histogram\n",
" and a non-parametric statistical test;\n",
"- learnt enough to start asking your own questions.\n"
]
@ -48,27 +48,13 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"\n",
"# Tell Python where to find the project's helper modules.\n",
"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
"\n",
"from load_roi_data import load_roi_data\n"
]
"source": "import sys\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom scipy import stats\n\n# Two locations to know about:\n# - DATA_DIR : where the project's data files live (read-only data volume)\n# - REPO_ROOT : where the code repository lives (this notebook is inside it)\n# We build both as Path objects, then derive everything else from them.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path(\"/home/gg/ownCloud/Work/Projects/coding/cupido/tracking\")\n\n# Tell Python where to find the project's helper modules (in scripts/).\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\n\nfrom load_roi_data import load_roi_data\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading everything at once \u2014 but carefully\n",
"## Loading everything at once — but carefully\n",
"\n",
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
@ -80,12 +66,7 @@
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Load the metadata TSV first \u2014 it's small and fast.\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"print(f\"metadata rows: {len(meta)}\")\n"
]
"source": "# Load the metadata TSV first — it's small and fast.\ntsv_path = DATA_DIR / \"all_video_info_merged.tsv\"\nmeta = pd.read_csv(tsv_path, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n"
},
{
"cell_type": "markdown",
@ -180,7 +161,7 @@
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"We can't compare distributions of millions of points across two groups\n",
"in any meaningful way. So we **collapse each (date, machine_name, ROI)\n",
"trace into a single summary number** \u2014 here, the median distance between\n",
"trace into a single summary number** here, the median distance between\n",
"the two flies during testing.\n",
"\n",
"Why median rather than mean? Because tracker glitches (one fly\n",
@ -195,7 +176,7 @@
"execution_count": null,
"outputs": [],
"source": [
"# Step 1 \u2014 per-frame distance.\n",
"# Step 1 per-frame distance.\n",
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
"\n",
@ -220,7 +201,7 @@
"execution_count": null,
"outputs": [],
"source": [
"# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
"# Step 2 one number per (date, machine_name, ROI).\n",
"per_fly = (\n",
" per_frame\n",
" .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
@ -278,7 +259,7 @@
"\n",
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
"ax.set_ylabel(\"number of flies\")\n",
"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
"ax.set_title(\"Trained vs naïve — Melanogaster/CS — testing session\")\n",
"ax.legend()\n",
"plt.show()\n"
]
@ -293,10 +274,10 @@
" trained males are spending less time near the female (i.e. they\n",
" learned to give up).\n",
"- If the two distributions look identical, no learning effect was\n",
" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
" measurable with this metric but that doesn't mean there's no effect,\n",
" just that this particular summary didn't capture it.\n",
"- A **bimodal** trained distribution (two humps) would mean some males\n",
" learned and others didn't \u2014 the \"individual differences\" story in\n",
" learned and others didn't the \"individual differences\" story in\n",
" `docs/bimodal_hypothesis.md`.\n"
]
},
@ -353,7 +334,7 @@
"- **Pick a different metric**: instead of median distance, try fraction\n",
" of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
" the maximum velocity per fly. (Velocity needs identity tracking, which\n",
" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
" is harder see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
" compare. Does the effect generalize? Where is it strongest?\n",
"- **Look at the bimodality**: a kernel density plot\n",
@ -389,8 +370,8 @@
"`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
"environment doesn't have it.\n",
"\n",
"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
"There are also vectorized ways to compute these distances ~100× faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet get a\n",
"correct answer first, optimize only if you find yourself waiting.\n"
]
}
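
One possible shape for the vectorized approach mentioned above, as a sketch on made-up detections rather than the notebook's actual code: take the first and last row of each `(ROI, t)` group with built-in aggregations, keep the groups with exactly two rows, and compute all distances in one call. No speed-up is measured here.

```python
import numpy as np
import pandas as pd

# Made-up detections: each (ROI, t) pair has one or two rows.
df = pd.DataFrame(
    {
        "ROI": [1, 1, 1, 1, 1],
        "t": [0, 0, 40, 80, 80],
        "x": [100, 103, 101, 0, 6],
        "y": [50, 54, 52, 0, 8],
    }
)

g = df.groupby(["ROI", "t"])
# first/last/size are built-in aggregations, so no Python function runs per group.
summary = g[["x", "y"]].first().join(g[["x", "y"]].last(), lsuffix="_a", rsuffix="_b")
summary["n"] = g.size()

# Keep only frames with exactly two detections, then compute every distance at once.
two_fly = summary[summary["n"] == 2].copy()
two_fly["distance_px"] = np.hypot(
    two_fly["x_a"] - two_fly["x_b"], two_fly["y_a"] - two_fly["y_b"]
)
print(two_fly["distance_px"].tolist())
```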

View file

@ -2,21 +2,26 @@
from pathlib import Path
# Where this code repository lives (the directory containing scripts/, notebooks/, ...).
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"
# Offline-tracking pipeline paths
VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos")
VIDEO_INFO_XLSX = PROJECT_ROOT.parent / "all_video_info_merged.xlsx"
INVENTORY_CSV = DATA_METADATA / "video_inventory.csv"
# Reason: kept on the local data volume alongside the tracking DBs (out of
# ownCloud sync). See TRACKING_OUTPUT_DIR comment below.
TARGETS_DIR = Path("/mnt/data/projects/cupido/targets")
# Reason: tracking DBs are large binary files that don't belong in
# ownCloud-synced storage (sync conflicts + bandwidth). They live on the
# local data volume instead. Regenerable from videos + target JSONs.
TRACKING_OUTPUT_DIR = Path("/mnt/data/projects/cupido/tracked")
LOGS_DIR = PROJECT_ROOT / "data" / "logs"
# Where the source videos live (read-only NFS mount).
VIDEOS_ROOT = Path("/mnt/ethoscope_data/videos")
# Where the project's bulky data lives — outside the ownCloud-synced repo so
# it doesn't churn the cloud sync. This single root holds everything that's
# big or regenerable: tracking DBs, target-point JSONs, and the metadata
# spreadsheet (xlsx + TSV).
DATA_VOLUME = Path("/mnt/data/projects/cupido")
TARGETS_DIR = DATA_VOLUME / "targets"
TRACKING_OUTPUT_DIR = DATA_VOLUME / "tracked"
VIDEO_INFO_XLSX = DATA_VOLUME / "all_video_info_merged.xlsx"
VIDEO_INFO_TSV = DATA_VOLUME / "all_video_info_merged.tsv"
# A small CSV listing every video file we know about (built locally).
INVENTORY_CSV = DATA_METADATA / "video_inventory.csv"
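
Because everything now hangs off a single root, a quick existence check becomes straightforward. The helper below is a hypothetical addition, not part of `config.py`. It lists which of the four entries named in this hunk are missing under a given root.

```python
from pathlib import Path


def missing_paths(data_volume: Path) -> list[Path]:
    """Return the expected entries under data_volume that do not exist."""
    expected = [
        data_volume / "targets",
        data_volume / "tracked",
        data_volume / "all_video_info_merged.xlsx",
        data_volume / "all_video_info_merged.tsv",
    ]
    return [p for p in expected if not p.exists()]


if __name__ == "__main__":
    for path in missing_paths(Path("/mnt/data/projects/cupido")):
        print(f"missing: {path}")
```

If all four entries are reported missing, the likeliest cause is that the volume is not mounted.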

View file

@ -26,7 +26,7 @@ from pathlib import Path
import pandas as pd
from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX
from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_TSV, VIDEO_INFO_XLSX
_TIME_RE = re.compile(r"^(\d{8})_(\d{1,2})(\d{2})?(AM|PM)$", re.IGNORECASE)
@ -138,7 +138,7 @@ def main() -> None:
parser.add_argument(
"--out",
type=Path,
default=VIDEO_INFO_XLSX.with_suffix(".tsv"),
default=VIDEO_INFO_TSV,
help="output TSV path (default: alongside the xlsx)",
)
args = parser.parse_args()
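
A small sketch of what this default change amounts to. The two constants are copied from the `config.py` hunk in this commit; the rest is illustrative. The old expression and the new constant resolve to the same path, so the behaviour is unchanged and only the spelling is more explicit.

```python
import argparse
from pathlib import Path

# Mirrors the two constants defined in config.py by this commit.
VIDEO_INFO_XLSX = Path("/mnt/data/projects/cupido/all_video_info_merged.xlsx")
VIDEO_INFO_TSV = Path("/mnt/data/projects/cupido/all_video_info_merged.tsv")

parser = argparse.ArgumentParser()
parser.add_argument("--out", type=Path, default=VIDEO_INFO_TSV)

default_args = parser.parse_args([])
custom_args = parser.parse_args(["--out", "/tmp/custom.tsv"])

# The old default and the new constant point to the same file.
print(VIDEO_INFO_XLSX.with_suffix(".tsv") == VIDEO_INFO_TSV)
print(default_args.out)
print(custom_args.out)
```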

View file

@ -13,7 +13,7 @@ from pathlib import Path
import pandas as pd
from config import VIDEO_INFO_XLSX
from config import VIDEO_INFO_TSV
# Metadata columns to copy onto every tracking sample. These are the xlsx
@ -68,7 +68,7 @@ def load_roi_data(meta: pd.DataFrame | None = None) -> pd.DataFrame:
sample. Empty if nothing could be loaded.
"""
if meta is None:
meta = pd.read_csv(VIDEO_INFO_XLSX.with_suffix(".tsv"), sep="\t")
meta = pd.read_csv(VIDEO_INFO_TSV, sep="\t")
db_cache: dict = {}
chunks: list[pd.DataFrame] = []
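
The optional-argument pattern used by `load_roi_data` can be sketched in isolation. The helper `resolve_meta` and the inline TSV are hypothetical. In the real function the fallback reads `VIDEO_INFO_TSV`.

```python
from __future__ import annotations

import io

import pandas as pd

# Tiny made-up TSV standing in for all_video_info_merged.tsv.
FAKE_TSV = (
    "machine_name\tspecies\tmale\n"
    "ETHOSCOPE_076\tMelanogaster/CS\ttrained\n"
    "ETHOSCOPE_082\tSechellia\tnaive\n"
)


def resolve_meta(meta: pd.DataFrame | None = None) -> pd.DataFrame:
    """Return the caller's table, or fall back to the full default table."""
    if meta is None:
        # The real function reads VIDEO_INFO_TSV at this point.
        meta = pd.read_csv(io.StringIO(FAKE_TSV), sep="\t")
    return meta


full = resolve_meta()
subset = resolve_meta(full[full["male"] == "trained"])
print(len(full), len(subset))
```

Passing a pre-filtered table restricts the work to the rows of interest, while calling with no argument falls back to the full metadata file.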