cupido/notebooks/getting_started/02_explore_one_database.ipynb
Giorgio Gilestro f176224150 Move metadata xlsx/TSV to /mnt/data/projects/cupido/
Consolidates everything bulky (tracking DBs, targets, metadata
spreadsheet) under a single DATA_VOLUME root outside the ownCloud-synced
repo. Notebooks now use a visible DATA_DIR = Path(...) idiom rather than
walking up the filesystem with PROJECT_ROOT.parent — easier for students
with no Python background to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 08:47:15 +01:00

{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 02 · A first look at one tracking database\n",
"\n",
"In this notebook we open **one** of the SQLite databases that the tracker\n",
"produced and look at what's actually inside. By the end you'll be able to:\n",
"\n",
"- list the tables in a `.db` file\n",
"- read one ROI's tracking trace into a DataFrame\n",
"- plot a fly's path through the arena\n",
"- count how many flies are visible at each moment\n",
"- compute a simple distance between the two flies in a ROI\n",
"\n",
"If you're curious how SQLite works, the\n",
"[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n",
"worth reading. For our purposes, **SQLite is just a file that contains\n",
"several tables you can query like a DataFrame**.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We import the libraries we need. `sqlite3` is part of Python's standard\n",
"library — no install needed.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sqlite3\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find the databases\n",
"\n",
"The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": "# Single root for all the project's data. Build sub-paths from it.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\ntracked_dir = DATA_DIR / \"tracked\"\n\ndb_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n\nprint(f\"Found {len(db_files)} tracking DBs.\")\nprint(\"\\nFirst 5 by name:\")\nfor db in db_files[:5]:\n print(f\" {db.name}\")\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filename encodes the date, time, machine UUID, video resolution, and\n",
"the suffix `_tracking.db`. For example:\n",
"\n",
"```\n",
"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
"└────┬─────┘└──┬──┘ └────────────────┬───────────────┘└──────┬───────┘\n",
" date time machine UUID video format\n",
"```\n",
"\n",
"Pick one to explore. Feel free to change the index.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"db_path = db_files[0]\n",
"print(\"Working with:\", db_path.name)\n"
]
},
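{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filename layout described above can be unpacked with plain string\n",
"methods. The cell below is a minimal sketch that assumes every filename\n",
"follows the example exactly; `example_name` and the other variable names\n",
"are illustrative and not part of the tracker's own tooling.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Example filename copied from the text above (assumed to be representative)\n",
"example_name = \"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\"\n",
"\n",
"# A double underscore separates the date/time/UUID part from the video part\n",
"head, video_part = example_name.split(\"__\", 1)\n",
"date_str, time_str, machine_uuid = head.split(\"_\")\n",
"\n",
"recorded_at = datetime.strptime(f\"{date_str} {time_str}\", \"%Y-%m-%d %H-%M-%S\")\n",
"video_format = video_part.split(\"_\", 1)[0]  # text before the next underscore\n",
"\n",
"print(\"recorded at: \", recorded_at)\n",
"print(\"machine UUID:\", machine_uuid)\n",
"print(\"video format:\", video_format)\n"
]
},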
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open the database\n",
"\n",
"We open it **read-only** as a safety measure. The `?mode=ro` flag is\n",
"SQLite's read-only switch.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What tables are inside?\n",
"\n",
"Every SQLite database has a system table called `sqlite_master` that\n",
"lists everything. We can query it like any other table.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tables = pd.read_sql_query(\n",
" \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n",
")\n",
"tables\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see tables like `ROI_1`, `ROI_2`, …, `ROI_6` (one per\n",
"sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
"`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
"\n",
"## Read one ROI\n",
"\n",
"`pd.read_sql_query()` runs an SQL query against the connection and\n",
"returns a DataFrame. The query `SELECT * FROM ROI_1` means *\"give me all\n",
"columns and all rows from the table called ROI_1\"*.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n",
"print(f\"shape: {roi1.shape}\") # (rows, columns)\n",
"roi1.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding the columns\n",
"\n",
"Refer back to notebook `00_welcome` for the full column reference. Quick\n",
"recap of the important ones:\n",
"\n",
"- `t`: time in **milliseconds** since the video started.\n",
"- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
" **top-left** corner. y grows downward.\n",
"- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
" rough proxy for \"how big does this blob look\" — useful for spotting\n",
" frames where the tracker merged two flies into one big detection.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Quick descriptive stats\n",
"roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum `t` should be 0 (start of the video). The maximum tells you\n",
"how long the recording was. Convert ms to minutes by dividing by 60000:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"duration_min = roi1[\"t\"].max() / 60_000\n",
"print(f\"Session length: {duration_min:.1f} minutes\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many flies per frame?\n",
"\n",
"If two flies are visible in this ROI, we get **two rows per `t`**. Let's\n",
"check.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"flies_per_frame = roi1.groupby(\"t\").size()\n",
"print(flies_per_frame.value_counts().sort_index())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
"had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
"overlapping or one is occluded — that's something we'll handle properly\n",
"in the next notebook.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot one fly's trajectory\n",
"\n",
"We'll plot the position over the first 5 minutes (300 000 ms). For\n",
"clarity we'll only look at frames where there were 2 flies and pick the\n",
"**first** of the two (sorted by `id`) as \"fly 1\" — this is a rough\n",
"heuristic; identity tracking is harder than it sounds.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Filter to the first 5 minutes\n",
"sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n",
"\n",
"# Pick \"fly 1\" by taking the first row at each time point\n",
"fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n",
"plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n",
"plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n",
"plt.gca().invert_yaxis() # because pixel y grows downward\n",
"plt.xlabel(\"x (pixels)\")\n",
"plt.ylabel(\"y (pixels)\")\n",
"plt.title(f\"Fly 1 trajectory — first 5 min — {db_path.name[:30]}…\")\n",
"plt.legend()\n",
"plt.axis(\"equal\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see a tangle of lines confined to a roughly rectangular ROI.\n",
"That tangle is the fly walking around its sub-arena.\n",
"\n",
"Notice we did `plt.gca().invert_yaxis()` — that's because in image\n",
"coordinates y grows downward, but humans expect plots where y grows\n",
"upward. Without it the plot would be vertically flipped.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot position over time\n",
"\n",
"A trajectory plot collapses time into \"shape on a page\". To see *when*\n",
"things happen we need time on the x-axis.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n",
"\n",
"axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
"axes[0].set_ylabel(\"x (px)\")\n",
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}…\")\n",
"\n",
"axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
"axes[1].set_ylabel(\"y (px)\")\n",
"axes[1].set_xlabel(\"time (s)\")\n",
"axes[1].invert_yaxis()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bursts of variation = active fly. Long flat stretches = the fly is sitting\n",
"still. You'll come to recognize courtship vs idling by eye after a while.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distance between the two flies\n",
"\n",
"Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
"Euclidean distance between them: `sqrt((x1-x2)² + (y1-y2)²)`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"two_fly_frames = roi1.groupby(\"t\").filter(lambda g: len(g) == 2)\n",
"two_fly_frames = two_fly_frames.sort_values([\"t\", \"id\"])\n",
"\n",
"# Pivot so each row is one timepoint with x1, y1, x2, y2\n",
"def pair_up(g):\n",
" g = g.reset_index(drop=True)\n",
" return pd.Series({\n",
" \"x1\": g.loc[0, \"x\"], \"y1\": g.loc[0, \"y\"],\n",
" \"x2\": g.loc[1, \"x\"], \"y2\": g.loc[1, \"y\"],\n",
" })\n",
"\n",
"paired = two_fly_frames.groupby(\"t\").apply(pair_up).reset_index()\n",
"paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n",
"paired.head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"plt.figure(figsize=(12, 4))\n",
"plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n",
"plt.xlabel(\"time (s)\")\n",
"plt.ylabel(\"inter-fly distance (px)\")\n",
"plt.title(\"Distance between the two flies in ROI 1\")\n",
"plt.show()\n"
]
},
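{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pairing cell above calls `pair_up` once per timepoint, which can be\n",
"slow on long recordings. Below is a sketch of a vectorized alternative on\n",
"a tiny made-up table (`toy` and its values are invented for illustration).\n",
"It assumes the same `t`, `id`, `x`, `y` columns and no missing coordinates.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up detections: two flies at t=0 and t=80, only one at t=40\n",
"toy = pd.DataFrame({\n",
"    \"t\":  [0, 0, 40, 80, 80],\n",
"    \"id\": [1, 2, 1, 2, 1],\n",
"    \"x\":  [0.0, 3.0, 5.0, 10.0, 10.0],\n",
"    \"y\":  [0.0, 4.0, 5.0, 0.0, 6.0],\n",
"})\n",
"\n",
"# Count detections per timepoint without a Python-level loop\n",
"n_per_t = toy.groupby(\"t\")[\"t\"].transform(\"size\")\n",
"pairs = toy[n_per_t == 2].sort_values([\"t\", \"id\"])\n",
"\n",
"# First and last row of each pair, both indexed by t\n",
"fly_a = pairs.groupby(\"t\")[[\"x\", \"y\"]].first()\n",
"fly_b = pairs.groupby(\"t\")[[\"x\", \"y\"]].last()\n",
"\n",
"toy_distance = np.hypot(fly_a[\"x\"] - fly_b[\"x\"], fly_a[\"y\"] - fly_b[\"y\"])\n",
"toy_distance  # t=0 gives 5.0, t=80 gives 6.0; t=40 is dropped (one detection)\n"
]
},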
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the kind of trace that drives the rest of the analysis: a male\n",
"courting a female stays close (small distance); a male giving up wanders\n",
"off (large distance). The shape of this curve is the behavioural readout.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Don't forget to close the connection\n",
"\n",
"If you opened a connection, close it when you're done. (Not strictly\n",
"necessary in a notebook — Python tidies up — but a good habit.)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn.close()\n"
]
},
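{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to make sure a connection is always closed, even if a cell raises\n",
"an error, is `contextlib.closing` from the standard library. The sketch\n",
"below uses a throwaway in-memory database (`demo_conn` and its single row\n",
"are made up) so it runs without touching the real tracking files.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sqlite3\n",
"from contextlib import closing\n",
"\n",
"import pandas as pd\n",
"\n",
"# closing() calls demo_conn.close() when the block ends.\n",
"# (A plain `with conn:` only manages transactions; it does not close.)\n",
"with closing(sqlite3.connect(\":memory:\")) as demo_conn:\n",
"    demo_conn.execute(\"CREATE TABLE ROI_1 (t INTEGER, x REAL, y REAL)\")\n",
"    demo_conn.execute(\"INSERT INTO ROI_1 VALUES (0, 1.0, 2.0)\")\n",
"    demo = pd.read_sql_query(\"SELECT * FROM ROI_1\", demo_conn)\n",
"\n",
"# The DataFrame stays usable after the connection is closed\n",
"demo\n"
]
},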
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n",
" and re-run the trajectory plot. Is the arena bigger / smaller? Why\n",
" might that be? (Hint: look at the resolution part of the filename.)\n",
"2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
"3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
"4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
" for fly 1 — when does the bounding box get unusually large?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Exercise space\n"
]
}
]
}