Add beginner tutorial notebooks for incoming students

Four guided notebooks under notebooks/getting_started/ aimed at someone
new to Python and data science. The series progresses: project orientation
→ Python/pandas crash course → exploring one tracking DB → first
trained-vs-naive comparison using load_roi_data + Mann-Whitney U.

Each notebook leans heavily on markdown explanations, includes exercises
with empty cells, and links out to canonical references (JupyterLab,
official Python tutorial, pandas 10-min guide, Wikipedia for stats
concepts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Giorgio Gilestro 2026-04-30 18:14:17 +01:00
parent 7d09523840
commit ec56e51bf9
5 changed files with 1607 additions and 0 deletions

@ -0,0 +1,255 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 00 \u00b7 Welcome to the Cupido fly-tracking project\n",
"\n",
"Hi! You're about to start working on a project that studies how *Drosophila*\n",
"(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n",
"flies behave differently from na\u00efve ones in their later courtship.\n",
"\n",
"**You don't need any prior experience with Python or data science to follow\n",
"along.** This series of notebooks will walk you through everything, one\n",
"small step at a time.\n",
"\n",
"> **How to read these notebooks**: each notebook is split into \"cells\".\n",
"> Some cells are explanations (like this one), others are code that you\n",
"> can **run** by clicking on the cell and pressing `Shift + Enter`. Try it\n",
"> on the next cell.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# This is a code cell. Click on it and press Shift+Enter to run it.\n",
"print(\"Hello, fly world!\")\n",
"1 + 1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should have seen `Hello, fly world!` printed and the number `2`\n",
"appear underneath. If something else happened, ask Giorgio \u2014 that's a\n",
"sign the environment isn't set up right.\n",
"\n",
"If this is the very first time you're using JupyterLab, take 10 minutes\n",
"to read the [official \"Getting started with JupyterLab\"\n",
"guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).\n",
"The most important things to know are:\n",
"\n",
"- A notebook (`.ipynb` file) is a sequence of **cells**.\n",
"- Each cell is either **Markdown** (formatted text, like this) or **Code**\n",
" (Python that the computer runs).\n",
"- The **kernel** is the running Python process behind the notebook. It\n",
" remembers everything you've defined. If something gets weird, restart\n",
" the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n",
"- `Shift + Enter` runs a cell and moves to the next one.\n",
"- `Ctrl + Enter` runs a cell and stays put.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is the project about?\n",
"\n",
"Drosophila males court females with a stereotyped sequence (chasing,\n",
"wing-extension, tapping). When a male is rejected by a female (e.g.\n",
"because she's already mated), he **learns** to suppress his courtship,\n",
"even toward new, receptive females, for a while. This paradigm, known as\n",
"*courtship conditioning*, is a textbook example of *associative learning*\n",
"in invertebrates ([literature search on\n",
"PubMed](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n",
"\n",
"The lab is interested in:\n",
"\n",
"- Does this learning **transfer across species**? (We have ~7 *Drosophila*\n",
" species recorded.)\n",
"- How long does the memory last? (training_length_hr,\n",
" consolidation_length_hr columns in the metadata.)\n",
"- Are there **individual differences** \u2014 do some males learn while others\n",
" don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n",
"\n",
"Your job, broadly, will be to **turn videos of flies into numbers and\n",
"plots that answer these questions.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How an experiment works (the bird's-eye view)\n",
"\n",
"1. **Training**: a male fly is placed with a non-receptive (mated) female.\n",
" He courts, gets rejected, eventually gives up.\n",
"2. *Wait* for some hours (the \"consolidation\" period \u2014 gives memory time\n",
" to form).\n",
"3. **Testing**: same male is placed with a fresh receptive female.\n",
" Does he court her vigorously, or has he learned to give up easily?\n",
"\n",
"Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n",
"6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n",
"contains one couple (a male and a female). A camera films the whole arena\n",
"from above. So one **video** gives us 6 simultaneous experiments.\n",
"\n",
"The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n",
"behavioural recording boxes built in this lab. Each ethoscope is a\n",
"machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What does the data look like?\n",
"\n",
"For each video, the **tracker** (a piece of software that runs after the\n",
"recording) finds the flies frame-by-frame and writes their positions to a\n",
"**SQLite database** (a single file, ending in `.db`). One DB per video.\n",
"Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n",
"one per sub-arena. Each row of an ROI table is **one fly detection at one\n",
"moment in time** with these columns:\n",
"\n",
"| column | meaning |\n",
"|---|---|\n",
"| `id` | row number (auto-incremented) |\n",
"| `t` | time in **milliseconds** since the video started |\n",
"| `x`, `y` | fly position in **pixels** (top-left corner of the image is 0,0) |\n",
"| `w`, `h` | width and height of the bounding box around the fly, in pixels |\n",
"| `phi` | orientation angle of the fly |\n",
"| `is_inferred` | 1 if the position was guessed (not directly seen), 0 otherwise |\n",
"| `has_interacted` | (legacy column, mostly unused) |\n",
"\n",
"If a single ROI has two flies that the tracker can see, you'll get **two\n",
"rows with the same `t`** \u2014 one for each fly. If only one fly is detected\n",
"(maybe they're on top of each other), you'll get one row.\n",
"\n",
"That's the heart of the data. Everything else (distances, velocities,\n",
"group comparisons) is computed from these (t, x, y) traces.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where everything lives\n",
"\n",
"Take a moment to memorize these locations \u2014 you'll come back to them often.\n",
"\n",
"| what | where |\n",
"|---|---|\n",
"| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n",
"| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n",
"| Source video files | `/mnt/ethoscope_data/videos/` |\n",
"| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n",
"| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n",
"| Your notebooks | `notebooks/getting_started/` (this folder) |\n",
"\n",
"Let's verify a couple of these from inside Python:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"targets = Path(\"/mnt/data/projects/cupido/targets\")\n",
"\n",
"n_dbs = len(list(tracked.glob(\"*_tracking.db\")))\n",
"n_jsons = len(list(targets.glob(\"*.json\")))\n",
"\n",
"print(f\"Tracking DBs available: {n_dbs}\")\n",
"print(f\"Target JSONs available: {n_jsons}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see roughly 113 tracking DBs and 130 target JSONs. If those\n",
"numbers are zero, the storage volume isn't mounted \u2014 ask Giorgio.\n",
"\n",
"> **Note**: the tracking DBs are read-only inside the JupyterLab\n",
"> container. You can read them but not modify or delete them. That's a\n",
"> deliberate safety measure \u2014 we don't want analysis code accidentally\n",
"> corrupting the source data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Glossary (refer back as needed)\n",
"\n",
"- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n",
" arena. There are 6 ROIs per video, numbered 1\u20136.\n",
"- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n",
" same ROI at the same time = two rows with the same `t`.\n",
"- **trained** \u2014 the male had a training session before testing.\n",
"- **naive** \u2014 the male is a control (no training).\n",
"- **training session** \u2014 the recording where the male meets the\n",
" non-receptive female (he gets rejected).\n",
"- **testing session** \u2014 the recording where the male meets a fresh\n",
" receptive female (we measure his courtship).\n",
"- **t (milliseconds)** \u2014 time within one session, starting at 0.\n",
"- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n",
" grows to the right, y grows **downward** (this is the image-coordinate\n",
" convention, opposite of math class).\n",
"- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n",
" `ETHOSCOPE_076`.\n",
"- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
" `Erecta`, `Willistoni`, or `CS`.\n",
"\n",
"If you bump into other terms in the code, ask. Don't guess \u2014 biology\n",
"codebases pick up jargon over the years.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next\n",
"\n",
"When you're ready, open these notebooks **in order**:\n",
"\n",
"1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n",
" read and manipulate tabular data.\n",
"2. `02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n",
" trajectory, see what the numbers actually look like.\n",
"3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n",
" comparing groups of flies.\n",
"\n",
"After those, the notebooks one level up (`flies_analysis.ipynb`,\n",
"`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n",
"previous student built \u2014 those will make sense once you've worked\n",
"through the tutorials.\n",
"\n",
"Don't try to power through all of them in one sitting. Run a few cells,\n",
"read the explanation, **change a number** to see what happens, **break\n",
"something on purpose** to see the error message. That's how you learn.\n"
]
}
]
}

@ -0,0 +1,500 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
"\n",
"If you've never programmed before, don't try to memorize the syntax.\n",
"Just run each cell, read what it does, and come back when you're stuck on\n",
"something specific. The cheat sheet at the end is the only thing worth\n",
"keeping handy.\n",
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
"- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Variables\n",
"\n",
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
"not equality. Read it as \"make `name` refer to `value`\".\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"x = 5\n",
"y = 3\n",
"total = x + y\n",
"print(total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
"answer. Try it.\n",
"\n",
"Variable names may contain letters, digits, and underscores, but they\n",
"can't start with a digit, and upper and lower case are treated as\n",
"different names. Convention is lowercase `snake_case`: `mean_distance`,\n",
"not `meanDistance` or `MeanDistance`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Strings and numbers\n",
"\n",
"A **string** is text in quotes. You can join strings with `+`. You can\n",
"turn a number into a string with `str()`, and vice-versa with `int()` /\n",
"`float()`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"species = \"Drosophila melanogaster\"\n",
"n_flies = 12\n",
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
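{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going the other way (text to number) matters whenever a value arrives as\n",
"text, for example when it is read out of a file name. The sketch below\n",
"uses made-up values (`n_text` and `ratio_text` are illustrative names, not\n",
"project variables) to show `int()` and `float()` in action:\n",
"\n",
"```python\n",
"n_text = \"12\"              # digits stored as text\n",
"ratio_text = \"0.25\"\n",
"\n",
"n_flies = int(n_text)      # text -> whole number\n",
"ratio = float(ratio_text)  # text -> decimal number\n",
"\n",
"print(n_flies + 1)    # 13\n",
"print(ratio * 2)      # 0.5\n",
"print(n_text + \"1\")   # 121, because + on two strings joins them\n",
"```\n",
"\n",
"Calling `int()` on text that is not a whole number, for example\n",
"`int(\"5-7\")`, raises a `ValueError`: Python's way of saying the\n",
"conversion makes no sense.\n"
]
},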
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Lists\n",
"\n",
"A list is an ordered collection of things. Square brackets, items\n",
"separated by commas. You can mix types (but usually shouldn't).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0]) # first item \u2014 Python counts from 0!\n",
"print(machines[-1]) # last item\n",
"print(len(machines)) # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dictionaries\n",
"\n",
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
"pairs.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
"print(fly[\"species\"])\n",
"print(fly[\"age_days\"])\n",
"fly[\"alive\"] = False # add a new key\n",
"print(fly)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conditions: if / elif / else\n",
"\n",
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
"Combine with `and`, `or`, `not`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"\n",
"if distance_px < 50:\n",
"    label = \"close\"\n",
"elif distance_px < 200:\n",
"    label = \"medium\"\n",
"else:\n",
"    label = \"far\"\n",
"\n",
"print(label)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Loops\n",
"\n",
"`for x in collection:` runs the indented block once per item.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"for m in machines:\n",
"    print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
"    print(f\"{i}: {m}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Functions\n",
"\n",
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
"sends a value back to whoever called it.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"def fly_age_in_weeks(days):\n",
"    \"\"\"Return age in weeks given age in days.\"\"\"\n",
"    return days / 7\n",
"\n",
"print(fly_age_in_weeks(14)) # 2.0\n",
"print(fly_age_in_weeks(5)) # 0.714\u2026\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Importing libraries\n",
"\n",
"A library is somebody else's code. We use `import` to pull it into our\n",
"notebook.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import math\n",
"print(math.sqrt(16)) # 4.0\n",
"print(math.pi)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number \u2014 it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** \u2014 a single column with a name.\n",
"- **`DataFrame`** \u2014 a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Read the project's metadata TSV (Tab-Separated Values).\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"\n",
"# How big is it?\n",
"print(f\"Rows: {len(df)}\")\n",
"print(f\"Columns: {df.shape[1]}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Looking at the table\n",
"\n",
"`.head()` shows the first 5 rows by default and `.tail()` the last 5; pass\n",
"a number, as in `.head(3)`, to change how many. `.columns` lists the\n",
"column names. `.dtypes` shows the type of each column.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"print(\"Column names:\")\n",
"for c in df.columns:\n",
"    print(f\" {c}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Selecting columns\n",
"\n",
"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name has no spaces or weird characters and\n",
"doesn't clash with something a DataFrame already has, such as `size` or\n",
"`count`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[\"species\"].head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.species.value_counts() # how many rows per species\n"
]
},
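{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why bracket-indexing is the safer habit, here is a small sketch on\n",
"a hand-made table. The table `toy` and its column names are invented for\n",
"illustration; they are not part of the project's metadata.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy table: one column name has a space, the other\n",
"# shares its name with a built-in DataFrame attribute.\n",
"toy = pd.DataFrame({\"machine name\": [\"A\", \"B\"], \"size\": [10, 20]})\n",
"\n",
"print(toy[\"machine name\"].tolist())  # brackets cope with the space\n",
"print(toy[\"size\"].tolist())          # [10, 20], the column we asked for\n",
"print(toy.size)                      # 4: the number of cells, NOT the column\n",
"```\n",
"\n",
"`toy.size` silently returns something else instead of raising an error,\n",
"which is exactly the kind of bug that is hard to spot later.\n"
]
},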
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Selecting multiple columns\n",
"\n",
"Pass a **list** of names inside the brackets:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filtering rows\n",
"\n",
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
"Pandas keeps the rows where it's `True`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained = df[df[\"male\"] == \"trained\"]\n",
"print(f\"trained rows: {len(trained)}\")\n",
"\n",
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]
},
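{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can help to look at the condition on its own before using it as a\n",
"filter. A minimal sketch on an invented three-row table (`toy` and `mask`\n",
"are illustrative names; the values are made up):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    \"male\": [\"trained\", \"naive\", \"trained\"],\n",
"    \"species\": [\"Sechellia\", \"Sechellia\", \"Simulans\"],\n",
"})\n",
"\n",
"mask = toy[\"male\"] == \"trained\"   # one True/False per row\n",
"print(mask.tolist())              # [True, False, True]\n",
"print(mask.sum())                 # 2, because True counts as 1\n",
"print(len(toy[mask]))             # 2 rows survive the filter\n",
"```\n",
"\n",
"Summing a mask is a quick way to count matching rows without building\n",
"the filtered table.\n"
]
},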
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Grouping and counting\n",
"\n",
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
"splits the table by the values in that column and computes something per\n",
"group.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How many ROIs per (species, training condition)?\n",
"df.groupby([\"species\", \"male\"]).size()\n"
]
},
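{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only counts rows. To illustrate the `.mean()` aggregator\n",
"mentioned earlier, here is a sketch on invented numbers. The `distance_px`\n",
"column is hypothetical and does not exist in the metadata table.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    \"male\": [\"trained\", \"trained\", \"naive\", \"naive\"],\n",
"    \"distance_px\": [80.0, 120.0, 30.0, 50.0],  # made-up values\n",
"})\n",
"\n",
"# Select one numeric column after grouping, then average it per group.\n",
"mean_per_group = toy.groupby(\"male\")[\"distance_px\"].mean()\n",
"print(mean_per_group[\"naive\"])    # 40.0\n",
"print(mean_per_group[\"trained\"])  # 100.0\n",
"```\n",
"\n",
"Picking the column before aggregating keeps pandas from trying to average\n",
"text columns.\n"
]
},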
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Quick plots\n",
"\n",
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# How many rows per machine?\n",
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
"plt.ylabel(\"rows\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Exercises\n",
"\n",
"Don't skip these. They're how you find out what you actually understood.\n",
"\n",
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
"   (Hint: `df.groupby([\"date\", \"machine_name\"]).size()` has one entry per\n",
"   pair, so its `len()` is the answer.)\n",
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 1 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 2 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 3 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 4 here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cheat sheet\n",
"\n",
"```python\n",
"import pandas as pd\n",
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n",
"df.head(); df.tail(); df.shape; df.columns # peek\n",
"df[\"col\"]; df[[\"a\", \"b\"]] # select\n",
"df[df[\"col\"] == \"value\"] # filter\n",
"df.groupby(\"col\").size() # count per group\n",
"df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n",
"df[\"col\"].value_counts() # quick counts\n",
"df[\"col\"].unique() # unique values\n",
"df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n",
"df.sort_values(\"col\", ascending=False) # sort\n",
"df.plot(...) # quick plot\n",
"```\n",
"\n",
"Keep this list open when reading other people's code. Most of pandas is\n",
"just combinations of these primitives. When you need more, the official\n",
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
"is excellent.\n"
]
}
]
}

@ -0,0 +1,439 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 02 \u00b7 A first look at one tracking database\n",
"\n",
"In this notebook we open **one** of the SQLite databases that the tracker\n",
"produced and look at what's actually inside. By the end you'll be able to:\n",
"\n",
"- list the tables in a `.db` file\n",
"- read one ROI's tracking trace into a DataFrame\n",
"- plot a fly's path through the arena\n",
"- count how many flies are visible at each moment\n",
"- compute a simple distance between the two flies in a ROI\n",
"\n",
"If you're curious how SQLite works, the\n",
"[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n",
"worth reading. For our purposes, **SQLite is just a file that contains\n",
"several tables you can query like a DataFrame**.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We import the libraries we need. `sqlite3` is part of Python's standard\n",
"library \u2014 no install needed.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sqlite3\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find the databases\n",
"\n",
"The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n",
"db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n",
"\n",
"print(f\"Found {len(db_files)} tracking DBs.\")\n",
"print(\"\\nFirst 5 by name:\")\n",
"for db in db_files[:5]:\n",
"    print(f\" {db.name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filename encodes the date, time, machine UUID, video resolution, and\n",
"the suffix `_tracking.db`. For example:\n",
"\n",
"```\n",
"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
"\u2514\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u252c\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
" date time machine UUID video format\n",
"```\n",
"\n",
"Pick one to explore. Feel free to change the index.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"db_path = db_files[0]\n",
"print(\"Working with:\", db_path.name)\n"
]
},
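{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filename layout described above can also be unpacked in code. This is\n",
"a minimal sketch that assumes every file follows the same pattern as the\n",
"single example shown; that is an assumption, so check a few real names\n",
"before relying on it. The names `example_name`, `pattern`, and `started`\n",
"are illustrative only.\n",
"\n",
"```python\n",
"import re\n",
"from datetime import datetime\n",
"\n",
"example_name = (\n",
"    \"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0\"\n",
"    \"__1920x1088@25fps-28q_merged_tracking.db\"\n",
")\n",
"\n",
"# Pattern inferred from the one example above; other files may differ.\n",
"pattern = re.compile(\n",
"    r\"(?P<date>\\d{4}-\\d{2}-\\d{2})_(?P<time>\\d{2}-\\d{2}-\\d{2})_\"\n",
"    r\"(?P<uuid>[0-9a-f]+)__\"\n",
"    r\"(?P<width>\\d+)x(?P<height>\\d+)@(?P<fps>\\d+)fps\"\n",
")\n",
"\n",
"m = pattern.match(example_name)\n",
"started = datetime.strptime(f\"{m['date']} {m['time']}\", \"%Y-%m-%d %H-%M-%S\")\n",
"\n",
"print(started)     # 2024-09-17 10:32:10\n",
"print(m[\"uuid\"])   # the machine UUID part\n",
"print(int(m[\"width\"]), int(m[\"height\"]), int(m[\"fps\"]))  # 1920 1088 25\n",
"```\n",
"\n",
"If `pattern.match()` returns `None` for one of your files, the name does\n",
"not fit this assumed layout and the pattern needs adjusting.\n"
]
},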
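{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open the database\n",
"\n",
"We open it **read-only** as a safety measure. `mode=ro` is SQLite's\n",
"read-only switch; it is part of a `file:` URI, so it only takes effect\n",
"because we also pass `uri=True` to `sqlite3.connect()`.\n"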
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n"
]
},
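{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to convince yourself that a read-only connection really\n",
"refuses writes, the sketch below tries it on a throwaway database in a\n",
"temporary folder (never on a real tracking DB). The file `demo.db` and\n",
"the names `rw`, `ro`, and `write_blocked` are made up for this example,\n",
"and it assumes a path without characters that need escaping in a URI.\n",
"\n",
"```python\n",
"import sqlite3\n",
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    demo_path = Path(tmp) / \"demo.db\"  # throwaway file, not a tracking DB\n",
"\n",
"    # Create a tiny database with a normal (read-write) connection.\n",
"    rw = sqlite3.connect(demo_path)\n",
"    rw.execute(\"CREATE TABLE ROI_1 (id INTEGER PRIMARY KEY, t INTEGER)\")\n",
"    rw.commit()\n",
"    rw.close()\n",
"\n",
"    # Re-open it read-only, the same way the notebook does.\n",
"    ro = sqlite3.connect(f\"file:{demo_path}?mode=ro\", uri=True)\n",
"    try:\n",
"        ro.execute(\"INSERT INTO ROI_1 (t) VALUES (0)\")\n",
"        write_blocked = False\n",
"    except sqlite3.OperationalError as err:\n",
"        write_blocked = True\n",
"        print(\"Write refused:\", err)\n",
"    finally:\n",
"        ro.close()\n",
"\n",
"print(write_blocked)  # True\n",
"```\n",
"\n",
"Reading with `SELECT` works as usual on such a connection; only statements\n",
"that would change the file are rejected.\n"
]
},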
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What tables are inside?\n",
"\n",
"Every SQLite database has a system table called `sqlite_master` that\n",
"lists everything. We can query it like any other table.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"tables = pd.read_sql_query(\n",
"    \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n",
")\n",
"tables\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n",
"sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
"`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
"\n",
"## Read one ROI\n",
"\n",
"`pd.read_sql_query()` runs an SQL query against the connection and\n",
"returns a DataFrame. The query `SELECT * FROM ROI_1` means *\"give me all\n",
"columns and all rows from the table called ROI_1\"*.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n",
"print(f\"shape: {roi1.shape}\") # (rows, columns)\n",
"roi1.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding the columns\n",
"\n",
"Refer back to notebook `00_welcome` for the full column reference. Quick\n",
"recap of the important ones:\n",
"\n",
"- `t`: time in **milliseconds** since the video started.\n",
"- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
" **top-left** corner. y grows downward.\n",
"- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
" rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n",
" frames where the tracker merged two flies into one big detection.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Quick descriptive stats\n",
"roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum `t` should be 0 (start of the video). The maximum tells you\n",
"how long the recording was. Convert ms to minutes by dividing by 60000:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"duration_min = roi1[\"t\"].max() / 60_000\n",
"print(f\"Session length: {duration_min:.1f} minutes\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many flies per frame?\n",
"\n",
"If two flies are visible in this ROI, we get **two rows per `t`**. Let's\n",
"check.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"flies_per_frame = roi1.groupby(\"t\").size()\n",
"print(flies_per_frame.value_counts().sort_index())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
"had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
"overlapping or one is occluded \u2014 that's something we'll handle properly\n",
"in the next notebook.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot one fly's trajectory\n",
"\n",
"We'll plot the position over the first 5 minutes (300 000 ms). At each\n",
"time point we keep the **first** detection (lowest `id`) and call it\n",
"\"fly 1\". This is a rough heuristic: when only one fly is detected, that\n",
"single row is used, and nothing guarantees that the first row is always\n",
"the same animal. Identity tracking is harder than it sounds.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Filter to the first 5 minutes\n",
"sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n",
"\n",
"# Pick \"fly 1\" by taking the first row at each time point\n",
"fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n",
"plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n",
"plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n",
"plt.gca().invert_yaxis() # because pixel y grows downward\n",
"plt.xlabel(\"x (pixels)\")\n",
"plt.ylabel(\"y (pixels)\")\n",
"plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n",
"plt.legend()\n",
"plt.axis(\"equal\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see a tangle of lines confined to a roughly rectangular ROI.\n",
"That tangle is the fly walking around its sub-arena.\n",
"\n",
"Notice we did `plt.gca().invert_yaxis()` \u2014 that's because in image\n",
"coordinates y grows downward, but humans expect plots where y grows\n",
"upward. Without it the plot would be vertically flipped.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot position over time\n",
"\n",
"A trajectory plot collapses time into \"shape on a page\". To see *when*\n",
"things happen we need time on the x-axis.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n",
"\n",
"axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
"axes[0].set_ylabel(\"x (px)\")\n",
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n",
"\n",
"axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
"axes[1].set_ylabel(\"y (px)\")\n",
"axes[1].set_xlabel(\"time (s)\")\n",
"axes[1].invert_yaxis()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bursts of variation = active fly. Long flat stretches = the fly is sitting\n",
"still. You'll come to recognize courtship vs idling by eye after a while.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distance between the two flies\n",
"\n",
"Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
"Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n"
]
},
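{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the formula, here it is on two made-up points\n",
"(the coordinates are invented so the answer is easy to verify by hand:\n",
"the points differ by 3 px in x and 4 px in y, a 3-4-5 triangle):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"x1, y1 = 100, 200  # hypothetical position of one fly, in pixels\n",
"x2, y2 = 103, 204  # hypothetical position of the other fly\n",
"\n",
"by_hand = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)\n",
"with_hypot = math.hypot(x1 - x2, y1 - y2)  # same result, less typing\n",
"\n",
"print(by_hand)     # 5.0\n",
"print(with_hypot)  # 5.0\n",
"```\n",
"\n",
"`np.hypot` does the same thing for whole columns at once, which is why\n",
"the next cell uses it.\n"
]
},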
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"two_fly_frames = roi1.groupby(\"t\").filter(lambda g: len(g) == 2)\n",
"two_fly_frames = two_fly_frames.sort_values([\"t\", \"id\"])\n",
"\n",
"# Pivot so each row is one timepoint with x1, y1, x2, y2\n",
"def pair_up(g):\n",
"    g = g.reset_index(drop=True)\n",
"    return pd.Series({\n",
"        \"x1\": g.loc[0, \"x\"], \"y1\": g.loc[0, \"y\"],\n",
"        \"x2\": g.loc[1, \"x\"], \"y2\": g.loc[1, \"y\"],\n",
"    })\n",
"\n",
"paired = two_fly_frames.groupby(\"t\").apply(pair_up).reset_index()\n",
"paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n",
"paired.head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"plt.figure(figsize=(12, 4))\n",
"plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n",
"plt.xlabel(\"time (s)\")\n",
"plt.ylabel(\"inter-fly distance (px)\")\n",
"plt.title(\"Distance between the two flies in ROI 1\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the kind of trace that drives the rest of the analysis: a male\n",
"courting a female stays close (small distance); a male giving up wanders\n",
"off (large distance). The shape of this curve is the behavioural readout.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Don't forget to close the connection\n",
"\n",
"If you opened a connection, close it when you're done. (The connection is\n",
"released anyway when the kernel shuts down, but closing it explicitly is a\n",
"good habit.)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"conn.close()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n",
" and re-run the trajectory plot. Is the arena bigger / smaller? Why\n",
" might that be? (Hint: look at the resolution part of the filename.)\n",
"2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
"3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
"4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
" for fly 1 \u2014 when does the bounding box get unusually large?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Exercise space\n"
]
}
]
}

@ -0,0 +1,398 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03 \u00b7 Your first real analysis: trained vs naive\n",
"\n",
"In notebook 02 we explored a single database. Now we'll work with **all\n",
"of them at once**, compute a simple per-fly metric, and ask the central\n",
"question of the project:\n",
"\n",
"> **Do trained males behave differently from na\u00efve males in the testing\n",
"> session?**\n",
"\n",
"By the end you'll have:\n",
"\n",
"- loaded every (fly, session) trace into one big DataFrame using the\n",
" project's helper function;\n",
"- reduced each trace to one number per fly (the *median inter-fly\n",
" distance*);\n",
"- compared the trained group against the na\u00efve group with a histogram\n",
" and a non-parametric statistical test;\n",
"- learnt enough to start asking your own questions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"\n",
"# Tell Python where to find the project's helper modules.\n",
"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
"\n",
"from load_roi_data import load_roi_data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading everything at once \u2014 but carefully\n",
"\n",
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
"(the full batch is ~200 million rows). Always start small.\n"
]
},
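{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Load the metadata TSV first; it's small and fast.\n",
"# NOTE: this absolute path is specific to one machine. Change it to wherever\n",
"# all_video_info_merged.tsv lives on yours.\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",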
"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"print(f\"metadata rows: {len(meta)}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pre-filter the metadata before passing it to `load_roi_data`. We'll load\n",
"**just one species**, then keep **just the testing sessions** once the data\n",
"are in memory (the metadata filter below selects by species only), because:\n",
"\n",
"1. mixing species is a confound (different species behave differently);\n",
"2. the question is about behaviour after training, so the testing session\n",
" is the relevant one;\n",
"3. starting small means we can iterate quickly.\n",
"\n",
"You can come back later and broaden this filter.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n",
"sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n",
"\n",
"# We're loading every session for these flies, but the loader stamps each\n",
"# row with a 'session' column so we can filter to testing afterwards.\n",
"print(f\"selected metadata rows: {len(sub)}\")\n",
"print(sub[\"male\"].value_counts())\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# This will take a minute or two and use a chunk of RAM. Be patient.\n",
"data = load_roi_data(sub)\n",
"print(f\"loaded shape: {data.shape}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What did we get?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"data.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How big is each session, in tracking samples?\n",
"data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Restrict to the testing session\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"testing = data[data[\"session\"] == \"testing\"].copy()\n",
"print(f\"testing samples: {len(testing):,}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reduce each trace to one number\n",
"\n",
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"Frames from the same fly are not independent observations, so pooling\n",
"millions of them per group would make any test look far more certain\n",
"than it is. Instead we **collapse each (date, machine_name, ROI)\n",
"trace into a single summary number**: the median distance between\n",
"the two flies during testing.\n",
"\n",
"Why median rather than mean? Because tracker glitches (one fly\n",
"temporarily lost) can produce huge spikes that pull the mean but barely\n",
"move the median.\n",
"[Why medians beat means in noisy data\n",
"(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n"
]
},
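{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the median-vs-mean point concrete, here is a minimal sketch on\n",
"**invented** numbers (not taken from the tracking data): five plausible\n",
"inter-fly distances plus one glitch frame.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Invented distances in pixels; the last value mimics a tracker glitch.\n",
"distances = np.array([20.0, 22.0, 19.0, 21.0, 23.0, 900.0])\n",
"\n",
"mean_px = distances.mean()        # dragged upward by the single spike\n",
"median_px = np.median(distances)  # stays close to the typical value\n",
"\n",
"print(f\"mean:   {mean_px:.1f} px\")\n",
"print(f\"median: {median_px:.1f} px\")\n",
"```\n",
"\n",
"One bad frame out of six is enough to move the mean well away from the\n",
"other five values, while the median stays among them.\n"
]
},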
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 1: per-frame distance.\n",
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
"\n",
"# For each (track, t), compute the distance between the two rows.\n",
"def distance_for_frame(g):\n",
"    g = g.sort_values(\"id\").reset_index(drop=True)\n",
"    return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n",
"\n",
"# This is the slow step. With ~3 M frames it takes a while.\n",
"per_frame = (\n",
"    two_fly\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n",
"    .apply(distance_for_frame)\n",
"    .reset_index(name=\"distance_px\")\n",
")\n",
"print(f\"per-frame distance rows: {len(per_frame):,}\")\n"
]
},
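{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the cell above is too slow on your machine, the same per-frame distance\n",
"can be sketched without `groupby().apply()`. This is a hedged illustration\n",
"on a tiny invented table that only borrows the notebook's column names\n",
"(`date`, `machine_name`, `ROI`, `male`, `t`, `id`, `x`, `y`). It assumes\n",
"`x` and `y` are never missing, and it has not been checked against the real\n",
"`load_roi_data` output.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for `testing`: column names follow the notebook, values are invented.\n",
"toy = pd.DataFrame({\n",
"    \"date\": [\"d1\"] * 5,\n",
"    \"machine_name\": [\"m1\"] * 5,\n",
"    \"ROI\": [1] * 5,\n",
"    \"male\": [\"trained\"] * 5,\n",
"    \"t\": [0, 0, 100, 100, 200],  # the frame at t=200 has only one detection\n",
"    \"id\": [1, 2, 1, 2, 1],\n",
"    \"x\": [0.0, 3.0, 10.0, 10.0, 5.0],\n",
"    \"y\": [0.0, 4.0, 10.0, 22.0, 5.0],\n",
"})\n",
"\n",
"keys = [\"date\", \"machine_name\", \"ROI\", \"male\", \"t\"]\n",
"\n",
"# Keep frames with exactly two detections, without a Python lambda per group.\n",
"n_det = toy.groupby(keys)[\"id\"].transform(\"size\")\n",
"two = toy[n_det == 2].sort_values(keys + [\"id\"])\n",
"\n",
"# After sorting, the first and last row of each frame are the two flies.\n",
"grouped = two.groupby(keys)[[\"x\", \"y\"]]\n",
"first, last = grouped.first(), grouped.last()\n",
"\n",
"fast = np.hypot(first[\"x\"] - last[\"x\"], first[\"y\"] - last[\"y\"]).reset_index(name=\"distance_px\")\n",
"print(fast)\n",
"```\n",
"\n",
"On real data you would replace `toy` with `testing`. Compare a handful of\n",
"rows against the slower cell before trusting the result.\n"
]
},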
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Step 2: one number per (date, machine_name, ROI).\n",
"per_fly = (\n",
"    per_frame\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
"    .median()\n",
"    .reset_index(name=\"median_distance_px\")\n",
")\n",
"\n",
"# Each row now is \"one fly during testing\", with its median distance.\n",
"print(per_fly.shape)\n",
"per_fly.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sanity check: how many flies per group?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"per_fly[\"male\"].value_counts()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If one group is much smaller than the other, the power of the comparison\n",
"is limited by the smaller group. Note the counts down.\n",
"\n",
"## Plot the distributions\n",
"\n",
"The first thing to do with two groups is to **look at them**. Don't trust\n",
"a p-value before you've seen the histogram.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(10, 5))\n",
"\n",
"bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n",
"\n",
"# Use a fresh name here so we don't overwrite `sub`, the metadata subset from above.\n",
"for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n",
"    vals = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n",
"    ax.hist(vals, bins=bins, alpha=0.6, label=f\"{label} (n={len(vals)})\", color=color)\n",
"\n",
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
"ax.set_ylabel(\"number of flies\")\n",
"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
"ax.legend()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What you might see:**\n",
"\n",
"- If the trained group's distribution is shifted to **higher** distances,\n",
" trained males are spending less time near the female (i.e. they\n",
" learned to give up).\n",
"- If the two distributions look identical, no learning effect was\n",
" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
" just that this particular summary didn't capture it.\n",
"- A **bimodal** trained distribution (two humps) would mean some males\n",
" learned and others didn't \u2014 the \"individual differences\" story in\n",
" `docs/bimodal_hypothesis.md`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add a stat test\n",
"\n",
"A formal comparison. Because group sizes are small and we don't know if\n",
"the data are normally distributed, the\n",
"[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n",
"is a safer default than the classic t-test.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n",
"naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n",
"\n",
"stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n",
"\n",
"print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n",
"print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n",
"print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How to read this**: the p-value is the probability of seeing a\n",
"difference at least this big *if there were really no difference*. By\n",
"convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n",
"But never trust a p-value without:\n",
"\n",
"1. eyeballing the histogram first (you did);\n",
"2. reporting the **effect size**, not just the p-value (e.g. the\n",
" difference of medians);\n",
"3. understanding that p-values\n",
" [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n"
]
},
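{
"cell_type": "markdown",
"metadata": {},
"source": [
"Point 2 above can be sketched as follows. The values are invented\n",
"stand-ins for `trained_vals` and `naive_vals`. The sketch reports the\n",
"difference of medians plus the *probability of superiority* (how often a\n",
"randomly chosen trained fly has a larger value than a randomly chosen\n",
"naive fly). That is one common effect size to pair with a Mann-Whitney U\n",
"test, not necessarily the one this project uses.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Invented per-fly medians in pixels, for illustration only.\n",
"trained = np.array([60.0, 75.0, 80.0, 90.0, 120.0])\n",
"naive = np.array([40.0, 50.0, 55.0, 70.0])\n",
"\n",
"median_diff = np.median(trained) - np.median(naive)\n",
"\n",
"# Compare every trained value with every naive value (a 5 x 4 table).\n",
"diff = trained[:, None] - naive[None, :]\n",
"prob_superiority = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size\n",
"\n",
"# Rescale to [-1, 1]: 0 means no tendency, +1 means trained is always larger.\n",
"rank_biserial = 2 * prob_superiority - 1\n",
"\n",
"print(f\"difference of medians: {median_diff:.1f} px\")\n",
"print(f\"P(trained > naive):    {prob_superiority:.2f}\")\n",
"print(f\"rank-biserial r:       {rank_biserial:.2f}\")\n",
"```\n",
"\n",
"Reporting these numbers next to the p-value tells the reader how large the\n",
"shift is, not only whether it is distinguishable from noise.\n"
]
},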
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"- **Pick a different metric**: instead of median distance, try fraction\n",
" of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
" the maximum velocity per fly. (Velocity needs identity tracking, which\n",
" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
" compare. Does the effect generalize? Where is it strongest?\n",
"- **Look at the bimodality**: a kernel density plot\n",
" ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n",
" will show humps better than a histogram.\n",
"- **Time inside the session**: maybe the difference only shows up in the\n",
" first few minutes (right after the female is introduced). Slice\n",
" `per_frame` by `t` before aggregating.\n",
"- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n",
" testing the \"some flies learn, others don't\" hypothesis.\n",
"\n",
"When you write your own analysis, **save it as a new notebook** (don't\n",
"edit this one). Copy the setup cells, change the question, change the\n",
"plot. That's how analysis projects grow.\n"
]
},
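{
"cell_type": "markdown",
"metadata": {},
"source": [
"The close-proximity idea from the first bullet above can be sketched like\n",
"this. The table is an invented stand-in for `per_frame` (same column\n",
"names, made-up distances), and the 50 px threshold is simply the value\n",
"suggested in the text.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for `per_frame`: two flies, four frames each, invented distances.\n",
"toy_per_frame = pd.DataFrame({\n",
"    \"date\": [\"d1\"] * 4 + [\"d2\"] * 4,\n",
"    \"machine_name\": [\"m1\"] * 8,\n",
"    \"ROI\": [1] * 8,\n",
"    \"male\": [\"trained\"] * 4 + [\"naive\"] * 4,\n",
"    \"distance_px\": [30.0, 45.0, 80.0, 120.0, 10.0, 20.0, 40.0, 90.0],\n",
"})\n",
"\n",
"CLOSE_PX = 50  # threshold from the text; tune it for your arena size\n",
"\n",
"# The mean of a True/False column is the fraction of True values.\n",
"close_fraction = (\n",
"    toy_per_frame\n",
"    .assign(is_close=toy_per_frame[\"distance_px\"] < CLOSE_PX)\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"is_close\"]\n",
"    .mean()\n",
"    .reset_index(name=\"fraction_close\")\n",
")\n",
"print(close_fraction)\n",
"```\n",
"\n",
"The result has one row per fly, so it can be dropped into the same\n",
"histogram and Mann-Whitney cells in place of `median_distance_px`.\n"
]
},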
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note on iteration speed\n",
"\n",
"The pipeline above is correct but **slow** because we apply a Python\n",
"function to every (track, t) group. If you find yourself re-running the\n",
"same expensive computation a lot, save the intermediate result to disk:\n",
"\n",
"```python\n",
"per_frame.to_parquet(\"per_frame_distance.parquet\")\n",
"# next time:\n",
"per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n",
"```\n",
"\n",
"`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
"environment doesn't have it.\n",
"\n",
"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
"correct answer first, optimize only if you find yourself waiting.\n"
]
}
]
}

@ -0,0 +1,15 @@
# Tutorial notebooks

Read these in order:
1. **`00_welcome.ipynb`** — what's the project, where the data lives,
how to use a Jupyter notebook.
2. **`01_python_pandas_basics.ipynb`** — minimum Python and pandas you
need to read project code.
3. **`02_explore_one_database.ipynb`** — open one tracking DB, plot a
trajectory, compute a single distance.
4. **`03_compare_trained_vs_naive.ipynb`** — first real analysis,
   comparing groups.

After these, the notebooks one level up (`flies_analysis*.ipynb`) walk
through the full analysis pipeline that the previous student built.