Add beginner tutorial notebooks for incoming students
Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent
7d09523840
commit
ec56e51bf9
5 changed files with 1607 additions and 0 deletions
255
notebooks/getting_started/00_welcome.ipynb
Normal file
@@ -0,0 +1,255 @@
{
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5,
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 00 \u00b7 Welcome to the Cupido fly-tracking project\n",
|
||||
"\n",
|
||||
"Hi! You're about to start working on a project that studies how *Drosophila*\n",
|
||||
"(fruit flies) form **memories of mating experiences** \u2014 and whether trained\n",
|
||||
"flies behave differently from na\u00efve ones in their later courtship.\n",
|
||||
"\n",
|
||||
"**You don't need any prior experience with Python or data science to follow\n",
|
||||
"along.** This series of notebooks will walk you through everything, one\n",
|
||||
"small step at a time.\n",
|
||||
"\n",
|
||||
"> **How to read these notebooks**: each notebook is split into \"cells\".\n",
|
||||
"> Some cells are explanations (like this one), others are code that you\n",
|
||||
"> can **run** by clicking on the cell and pressing `Shift + Enter`. Try it\n",
|
||||
"> on the next cell.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# This is a code cell. Click on it and press Shift+Enter to run it.\n",
|
||||
"print(\"Hello, fly world!\")\n",
|
||||
"1 + 1\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You should have seen `Hello, fly world!` printed and the number `2`\n",
|
||||
"appear underneath. If something else happened, ask Giorgio \u2014 that's a\n",
|
||||
"sign the environment isn't set up right.\n",
|
||||
"\n",
|
||||
"If this is the very first time you're using JupyterLab, take 10 minutes\n",
|
||||
"to read the [official \"Getting started with JupyterLab\"\n",
|
||||
"guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).\n",
|
||||
"The most important things to know are:\n",
|
||||
"\n",
|
||||
"- A notebook (`.ipynb` file) is a sequence of **cells**.\n",
|
||||
"- Each cell is either **Markdown** (formatted text, like this) or **Code**\n",
|
||||
" (Python that the computer runs).\n",
|
||||
"- The **kernel** is the running Python process behind the notebook. It\n",
|
||||
" remembers everything you've defined. If something gets weird, restart\n",
|
||||
" the kernel: top menu \u2192 *Kernel* \u2192 *Restart Kernel\u2026*.\n",
|
||||
"- `Shift + Enter` runs a cell and moves to the next one.\n",
|
||||
"- `Ctrl + Enter` runs a cell and stays put.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What is the project about?\n",
|
||||
"\n",
|
||||
"Drosophila males court females with a stereotyped sequence (chasing,\n",
|
||||
"wing-extension, tapping). When a male is rejected by a female (e.g.\n",
|
||||
"because she's already mated), he **learns** to suppress his courtship \u2014\n",
|
||||
"even toward new, receptive females, for a while. This is a textbook\n",
|
"example of *associative learning* in invertebrates, known as courtship\n",
"conditioning ([PubMed search](https://pubmed.ncbi.nlm.nih.gov/?term=courtship+conditioning+drosophila)).\n",
|
||||
"\n",
|
||||
"The lab is interested in:\n",
|
||||
"\n",
|
||||
"- Does this learning **transfer across species**? (We have ~7 *Drosophila*\n",
|
||||
" species recorded.)\n",
|
"- How long does the memory last? (See the `training_length_hr` and\n",
"  `consolidation_length_hr` columns in the metadata.)\n",
|
||||
"- Are there **individual differences** \u2014 do some males learn while others\n",
|
||||
" don't? (The \"bimodal hypothesis\" in `docs/bimodal_hypothesis.md`.)\n",
|
||||
"\n",
|
||||
"Your job, broadly, will be to **turn videos of flies into numbers and\n",
|
||||
"plots that answer these questions.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## How an experiment works (the bird's-eye view)\n",
|
||||
"\n",
|
||||
"1. **Training**: a male fly is placed with a non-receptive (mated) female.\n",
|
||||
" He courts, gets rejected, eventually gives up.\n",
|
||||
"2. *Wait* for some hours (the \"consolidation\" period \u2014 gives memory time\n",
|
||||
" to form).\n",
|
||||
"3. **Testing**: same male is placed with a fresh receptive female.\n",
|
||||
" Does he court her vigorously, or has he learned to give up easily?\n",
|
||||
"\n",
|
||||
"Each experiment runs in an **HD mating arena** \u2014 a small chamber with\n",
|
||||
"6 sub-arenas (we call them **ROIs**, for \"regions of interest\"). Each ROI\n",
|
||||
"contains one couple (a male and a female). A camera films the whole arena\n",
|
||||
"from above. So one **video** gives us 6 simultaneous experiments.\n",
|
||||
"\n",
|
||||
"The setup uses [Ethoscopes](https://www.ethoscope.com/) \u2014 open-source\n",
|
||||
"behavioural recording boxes built in this lab. Each ethoscope is a\n",
|
||||
"machine; we have 16 in total, named `ETHOSCOPE_067`, `ETHOSCOPE_076`, etc.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What does the data look like?\n",
|
||||
"\n",
|
||||
"For each video, the **tracker** (a piece of software that runs after the\n",
|
||||
"recording) finds the flies frame-by-frame and writes their positions to a\n",
|
||||
"**SQLite database** (a single file, ending in `.db`). One DB per video.\n",
|
||||
"Inside each DB there are 6 tables called `ROI_1`, `ROI_2`, \u2026, `ROI_6` \u2014\n",
|
||||
"one per sub-arena. Each row of an ROI table is **one fly detection at one\n",
|
||||
"moment in time** with these columns:\n",
|
||||
"\n",
|
||||
"| column | meaning |\n",
|
||||
"|---|---|\n",
|
||||
"| `id` | row number (auto-incremented) |\n",
|
||||
"| `t` | time in **milliseconds** since the video started |\n",
|
||||
"| `x`, `y` | fly position in **pixels** (top-left corner of the image is 0,0) |\n",
|
||||
"| `w`, `h` | width and height of the bounding box around the fly, in pixels |\n",
|
||||
"| `phi` | orientation angle of the fly |\n",
|
||||
"| `is_inferred` | 1 if the position was guessed (not directly seen), 0 otherwise |\n",
|
||||
"| `has_interacted` | (legacy column, mostly unused) |\n",
|
||||
"\n",
|
||||
"If a single ROI has two flies that the tracker can see, you'll get **two\n",
|
||||
"rows with the same `t`** \u2014 one for each fly. If only one fly is detected\n",
|
||||
"(maybe they're on top of each other), you'll get one row.\n",
|
||||
"\n",
|
||||
"That's the heart of the data. Everything else (distances, velocities,\n",
|
||||
"group comparisons) is computed from these (t, x, y) traces.\n"
|
||||
]
|
||||
},
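{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the \"two rows with the same `t`\" idea concrete, here is a minimal\n",
"sketch on a **made-up** table. The values are invented for illustration and\n",
"do not come from a real tracking DB; the only assumption is that `pandas`\n",
"is installed in this environment. It counts how many detections share each\n",
"timestamp.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented rows that mimic the ROI table layout described above.\n",
"toy = pd.DataFrame({\n",
"    \"t\": [0, 0, 40, 40, 80],          # milliseconds\n",
"    \"x\": [100, 220, 102, 218, 150],   # pixels\n",
"    \"y\": [50, 60, 51, 61, 55],        # pixels\n",
"})\n",
"\n",
"# Rows per timestamp: 2 means both flies were detected, 1 means only one.\n",
"detections_per_t = toy.groupby(\"t\").size()\n",
"print(detections_per_t)\n"
]
},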
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Where everything lives\n",
|
||||
"\n",
|
||||
"Take a moment to memorize these locations \u2014 you'll come back to them often.\n",
|
||||
"\n",
|
||||
"| what | where |\n",
|
||||
"|---|---|\n",
|
||||
"| Tracking DBs (SQLite, one per video) | `/mnt/data/projects/cupido/tracked/` |\n",
|
||||
"| Target JSONs (the user-clicked reference points) | `/mnt/data/projects/cupido/targets/` |\n",
|
||||
"| Source video files | `/mnt/ethoscope_data/videos/` |\n",
|
||||
"| Project code (this repo) | `/home/gg/ownCloud/Work/Projects/coding/cupido/tracking/` |\n",
|
||||
"| The metadata table (xlsx + TSV) | `/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv` |\n",
|
||||
"| Your notebooks | `notebooks/getting_started/` (this folder) |\n",
|
||||
"\n",
|
||||
"Let's verify a couple of these from inside Python:\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"tracked = Path(\"/mnt/data/projects/cupido/tracked\")\n",
|
||||
"targets = Path(\"/mnt/data/projects/cupido/targets\")\n",
|
||||
"\n",
|
||||
"n_dbs = len(list(tracked.glob(\"*_tracking.db\")))\n",
|
||||
"n_jsons = len(list(targets.glob(\"*.json\")))\n",
|
||||
"\n",
|
||||
"print(f\"Tracking DBs available: {n_dbs}\")\n",
|
||||
"print(f\"Target JSONs available: {n_jsons}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You should see roughly 113 tracking DBs and 130 target JSONs. If those\n",
|
||||
"numbers are zero, the storage volume isn't mounted \u2014 ask Giorgio.\n",
|
||||
"\n",
|
||||
"> **Note**: the tracking DBs are read-only inside the JupyterLab\n",
|
||||
"> container. You can read them but not modify or delete them. That's a\n",
|
||||
"> deliberate safety measure \u2014 we don't want analysis code accidentally\n",
|
||||
"> corrupting the source data.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Glossary (refer back as needed)\n",
|
||||
"\n",
|
||||
"- **ROI** \u2014 *region of interest*. One sub-arena inside the HD mating\n",
|
||||
" arena. There are 6 ROIs per video, numbered 1\u20136.\n",
|
||||
"- **fly** \u2014 one detection in a single (t, ROI) cell. Two flies in the\n",
|
||||
" same ROI at the same time = two rows with the same `t`.\n",
|
||||
"- **trained** \u2014 the male had a training session before testing.\n",
|
||||
"- **naive** \u2014 the male is a control (no training).\n",
|
||||
"- **training session** \u2014 the recording where the male meets the\n",
|
||||
" non-receptive female (he gets rejected).\n",
|
||||
"- **testing session** \u2014 the recording where the male meets a fresh\n",
|
||||
" receptive female (we measure his courtship).\n",
|
||||
"- **t (milliseconds)** \u2014 time within one session, starting at 0.\n",
|
||||
"- **(x, y) pixels** \u2014 fly position in the image. Top-left is (0, 0); x\n",
|
||||
" grows to the right, y grows **downward** (this is the image-coordinate\n",
|
||||
" convention, opposite of math class).\n",
|
||||
"- **machine_name** \u2014 which ethoscope recorded the video, e.g.\n",
|
||||
" `ETHOSCOPE_076`.\n",
|
||||
"- **species** \u2014 `Melanogaster/CS`, `Sechellia`, `Simulans`, `Yakuba`,\n",
|
||||
" `Erecta`, `Willistoni`, or `CS`.\n",
|
||||
"\n",
|
||||
"If you bump into other terms in the code, ask. Don't guess \u2014 biology\n",
|
||||
"codebases pick up jargon over the years.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What's next\n",
|
||||
"\n",
|
||||
"When you're ready, open these notebooks **in order**:\n",
|
||||
"\n",
|
||||
"1. `01_python_pandas_basics.ipynb` \u2014 just enough Python and pandas to\n",
|
||||
" read and manipulate tabular data.\n",
|
||||
"2. `02_explore_one_database.ipynb` \u2014 open one tracking DB, plot a fly's\n",
|
||||
" trajectory, see what the numbers actually look like.\n",
|
||||
"3. `03_compare_trained_vs_naive.ipynb` \u2014 your first real analysis,\n",
|
||||
" comparing groups of flies.\n",
|
||||
"\n",
|
||||
"After those, the notebooks one level up (`flies_analysis.ipynb`,\n",
|
||||
"`flies_analysis_simple.ipynb`) contain the analysis pipeline that the\n",
|
||||
"previous student built \u2014 those will make sense once you've worked\n",
|
||||
"through the tutorials.\n",
|
||||
"\n",
|
||||
"Don't try to power through all of them in one sitting. Run a few cells,\n",
|
||||
"read the explanation, **change a number** to see what happens, **break\n",
|
||||
"something on purpose** to see the error message. That's how you learn.\n"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
500
notebooks/getting_started/01_python_pandas_basics.ipynb
Normal file
@@ -0,0 +1,500 @@
{
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5,
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
|
||||
"\n",
|
||||
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
|
||||
"the rest of the project's code and write your own analyses.\n",
|
||||
"\n",
|
||||
"If you've never programmed before, don't try to memorize the syntax.\n",
|
||||
"Just run each cell, read what it does, and come back when you're stuck on\n",
|
||||
"something specific. The cheat sheet at the end is the only thing worth\n",
|
||||
"keeping handy.\n",
|
||||
"\n",
|
||||
"External resources, in order of how much time they take:\n",
|
||||
"\n",
|
||||
"- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
|
||||
"- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
|
||||
"- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
|
||||
"- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. Variables\n",
|
||||
"\n",
|
||||
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
|
||||
"not equality. Read it as \"make `name` refer to `value`\".\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"x = 5\n",
|
||||
"y = 3\n",
|
||||
"total = x + y\n",
|
||||
"print(total)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
|
||||
"answer. Try it.\n",
|
||||
"\n",
|
"Variable names can contain letters, digits, and underscores, but can't\n",
"start with a digit. The convention is lowercase `snake_case`:\n",
"`mean_distance`, not `meanDistance` or `MeanDistance`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. Strings and numbers\n",
|
||||
"\n",
|
||||
"A **string** is text in quotes. You can join strings with `+`. You can\n",
|
||||
"turn a number into a string with `str()`, and vice-versa with `int()` /\n",
|
||||
"`float()`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"species = \"Drosophila melanogaster\"\n",
|
||||
"n_flies = 12\n",
|
||||
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
|
||||
"print(message)\n",
|
||||
"\n",
|
||||
"# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
|
||||
"print(f\"We tracked {n_flies} {species} males.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Lists\n",
|
||||
"\n",
|
||||
"A list is an ordered collection of things. Square brackets, items\n",
|
||||
"separated by commas. You can mix types (but usually shouldn't).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
|
||||
"print(machines[0]) # first item \u2014 Python counts from 0!\n",
|
||||
"print(machines[-1]) # last item\n",
|
||||
"print(len(machines)) # how many items\n",
|
||||
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Dictionaries\n",
|
||||
"\n",
|
||||
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
|
||||
"pairs.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
|
||||
"print(fly[\"species\"])\n",
|
||||
"print(fly[\"age_days\"])\n",
|
||||
"fly[\"alive\"] = False # add a new key\n",
|
||||
"print(fly)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Conditions: if / elif / else\n",
|
||||
"\n",
|
||||
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
|
||||
"Combine with `and`, `or`, `not`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"distance_px = 42\n",
|
||||
"\n",
|
"if distance_px < 50:\n",
"    label = \"close\"\n",
"elif distance_px < 200:\n",
"    label = \"medium\"\n",
"else:\n",
"    label = \"far\"\n",
|
||||
"\n",
|
||||
"print(label)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Loops\n",
|
||||
"\n",
|
||||
"`for x in collection:` runs the indented block once per item.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
"for m in machines:\n",
"    print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
"    print(f\"{i}: {m}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 7. Functions\n",
|
||||
"\n",
|
||||
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
|
||||
"sends a value back to whoever called it.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
"def fly_age_in_weeks(days):\n",
"    \"\"\"Return age in weeks given age in days.\"\"\"\n",
"    return days / 7\n",
"\n",
"print(fly_age_in_weeks(14))  # 2.0\n",
"print(fly_age_in_weeks(5))   # 0.714\u2026\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 8. Importing libraries\n",
|
||||
"\n",
|
||||
"A library is somebody else's code. We use `import` to pull it into our\n",
|
||||
"notebook.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import math\n",
|
||||
"print(math.sqrt(16)) # 4.0\n",
|
||||
"print(math.pi)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 9. Meet pandas\n",
|
||||
"\n",
|
||||
"Real data is rarely a single number \u2014 it's a **table** with rows and\n",
|
||||
"columns (think Excel). `pandas` is the library that handles tables in\n",
|
||||
"Python. The two main objects are:\n",
|
||||
"\n",
|
||||
"- **`Series`** \u2014 a single column with a name.\n",
|
||||
"- **`DataFrame`** \u2014 a whole table.\n",
|
||||
"\n",
|
||||
"By convention we import pandas as `pd`. Always.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"# Read the project's metadata TSV (Tab-Separated Values).\n",
|
||||
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
|
||||
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
|
||||
"\n",
|
||||
"# How big is it?\n",
|
||||
"print(f\"Rows: {len(df)}\")\n",
|
||||
"print(f\"Columns: {df.shape[1]}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 10. Looking at the table\n",
|
||||
"\n",
|
||||
"`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n",
|
||||
"column names. `.dtypes` shows the type of each column.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df.head(3)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Column names:\")\n",
|
"for c in df.columns:\n",
"    print(f\" {c}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 11. Selecting columns\n",
|
||||
"\n",
|
"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name is a valid Python identifier (no spaces or\n",
"special characters) and doesn't clash with an existing DataFrame attribute\n",
"such as `size` or `count`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df[\"species\"].head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df.species.value_counts() # how many rows per species\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 12. Selecting multiple columns\n",
|
||||
"\n",
|
||||
"Pass a **list** of names inside the brackets:\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 13. Filtering rows\n",
|
||||
"\n",
|
||||
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
|
||||
"Pandas keeps the rows where it's `True`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"trained = df[df[\"male\"] == \"trained\"]\n",
|
||||
"print(f\"trained rows: {len(trained)}\")\n",
|
||||
"\n",
|
||||
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
|
||||
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
|
||||
"\n",
|
||||
"# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n",
|
||||
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
|
||||
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 14. Grouping and counting\n",
|
||||
"\n",
|
||||
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
|
||||
"splits the table by the values in that column and computes something per\n",
|
||||
"group.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# How many ROIs per (species, training condition)?\n",
|
||||
"df.groupby([\"species\", \"male\"]).size()\n"
|
||||
]
|
||||
},
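{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only counts rows. As a hedged illustration of the other\n",
"aggregator mentioned, `.mean()`, here is a small sketch on a **made-up**\n",
"table. The `distance_px` column and its values are hypothetical and are not\n",
"part of the project metadata.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented measurements, two per group, purely for illustration.\n",
"toy = pd.DataFrame({\n",
"    \"male\": [\"trained\", \"trained\", \"naive\", \"naive\"],\n",
"    \"distance_px\": [120.0, 80.0, 40.0, 60.0],\n",
"})\n",
"\n",
"# Select one column after grouping, then average it within each group.\n",
"# naive -> 50.0, trained -> 100.0\n",
"toy.groupby(\"male\")[\"distance_px\"].mean()\n"
]
},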
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 15. Quick plots\n",
|
||||
"\n",
|
||||
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"# How many rows per machine?\n",
|
||||
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
|
||||
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
|
||||
"plt.ylabel(\"rows\")\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 16. Exercises\n",
|
||||
"\n",
|
||||
"Don't skip these. They're how you find out what you actually understood.\n",
|
||||
"\n",
|
||||
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
|
||||
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
|
||||
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
|
||||
" (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
|
||||
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try exercise 1 here\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try exercise 2 here\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try exercise 3 here\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try exercise 4 here\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Cheat sheet\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"import pandas as pd\n",
|
||||
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n",
|
||||
"df.head(); df.tail(); df.shape; df.columns # peek\n",
|
||||
"df[\"col\"]; df[[\"a\", \"b\"]] # select\n",
|
||||
"df[df[\"col\"] == \"value\"] # filter\n",
|
||||
"df.groupby(\"col\").size() # count per group\n",
|
||||
"df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n",
|
||||
"df[\"col\"].value_counts() # quick counts\n",
|
||||
"df[\"col\"].unique() # unique values\n",
|
||||
"df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n",
|
||||
"df.sort_values(\"col\", ascending=False) # sort\n",
|
||||
"df.plot(...) # quick plot\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Keep this list open when reading other people's code. Most of pandas is\n",
|
||||
"just combinations of these primitives. When you need more, the official\n",
|
||||
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
|
||||
"is excellent.\n"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
439
notebooks/getting_started/02_explore_one_database.ipynb
Normal file
@@ -0,0 +1,439 @@
{
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5,
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 02 \u00b7 A first look at one tracking database\n",
|
||||
"\n",
|
||||
"In this notebook we open **one** of the SQLite databases that the tracker\n",
|
||||
"produced and look at what's actually inside. By the end you'll be able to:\n",
|
||||
"\n",
|
||||
"- list the tables in a `.db` file\n",
|
||||
"- read one ROI's tracking trace into a DataFrame\n",
|
||||
"- plot a fly's path through the arena\n",
|
||||
"- count how many flies are visible at each moment\n",
|
||||
"- compute a simple distance between the two flies in a ROI\n",
|
||||
"\n",
|
||||
"If you're curious how SQLite works, the\n",
|
||||
"[SQLite Quickstart](https://www.sqlite.org/quickstart.html) is short and\n",
|
"worth reading. For our purposes, **a SQLite database is just a single file\n",
"holding several tables, each of which we can load into a DataFrame**.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"\n",
|
||||
"We import the libraries we need. `sqlite3` is part of Python's standard\n",
|
||||
"library \u2014 no install needed.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import sqlite3\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import numpy as np\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Find the databases\n",
|
||||
"\n",
|
||||
"The DBs live at `/mnt/data/projects/cupido/tracked/`. Let's list a few.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tracked_dir = Path(\"/mnt/data/projects/cupido/tracked\")\n",
|
||||
"db_files = sorted(tracked_dir.glob(\"*_tracking.db\"))\n",
|
||||
"\n",
|
||||
"print(f\"Found {len(db_files)} tracking DBs.\")\n",
|
||||
"print(\"\\nFirst 5 by name:\")\n",
|
"for db in db_files[:5]:\n",
"    print(f\" {db.name}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The filename encodes the date, time, machine UUID, video resolution, and\n",
|
||||
"the suffix `_tracking.db`. For example:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\n",
|
"\u2514\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u252c\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
"   date      time            machine UUID                video format\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Pick one to explore. Feel free to change the index.\n"
|
||||
]
|
||||
},
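{
"cell_type": "markdown",
"metadata": {},
"source": [
"The layout above can be unpacked with plain string methods. The following\n",
"is only a sketch: it assumes every filename follows the pattern of the\n",
"example (single underscores between date, time and UUID, and a double\n",
"underscore before the video format). `parse_db_name` is a hypothetical\n",
"helper written for this tutorial, not part of the project code.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"def parse_db_name(name):\n",
"    \"\"\"Split a tracking-DB filename into its parts (hypothetical helper).\"\"\"\n",
"    # First two underscores separate the date and the time from the rest.\n",
"    date_str, time_str, rest = name.split(\"_\", 2)\n",
"    # A double underscore separates the machine UUID from the video format.\n",
"    machine_uuid, tail = rest.split(\"__\", 1)\n",
"    video_format = tail.split(\"_\", 1)[0]\n",
"    recorded_at = datetime.strptime(f\"{date_str} {time_str}\", \"%Y-%m-%d %H-%M-%S\")\n",
"    return {\n",
"        \"recorded_at\": recorded_at,\n",
"        \"machine_uuid\": machine_uuid,\n",
"        \"video_format\": video_format,\n",
"    }\n",
"\n",
"example = \"2024-09-17_10-32-10_076e2825a7274661bd0697c42d6fa4c0__1920x1088@25fps-28q_merged_tracking.db\"\n",
"parse_db_name(example)\n"
]
},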
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"db_path = db_files[0]\n",
|
||||
"print(\"Working with:\", db_path.name)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Open the database\n",
|
||||
"\n",
|
"We open it **read-only** as a safety measure. The `?mode=ro` query\n",
"parameter is SQLite's read-only switch; `uri=True` tells Python to treat\n",
"the path as a URI so that the parameter takes effect.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n"
|
||||
]
|
||||
},
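To see what `mode=ro` buys you, here is a small self-contained sketch. It uses a throwaway database in a temporary directory, not a real tracking DB, and the table name `demo` is made up. Reads succeed, while a write attempt raises `sqlite3.OperationalError`.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.db"

    # Create a throwaway database with one row.
    rw = sqlite3.connect(demo_path)
    rw.execute("CREATE TABLE demo (t INTEGER, x REAL)")
    rw.execute("INSERT INTO demo VALUES (0, 1.5)")
    rw.commit()
    rw.close()

    # Re-open it read-only, the same way the notebook does.
    ro = sqlite3.connect(f"file:{demo_path}?mode=ro", uri=True)
    rows = ro.execute("SELECT t, x FROM demo").fetchall()

    try:
        ro.execute("INSERT INTO demo VALUES (1, 2.5)")
        write_blocked = False
    except sqlite3.OperationalError as err:
        write_blocked = True
        print("write refused:", err)
    finally:
        ro.close()

print(rows)
```

The point of the design is that a typo in a query can never damage the shared tracking data.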
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What tables are inside?\n",
|
||||
"\n",
|
||||
"Every SQLite database has a system table called `sqlite_master` that\n",
|
||||
"lists everything. We can query it like any other table.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tables = pd.read_sql_query(\n",
|
||||
" \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n",
|
||||
")\n",
|
||||
"tables\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You should see tables like `ROI_1`, `ROI_2`, \u2026, `ROI_6` (one per\n",
|
||||
"sub-arena), plus housekeeping tables like `METADATA`, `ROI_MAP`,\n",
|
||||
"`VAR_MAP`, `START_EVENTS`. We mostly care about the `ROI_*` ones.\n",
|
||||
"\n",
|
||||
"## Read one ROI\n",
|
||||
"\n",
|
||||
"`pd.read_sql_query()` runs an SQL query against the connection and\n",
|
||||
"returns a DataFrame. The query `SELECT * FROM ROI_1` means *\"give me all\n",
|
||||
"columns and all rows from the table called ROI_1\"*.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"roi1 = pd.read_sql_query(\"SELECT * FROM ROI_1\", conn)\n",
|
||||
"print(f\"shape: {roi1.shape}\") # (rows, columns)\n",
|
||||
"roi1.head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Understanding the columns\n",
|
||||
"\n",
|
||||
"Refer back to notebook `00_welcome` for the full column reference. Quick\n",
|
||||
"recap of the important ones:\n",
|
||||
"\n",
|
||||
"- `t`: time in **milliseconds** since the video started.\n",
|
||||
"- `x`, `y`: fly position in **pixels**. The image origin (0, 0) is the\n",
|
||||
" **top-left** corner. y grows downward.\n",
|
||||
"- `w`, `h`: bounding-box width/height. Their product (`area = w*h`) is a\n",
|
||||
" rough proxy for \"how big does this blob look\" \u2014 useful for spotting\n",
|
||||
" frames where the tracker merged two flies into one big detection.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Quick descriptive stats\n",
|
||||
"roi1[[\"t\", \"x\", \"y\", \"w\", \"h\"]].describe()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The minimum `t` should be 0 (start of the video). The maximum tells you\n",
|
||||
"how long the recording was. Convert ms to minutes by dividing by 60000:\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"duration_min = roi1[\"t\"].max() / 60_000\n",
|
||||
"print(f\"Session length: {duration_min:.1f} minutes\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## How many flies per frame?\n",
|
||||
"\n",
|
||||
"If two flies are visible in this ROI, we get **two rows per `t`**. Let's\n",
|
||||
"check.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"flies_per_frame = roi1.groupby(\"t\").size()\n",
|
||||
"print(flies_per_frame.value_counts().sort_index())\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The output tells you, e.g., \"100,000 frames had 2 flies visible, 30,000\n",
|
||||
"had 1 fly visible\". Frames with 1 fly usually mean the two flies are\n",
|
||||
"overlapping or one is occluded \u2014 that's something we'll handle properly\n",
|
||||
"in the next notebook.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Plot one fly's trajectory\n",
|
||||
"\n",
"We'll plot the position over the first 5 minutes (300 000 ms). At each\n",
"time point we keep the **first** detection (sorted by `id`) and call it\n",
"\"fly 1\". Frames with a single detection contribute that one row. This is\n",
"a rough heuristic; identity tracking is harder than it sounds.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Filter to the first 5 minutes\n",
|
||||
"sub = roi1[roi1[\"t\"] <= 5 * 60_000]\n",
|
||||
"\n",
|
||||
"# Pick \"fly 1\" by taking the first row at each time point\n",
|
||||
"fly1 = sub.sort_values([\"t\", \"id\"]).drop_duplicates(\"t\", keep=\"first\")\n",
|
||||
"\n",
|
||||
"plt.figure(figsize=(6, 5))\n",
|
||||
"plt.plot(fly1[\"x\"], fly1[\"y\"], color=\"steelblue\", linewidth=0.5, alpha=0.7)\n",
|
||||
"plt.scatter(fly1[\"x\"].iloc[0], fly1[\"y\"].iloc[0], color=\"green\", label=\"start\", zorder=5)\n",
|
||||
"plt.scatter(fly1[\"x\"].iloc[-1], fly1[\"y\"].iloc[-1], color=\"red\", label=\"end\", zorder=5)\n",
|
||||
"plt.gca().invert_yaxis() # because pixel y grows downward\n",
|
||||
"plt.xlabel(\"x (pixels)\")\n",
|
||||
"plt.ylabel(\"y (pixels)\")\n",
|
||||
"plt.title(f\"Fly 1 trajectory \u2014 first 5 min \u2014 {db_path.name[:30]}\u2026\")\n",
|
||||
"plt.legend()\n",
|
||||
"plt.axis(\"equal\")\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You should see a tangle of lines confined to a roughly rectangular ROI.\n",
|
||||
"That tangle is the fly walking around its sub-arena.\n",
|
||||
"\n",
|
||||
"Notice we did `plt.gca().invert_yaxis()` \u2014 that's because in image\n",
|
||||
"coordinates y grows downward, but humans expect plots where y grows\n",
|
||||
"upward. Without it the plot would be vertically flipped.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Plot position over time\n",
|
||||
"\n",
|
||||
"A trajectory plot collapses time into \"shape on a page\". To see *when*\n",
|
||||
"things happen we need time on the x-axis.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)\n",
|
||||
"\n",
|
||||
"axes[0].plot(fly1[\"t\"] / 1000, fly1[\"x\"], linewidth=0.5)\n",
|
||||
"axes[0].set_ylabel(\"x (px)\")\n",
|
||||
"axes[0].set_title(f\"Fly 1, ROI 1, {db_path.name[:30]}\u2026\")\n",
|
||||
"\n",
|
||||
"axes[1].plot(fly1[\"t\"] / 1000, fly1[\"y\"], linewidth=0.5, color=\"darkorange\")\n",
|
||||
"axes[1].set_ylabel(\"y (px)\")\n",
|
||||
"axes[1].set_xlabel(\"time (s)\")\n",
|
||||
"axes[1].invert_yaxis()\n",
|
||||
"\n",
|
||||
"plt.tight_layout()\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Bursts of variation = active fly. Long flat stretches = the fly is sitting\n",
|
||||
"still. You'll come to recognize courtship vs idling by eye after a while.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Distance between the two flies\n",
|
||||
"\n",
|
||||
"Whenever the ROI has 2 detections at the same `t`, we can compute the\n",
|
||||
"Euclidean distance between them: `sqrt((x1-x2)\u00b2 + (y1-y2)\u00b2)`.\n"
|
||||
]
|
||||
},
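As a quick worked example of the formula above, take two made-up detections that are 3 px apart in x and 4 px apart in y. The coordinates are invented purely for the arithmetic; the explicit square-root formula and the `hypot` shortcut should agree.

```python
import math

# Two made-up detections at the same time point (pixel coordinates).
x1, y1 = 100.0, 200.0
x2, y2 = 103.0, 204.0

# Explicit formula and the library shortcut give the same value.
by_formula = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
by_hypot = math.hypot(x1 - x2, y1 - y2)
print(by_formula, by_hypot)
```

`np.hypot` does the same thing for whole columns at once, which is why the notebook uses it on the DataFrame.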
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"two_fly_frames = roi1.groupby(\"t\").filter(lambda g: len(g) == 2)\n",
|
||||
"two_fly_frames = two_fly_frames.sort_values([\"t\", \"id\"])\n",
|
||||
"\n",
|
||||
"# Pivot so each row is one timepoint with x1, y1, x2, y2\n",
|
||||
"def pair_up(g):\n",
"    g = g.reset_index(drop=True)\n",
"    return pd.Series({\n",
"        \"x1\": g.loc[0, \"x\"], \"y1\": g.loc[0, \"y\"],\n",
"        \"x2\": g.loc[1, \"x\"], \"y2\": g.loc[1, \"y\"],\n",
"    })\n",
|
||||
"\n",
|
||||
"paired = two_fly_frames.groupby(\"t\").apply(pair_up).reset_index()\n",
|
||||
"paired[\"distance_px\"] = np.hypot(paired[\"x1\"] - paired[\"x2\"], paired[\"y1\"] - paired[\"y2\"])\n",
|
||||
"paired.head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plt.figure(figsize=(12, 4))\n",
|
||||
"plt.plot(paired[\"t\"] / 1000, paired[\"distance_px\"], linewidth=0.4)\n",
|
||||
"plt.xlabel(\"time (s)\")\n",
|
||||
"plt.ylabel(\"inter-fly distance (px)\")\n",
|
||||
"plt.title(\"Distance between the two flies in ROI 1\")\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is the kind of trace that drives the rest of the analysis: a male\n",
|
||||
"courting a female stays close (small distance); a male giving up wanders\n",
|
||||
"off (large distance). The shape of this curve is the behavioural readout.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Don't forget to close the connection\n",
|
||||
"\n",
|
||||
"If you opened a connection, close it when you're done. (Not strictly\n",
|
||||
"necessary in a notebook \u2014 Python tidies up \u2014 but a good habit.)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"conn.close()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercises\n",
|
||||
"\n",
|
||||
"1. Pick a different DB (change `db_files[0]` to `db_files[10]` for example)\n",
|
||||
" and re-run the trajectory plot. Is the arena bigger / smaller? Why\n",
|
||||
" might that be? (Hint: look at the resolution part of the filename.)\n",
|
||||
"2. Plot the distance trace for **ROI 4** instead of ROI 1.\n",
|
||||
"3. Compute the **percentage of frames** in ROI 1 that had only 1 fly visible.\n",
|
||||
"4. The `area = w * h` column is a useful diagnostic. Plot `area` vs `t`\n",
|
||||
" for fly 1 \u2014 when does the bounding box get unusually large?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Exercise space\n"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
398
notebooks/getting_started/03_compare_trained_vs_naive.ipynb
Normal file
|
|
@ -0,0 +1,398 @@
|
|||
{
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5,
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 03 \u00b7 Your first real analysis: trained vs naive\n",
|
||||
"\n",
|
||||
"In notebook 02 we explored a single database. Now we'll work with **all\n",
|
||||
"of them at once**, compute a simple per-fly metric, and ask the central\n",
|
||||
"question of the project:\n",
|
||||
"\n",
|
||||
"> **Do trained males behave differently from na\u00efve males in the testing\n",
|
||||
"> session?**\n",
|
||||
"\n",
|
||||
"By the end you'll have:\n",
|
||||
"\n",
|
||||
"- loaded every (fly, session) trace into one big DataFrame using the\n",
|
||||
" project's helper function;\n",
|
||||
"- reduced each trace to one number per fly (the *median inter-fly\n",
|
||||
" distance*);\n",
|
||||
"- compared the trained group against the na\u00efve group with a histogram\n",
|
||||
" and a non-parametric statistical test;\n",
|
||||
"- learnt enough to start asking your own questions.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"import numpy as np\n",
|
||||
"import pandas as pd\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"from scipy import stats\n",
|
||||
"\n",
|
||||
"# Tell Python where to find the project's helper modules.\n",
|
||||
"PROJECT_ROOT = Path(\"..\").resolve().parent # this notebook is in notebooks/getting_started/\n",
|
||||
"sys.path.insert(0, str(PROJECT_ROOT / \"scripts\"))\n",
|
||||
"\n",
|
||||
"from load_roi_data import load_roi_data\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading everything at once \u2014 but carefully\n",
|
||||
"\n",
|
||||
"`load_roi_data()` opens every tracking DB referenced by the metadata TSV\n",
|
||||
"and returns one big DataFrame. **It can be slow and memory-hungry**\n",
|
||||
"(the full batch is ~200 million rows). Always start small.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load the metadata TSV first \u2014 it's small and fast.\n",
|
||||
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
|
||||
"meta = pd.read_csv(tsv_path, sep=\"\\t\")\n",
|
||||
"print(f\"metadata rows: {len(meta)}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"Pre-filter the metadata before passing it to `load_roi_data`. We'll start\n",
"with **just one species**, then keep **just the testing sessions** after loading, because:\n",
|
||||
"\n",
|
||||
"1. mixing species is a confound (different species behave differently);\n",
|
||||
"2. the question is about behaviour after training, so the testing session\n",
|
||||
" is the relevant one;\n",
|
||||
"3. starting small means we can iterate quickly.\n",
|
||||
"\n",
|
||||
"You can come back later and broaden this filter.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Pick one species. 'Melanogaster/CS' has the most rows (127), so a good default.\n",
|
||||
"sub = meta[meta[\"species\"] == \"Melanogaster/CS\"].copy()\n",
|
||||
"\n",
|
||||
"# We're loading every session for these flies, but the loader stamps each\n",
|
||||
"# row with a 'session' column so we can filter to testing afterwards.\n",
|
||||
"print(f\"selected metadata rows: {len(sub)}\")\n",
|
||||
"print(sub[\"male\"].value_counts())\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# This will take a minute or two and use a chunk of RAM. Be patient.\n",
|
||||
"data = load_roi_data(sub)\n",
|
||||
"print(f\"loaded shape: {data.shape}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What did we get?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data.head(3)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# How big is each session, in tracking samples?\n",
|
||||
"data.groupby([\"session\", \"male\"]).size().unstack(fill_value=0)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Restrict to the testing session\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"testing = data[data[\"session\"] == \"testing\"].copy()\n",
|
||||
"print(f\"testing samples: {len(testing):,}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Reduce each trace to one number\n",
|
||||
"\n",
"Right now each fly contributes **tens of thousands** of (t, x, y) rows.\n",
"Frames from the same fly are not independent observations, so comparing\n",
"millions of raw points across two groups would overstate the sample size.\n",
"Instead we **collapse each (date, machine_name, ROI) trace into a single\n",
"summary number**: the median distance between the two flies during\n",
"testing.\n",
|
||||
"\n",
|
||||
"Why median rather than mean? Because tracker glitches (one fly\n",
|
||||
"temporarily lost) can produce huge spikes that the median ignores.\n",
|
||||
"[Why medians beat means in noisy data\n",
|
||||
"(2-min read)](https://en.wikipedia.org/wiki/Median#Robustness).\n"
|
||||
]
|
||||
},
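A tiny illustration of why the median is the safer summary here. The numbers are made up; the 900 stands in for a single tracker glitch among otherwise ordinary distances.

```python
from statistics import mean, median

# Made-up inter-fly distances (px); the 900 stands in for a tracker glitch.
distances = [42, 40, 45, 41, 900, 43, 44]

print(f"mean:   {mean(distances):.1f} px")
print(f"median: {median(distances):.1f} px")
```

One bad frame drags the mean far away from the typical value, while the median barely notices it.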
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Step 1 \u2014 per-frame distance.\n",
|
||||
"# Take only frames with exactly 2 flies (so we have a real distance).\n",
|
||||
"two_fly = testing.groupby([\"date\", \"machine_name\", \"ROI\", \"t\"]).filter(lambda g: len(g) == 2)\n",
|
||||
"\n",
|
||||
"# For each (track, t), compute the distance between the two rows.\n",
|
||||
"def distance_for_frame(g):\n",
"    g = g.sort_values(\"id\").reset_index(drop=True)\n",
"    return np.hypot(g.loc[0, \"x\"] - g.loc[1, \"x\"], g.loc[0, \"y\"] - g.loc[1, \"y\"])\n",
"\n",
"# This is the slow step. With ~3 M frames it takes a while.\n",
"per_frame = (\n",
"    two_fly\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"t\", \"male\"])\n",
"    .apply(distance_for_frame)\n",
"    .reset_index(name=\"distance_px\")\n",
")\n",
|
||||
"print(f\"per-frame distance rows: {len(per_frame):,}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Step 2 \u2014 one number per (date, machine_name, ROI).\n",
|
||||
"per_fly = (\n",
"    per_frame\n",
"    .groupby([\"date\", \"machine_name\", \"ROI\", \"male\"])[\"distance_px\"]\n",
"    .median()\n",
"    .reset_index(name=\"median_distance_px\")\n",
")\n",
"\n",
"# Each row is now \"one fly during testing\", with its median distance.\n",
|
||||
"print(per_fly.shape)\n",
|
||||
"per_fly.head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Sanity check: how many flies per group?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"per_fly[\"male\"].value_counts()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"If one group is much smaller than the other, the smaller group limits the\n",
"statistical power of the comparison. Note both counts down.\n",
|
||||
"\n",
|
||||
"## Plot the distributions\n",
|
||||
"\n",
|
||||
"The first thing to do with two groups is to **look at them**. Don't trust\n",
|
||||
"a p-value before you've seen the histogram.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fig, ax = plt.subplots(figsize=(10, 5))\n",
|
||||
"\n",
|
||||
"bins = np.linspace(0, per_fly[\"median_distance_px\"].max(), 40)\n",
|
||||
"\n",
|
||||
"for label, color in [(\"trained\", \"steelblue\"), (\"naive\", \"darkorange\")]:\n",
"    # Use a fresh name so the metadata table `sub` from earlier is not overwritten.\n",
"    vals = per_fly[per_fly[\"male\"] == label][\"median_distance_px\"]\n",
"    ax.hist(vals, bins=bins, alpha=0.6, label=f\"{label} (n={len(vals)})\", color=color)\n",
|
||||
"\n",
|
||||
"ax.set_xlabel(\"median inter-fly distance during testing (px)\")\n",
|
||||
"ax.set_ylabel(\"number of flies\")\n",
|
||||
"ax.set_title(\"Trained vs na\u00efve \u2014 Melanogaster/CS \u2014 testing session\")\n",
|
||||
"ax.legend()\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**What you might see:**\n",
|
||||
"\n",
|
||||
"- If the trained group's distribution is shifted to **higher** distances,\n",
|
||||
" trained males are spending less time near the female (i.e. they\n",
|
||||
" learned to give up).\n",
|
||||
"- If the two distributions look identical, no learning effect was\n",
|
||||
" measurable with this metric \u2014 but that doesn't mean there's no effect,\n",
|
||||
" just that this particular summary didn't capture it.\n",
|
||||
"- A **bimodal** trained distribution (two humps) would mean some males\n",
|
||||
" learned and others didn't \u2014 the \"individual differences\" story in\n",
|
||||
" `docs/bimodal_hypothesis.md`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Add a stat test\n",
|
||||
"\n",
|
||||
"A formal comparison. Because group sizes are small and we don't know if\n",
|
||||
"the data are normally distributed, the\n",
|
||||
"[Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)\n",
|
||||
"is a safer default than the classic t-test.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {},
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"trained_vals = per_fly[per_fly[\"male\"] == \"trained\"][\"median_distance_px\"]\n",
|
||||
"naive_vals = per_fly[per_fly[\"male\"] == \"naive\"][\"median_distance_px\"]\n",
|
||||
"\n",
|
||||
"stat, pvalue = stats.mannwhitneyu(trained_vals, naive_vals, alternative=\"two-sided\")\n",
|
||||
"\n",
|
||||
"print(f\"trained median: {trained_vals.median():.1f} px (n={len(trained_vals)})\")\n",
|
||||
"print(f\"naive median: {naive_vals.median():.1f} px (n={len(naive_vals)})\")\n",
|
||||
"print(f\"Mann-Whitney U: {stat:.0f} p-value: {pvalue:.4f}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**How to read this**: the p-value is the probability of seeing a\n",
|
||||
"difference at least this big *if there were really no difference*. By\n",
|
||||
"convention p < 0.05 is \"interesting\", p < 0.01 is \"fairly convincing\".\n",
|
||||
"But never trust a p-value without:\n",
|
||||
"\n",
|
||||
"1. eyeballing the histogram first (you did);\n",
|
||||
"2. reporting the **effect size**, not just the p-value (e.g. the\n",
|
||||
" difference of medians);\n",
|
||||
"3. understanding that p-values\n",
|
||||
" [say nothing about practical importance](https://www.nature.com/articles/d41586-019-00857-9).\n"
|
||||
]
|
||||
},
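Point 2 above asks for an effect size. Two simple options are the difference of medians and the rank-biserial correlation, which is one common companion to the Mann-Whitney U test. The sketch below uses made-up per-fly values, and the helper `rank_biserial` is hypothetical, written out by hand so the arithmetic is visible; it is not a SciPy function.

```python
from statistics import median


def rank_biserial(a, b):
    """(pairs where a is larger - pairs where b is larger) / all pairs."""
    wins = sum(1 for x in a for y in b if x > y)
    losses = sum(1 for x in a for y in b if x < y)
    return (wins - losses) / (len(a) * len(b))


# Made-up per-fly medians (px), only to show the arithmetic.
trained_demo = [120.0, 150.0, 90.0, 200.0, 170.0]
naive_demo = [80.0, 95.0, 60.0, 110.0]

median_diff = median(trained_demo) - median(naive_demo)
effect = rank_biserial(trained_demo, naive_demo)
print(f"difference of medians: {median_diff:.1f} px")
print(f"rank-biserial r: {effect:+.2f}")
```

The value ranges from -1 to +1: 0 means neither group tends to have larger values, and +1 means every value in the first group is larger than every value in the second. To try it on the real data, pass `trained_vals` and `naive_vals` instead of the demo lists.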
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## What's next?\n",
|
||||
"\n",
|
||||
"- **Pick a different metric**: instead of median distance, try fraction\n",
|
||||
" of time the flies were within 50 px (a \"close-proximity\" metric), or\n",
|
||||
" the maximum velocity per fly. (Velocity needs identity tracking, which\n",
|
||||
" is harder \u2014 see `flies_analysis_simple.ipynb` cell 16 for an example.)\n",
|
||||
"- **Look at it per species**: re-run with `species == \"Sechellia\"` and\n",
|
||||
" compare. Does the effect generalize? Where is it strongest?\n",
|
||||
"- **Look at the bimodality**: a kernel density plot\n",
|
||||
" ([seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html))\n",
|
||||
" will show humps better than a histogram.\n",
|
||||
"- **Time inside the session**: maybe the difference only shows up in the\n",
|
||||
" first few minutes (right after the female is introduced). Slice\n",
|
||||
" `per_frame` by `t` before aggregating.\n",
|
||||
"- **Consult `docs/bimodal_hypothesis.md`**: it lays out a formal plan for\n",
|
||||
" testing the \"some flies learn, others don't\" hypothesis.\n",
|
||||
"\n",
|
||||
"When you write your own analysis, **save it as a new notebook** (don't\n",
|
||||
"edit this one). Copy the setup cells, change the question, change the\n",
|
||||
"plot. That's how analysis projects grow.\n"
|
||||
]
|
||||
},
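The "close-proximity" idea from the first bullet could look roughly like this. It is a sketch on a toy table that only mimics the shape of `per_frame`; the values are invented, and the real table also carries `date` and `machine_name`, which you would add to the grouping keys.

```python
import pandas as pd

# Toy stand-in for `per_frame`: two made-up flies, four frames each.
per_frame_demo = pd.DataFrame({
    "ROI": [1, 1, 1, 1, 2, 2, 2, 2],
    "male": ["trained"] * 4 + ["naive"] * 4,
    "distance_px": [30.0, 45.0, 120.0, 200.0, 20.0, 35.0, 40.0, 90.0],
})

CLOSE_PX = 50  # threshold suggested in the text; tune it for your arena

# A boolean column averages to "fraction of frames that were True".
close_fraction = (
    per_frame_demo
    .assign(is_close=per_frame_demo["distance_px"] < CLOSE_PX)
    .groupby(["ROI", "male"])["is_close"]
    .mean()
    .reset_index(name="close_fraction")
)
print(close_fraction)
```

The resulting one-number-per-fly table can go through the same histogram and Mann-Whitney steps as the median distance.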
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## A note on iteration speed\n",
|
||||
"\n",
|
||||
"The pipeline above is correct but **slow** because we apply a Python\n",
|
||||
"function to every (track, t) group. If you find yourself re-running the\n",
|
||||
"same expensive computation a lot, save the intermediate result to disk:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"per_frame.to_parquet(\"per_frame_distance.parquet\")\n",
|
||||
"# next time:\n",
|
||||
"per_frame = pd.read_parquet(\"per_frame_distance.parquet\")\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"`parquet` is a fast columnar format. `pip install pyarrow` if your\n",
|
||||
"environment doesn't have it.\n",
|
||||
"\n",
|
||||
"There are also vectorized ways to compute these distances ~100\u00d7 faster\n",
|
||||
"that avoid `groupby().apply()`. Don't worry about that yet \u2014 get a\n",
|
||||
"correct answer first, optimize only if you find yourself waiting.\n"
|
||||
]
|
||||
}
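For the curious, here is one possible vectorized version of the per-frame distance step. It is a sketch on a toy table with invented values and has not been timed against the real data, so treat any speed-up as something to measure yourself. The idea is to count rows per frame with `transform` and to take the first and last row of each sorted frame, instead of calling a Python function once per group.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `testing`: one track, three time points (made-up values).
# t=0 and t=40 have two detections; t=80 has only one and must be dropped.
demo = pd.DataFrame({
    "date": ["2024-09-17"] * 5,
    "machine_name": ["M1"] * 5,
    "ROI": [1] * 5,
    "male": ["trained"] * 5,
    "t": [0, 0, 40, 40, 80],
    "id": [1, 2, 1, 2, 1],
    "x": [10.0, 13.0, 20.0, 20.0, 30.0],
    "y": [10.0, 14.0, 5.0, 17.0, 30.0],
})

keys = ["date", "machine_name", "ROI", "male", "t"]

# Keep frames with exactly two detections, without a Python-level lambda.
n_rows = demo.groupby(keys)["id"].transform("size")
two_fly_demo = demo[n_rows == 2].sort_values(keys + ["id"])

# After sorting, "first" and "last" in each frame are the two flies.
ends = two_fly_demo.groupby(keys)[["x", "y"]].agg(["first", "last"])
per_frame_fast = pd.DataFrame({
    "distance_px": np.hypot(
        ends[("x", "first")] - ends[("x", "last")],
        ends[("y", "first")] - ends[("y", "last")],
    )
}).reset_index()
print(per_frame_fast)
```

Before switching, check on a small subset that this gives the same numbers as the `groupby().apply()` version.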
|
||||
]
|
||||
}
|
||||
15
notebooks/getting_started/README.md
Normal file
|
|
@ -0,0 +1,15 @@
|
|||
# Tutorial notebooks

Read these in order:

1. **`00_welcome.ipynb`**: what's the project, where the data lives,
   how to use a Jupyter notebook.
2. **`01_python_pandas_basics.ipynb`**: minimum Python and pandas you
   need to read project code.
3. **`02_explore_one_database.ipynb`**: open one tracking DB, plot a
   trajectory, compute a single distance.
4. **`03_compare_trained_vs_naive.ipynb`**: first real analysis,
   comparing groups.

After these, the notebooks one level up (`flies_analysis*.ipynb`) walk
through the full analysis pipeline that the previous student built.
|
||||