Add beginner tutorial notebooks for incoming students

Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 18:14:17 +01:00 · 2026-04-30 18:14:17 +01:00 · ec56e51bf9
commit ec56e51bf9
parent 7d09523840
5 changed files with 1607 additions and 0 deletions
--- a/notebooks/getting_started/01_python_pandas_basics.ipynb
+++ b/notebooks/getting_started/01_python_pandas_basics.ipynb
@@ -0,0 +1,500 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 4,
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
+    "\n",
+    "This notebook teaches the **minimum** Python and `pandas` you need to read\n",
+    "the rest of the project's code and write your own analyses.\n",
+    "\n",
+    "If you've never programmed before, don't try to memorize the syntax.\n",
+    "Just run each cell, read what it does, and come back when you're stuck on\n",
+    "something specific. The cheat sheet at the end is the only thing worth\n",
+    "keeping handy.\n",
+    "\n",
+    "External resources, in order of how much time they take:\n",
+    "\n",
+    "- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
+    "- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
+    "- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
+    "- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1.  Variables\n",
+    "\n",
+    "A variable is a named box you put a value into. The `=` is **assignment**,\n",
+    "not equality. Read it as \"make `name` refer to `value`\".\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "x = 5\n",
+    "y = 3\n",
+    "total = x + y\n",
+    "print(total)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
+    "answer. Try it.\n",
+    "\n",
+    "Variable names can contain letters, digits, and underscores, but can't\n",
+    "start with a digit. Convention is lowercase `snake_case`: `mean_distance`,\n",
+    "not `meanDistance` or `MeanDistance`.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2.  Strings and numbers\n",
+    "\n",
+    "A **string** is text in quotes. You can join strings with `+`. You can\n",
+    "turn a number into a string with `str()`, and vice versa with `int()` /\n",
+    "`float()`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "species = \"Drosophila melanogaster\"\n",
+    "n_flies = 12\n",
+    "message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
+    "print(message)\n",
+    "\n",
+    "# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
+    "print(f\"We tracked {n_flies} {species} males.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3.  Lists\n",
+    "\n",
+    "A list is an ordered collection of things. Square brackets, items\n",
+    "separated by commas. You can mix types (but usually shouldn't).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
+    "print(machines[0])         # first item \u2014 Python counts from 0!\n",
+    "print(machines[-1])        # last item\n",
+    "print(len(machines))       # how many items\n",
+    "print(machines + [\"ETHOSCOPE_140\"])  # concatenate (returns a new list)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4.  Dictionaries\n",
+    "\n",
+    "A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
+    "pairs.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
+    "print(fly[\"species\"])\n",
+    "print(fly[\"age_days\"])\n",
+    "fly[\"alive\"] = False         # add a new key\n",
+    "print(fly)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.  Conditions: if / elif / else\n",
+    "\n",
+    "Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
+    "Combine with `and`, `or`, `not`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "distance_px = 42\n",
+    "\n",
+    "if distance_px < 50:\n",
+    "    label = \"close\"\n",
+    "elif distance_px < 200:\n",
+    "    label = \"medium\"\n",
+    "else:\n",
+    "    label = \"far\"\n",
+    "\n",
+    "print(label)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6.  Loops\n",
+    "\n",
+    "`for x in collection:` runs the indented block once per item.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "for m in machines:\n",
+    "    print(f\"Looking at machine {m}\")\n",
+    "\n",
+    "# Looping with an index, when you need it:\n",
+    "for i, m in enumerate(machines):\n",
+    "    print(f\"{i}: {m}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7.  Functions\n",
+    "\n",
+    "A function is a named, reusable chunk of code. `def` declares it. `return`\n",
+    "sends a value back to whoever called it.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "def fly_age_in_weeks(days):\n",
+    "    \"\"\"Return age in weeks given age in days.\"\"\"\n",
+    "    return days / 7\n",
+    "\n",
+    "print(fly_age_in_weeks(14))    # 2.0\n",
+    "print(fly_age_in_weeks(5))     # 0.714\u2026\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8.  Importing libraries\n",
+    "\n",
+    "A library is somebody else's code. We use `import` to pull it into our\n",
+    "notebook.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "print(math.sqrt(16))   # 4.0\n",
+    "print(math.pi)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9.  Meet pandas\n",
+    "\n",
+    "Real data is rarely a single number \u2014 it's a **table** with rows and\n",
+    "columns (think Excel). `pandas` is the library that handles tables in\n",
+    "Python. The two main objects are:\n",
+    "\n",
+    "- **`Series`** \u2014 a single column with a name.\n",
+    "- **`DataFrame`** \u2014 a whole table.\n",
+    "\n",
+    "By convention we import pandas as `pd`. Always.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Read the project's metadata TSV (tab-separated values). Edit tsv_path if your copy lives elsewhere.\n",
+    "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
+    "df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
+    "\n",
+    "# How big is it?\n",
+    "print(f\"Rows: {len(df)}\")\n",
+    "print(f\"Columns: {df.shape[1]}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10.  Looking at the table\n",
+    "\n",
+    "`.head()` shows the first 5 rows and `.tail()` the last 5 (pass a number, e.g. `.head(3)`, to change that).\n",
+    "`.columns` lists the column names and `.dtypes` shows the type of each column.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "df.head(3)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "print(\"Column names:\")\n",
+    "for c in df.columns:\n",
+    "    print(f\"  {c}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 11.  Selecting columns\n",
+    "\n",
+    "Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
+    "attribute access (`df.name`). The first works for any column name; the\n",
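+    "second only works if the name is a valid Python identifier (no spaces or punctuation) and doesn't clash with an existing DataFrame attribute or method such as `size` or `mean`.\n"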
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "df[\"species\"].head()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "df.species.value_counts()    # how many rows per species\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 12.  Selecting multiple columns\n",
+    "\n",
+    "Pass a **list** of names inside the brackets:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 13.  Filtering rows\n",
+    "\n",
+    "The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
+    "Pandas keeps the rows where it's `True`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "trained = df[df[\"male\"] == \"trained\"]\n",
+    "print(f\"trained rows: {len(trained)}\")\n",
+    "\n",
+    "mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
+    "print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
+    "\n",
+    "# Combine conditions with & (and) or | (or), and wrap each condition in parentheses.\n",
+    "trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
+    "print(f\"trained Mel rows: {len(trained_mel)}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 14.  Grouping and counting\n",
+    "\n",
+    "`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
+    "splits the table by the values in that column and computes something per\n",
+    "group.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# How many ROIs per (species, training condition)?\n",
+    "df.groupby([\"species\", \"male\"]).size()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 15.  Quick plots\n",
+    "\n",
+    "DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# How many rows per machine?\n",
+    "df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
+    "plt.title(\"Number of fly-rows per ethoscope machine\")\n",
+    "plt.ylabel(\"rows\")\n",
+    "plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 16.  Exercises\n",
+    "\n",
+    "Don't skip these. They're how you find out what you actually understood.\n",
+    "\n",
+    "1. How many rows does `df` have where `age` equals `'5-7'`?\n",
+    "2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
+    "3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
+    "   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape` returns a tuple `(n,)`, where `n` is the number of pairs.)\n",
+    "4. Make a bar plot of `species` counts. Which species has the most rows?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# Try exercise 1 here\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# Try exercise 2 here\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# Try exercise 3 here\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# Try exercise 4 here\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cheat sheet\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "df = pd.read_csv(\"file.tsv\", sep=\"\\t\")     # read\n",
+    "df.head(); df.tail(); df.shape; df.columns  # peek\n",
+    "df[\"col\"]; df[[\"a\", \"b\"]]                    # select\n",
+    "df[df[\"col\"] == \"value\"]                     # filter\n",
+    "df.groupby(\"col\").size()                     # count per group\n",
+    "df.groupby(\"col\")[\"x\"].mean()                # mean of x per group\n",
+    "df[\"col\"].value_counts()                     # quick counts\n",
+    "df[\"col\"].unique()                           # unique values\n",
+    "df[\"new_col\"] = df[\"w\"] * df[\"h\"]            # derived column\n",
+    "df.sort_values(\"col\", ascending=False)       # sort\n",
+    "df.plot(...)                                 # quick plot\n",
+    "```\n",
+    "\n",
+    "Keep this list open when reading other people's code. Most of pandas is\n",
+    "just combinations of these primitives. When you need more, the official\n",
+    "[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
+    "is excellent.\n"
+   ]
+  }
+ ]
+}