cupido/notebooks/getting_started/01_python_pandas_basics.ipynb
Giorgio Gilestro f08e4b843d Per-user metadata TSV — auto-prefer ~/cupido_metadata.tsv if present
The shared TSV at /mnt/data/projects/cupido/ is read-only inside the
container, so users who want to customize the `include` column (or any
metadata) need a personal copy. Notebooks now check for
~/cupido_metadata.tsv first and fall back to the shared master if it
doesn't exist. Each user keeps their own edits without stepping on
anyone else's analysis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 09:25:24 +01:00

{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 · Python and pandas — just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
"\n",
"If you've never programmed before, don't try to memorize the syntax.\n",
"Just run each cell, read what it does, and come back when you're stuck on\n",
"something specific. The cheat sheet at the end is the only thing worth\n",
"keeping handy.\n",
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- 🦘 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- 🐍 [Official Python tutorial — chapters 3-5](https://docs.python.org/3/tutorial/introduction.html)\n",
"- 🐼 [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- 📚 [Python for Data Analysis (the book)](https://wesmckinney.com/book/) — free online\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Variables\n",
"\n",
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
"not equality. Read it as \"make `name` refer to `value`\".\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"x = 5\n",
"y = 3\n",
"total = x + y\n",
"print(total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
"answer. Try it.\n",
"\n",
"Variable names can contain letters, digits, and underscores, but can't\n",
"start with a digit. The convention is lowercase `snake_case`:\n",
"`mean_distance`, not `meanDistance` or `MeanDistance`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Strings and numbers\n",
"\n",
"A **string** is text in quotes. You can join strings with `+`. You can\n",
"turn a number into a string with `str()`, and a string back into a number\n",
"with `int()` or `float()`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"species = \"Drosophila melanogaster\"\n",
"n_flies = 12\n",
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings — f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
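{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going the other way, `int()` and `float()` turn text into numbers. A\n",
"minimal sketch with made-up values (`n_text` and `temp_text` are\n",
"hypothetical names, not project data):\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"n_text = \"12\"\n",
"temp_text = \"24.5\"\n",
"\n",
"n = int(n_text)          # whole number from text\n",
"temp = float(temp_text)  # decimal number from text\n",
"\n",
"print(n + 1)     # 13\n",
"print(temp * 2)  # 49.0\n",
"\n",
"# int(\"24.5\") would raise a ValueError: use float() for text with a decimal point.\n"
]
},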
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Lists\n",
"\n",
"A list is an ordered collection of things. Square brackets, items\n",
"separated by commas. You can mix types (but usually shouldn't).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0]) # first item — Python counts from 0!\n",
"print(machines[-1]) # last item\n",
"print(len(machines)) # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dictionaries\n",
"\n",
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
"pairs.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
"print(fly[\"species\"])\n",
"print(fly[\"age_days\"])\n",
"fly[\"alive\"] = False # add a new key\n",
"print(fly)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conditions: if / elif / else\n",
"\n",
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
"Combine with `and`, `or`, `not`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"\n",
"if distance_px < 50:\n",
"    label = \"close\"\n",
"elif distance_px < 200:\n",
"    label = \"medium\"\n",
"else:\n",
"    label = \"far\"\n",
"\n",
"print(label)\n"
]
},
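{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above uses one comparison per branch. As a small sketch of\n",
"combining comparisons with `and`, `or`, and `not`, assuming a hypothetical\n",
"`is_trained` flag that is not part of the project data:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"is_trained = True  # hypothetical flag, for illustration only\n",
"\n",
"close_and_trained = distance_px < 50 and is_trained\n",
"extreme = distance_px < 10 or distance_px > 200\n",
"\n",
"print(close_and_trained)  # True\n",
"print(extreme)            # False\n",
"print(not is_trained)     # False\n",
"\n",
"if close_and_trained and not extreme:\n",
"    print(\"close, trained, and not an extreme distance\")\n"
]
},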
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Loops\n",
"\n",
"`for x in collection:` runs the indented block once per item.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"for m in machines:\n",
"    print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
"    print(f\"{i}: {m}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Functions\n",
"\n",
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
"sends a value back to whoever called it.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"def fly_age_in_weeks(days):\n",
"    \"\"\"Return age in weeks given age in days.\"\"\"\n",
"    return days / 7\n",
"\n",
"print(fly_age_in_weeks(14)) # 2.0\n",
"print(fly_age_in_weeks(5)) # 0.714…\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Importing libraries\n",
"\n",
"A library is somebody else's code. We use `import` to pull it into our\n",
"notebook.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import math\n",
"print(math.sqrt(16)) # 4.0\n",
"print(math.pi)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number — it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** — a single column with a name.\n",
"- **`DataFrame`** — a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": "import pandas as pd\nfrom pathlib import Path\n\n# All the project's bulky data lives under /mnt/data/projects/cupido/.\n# Defining one DATA_DIR variable and building sub-paths from it is much\n# easier to read (and to update) than hard-coding long strings everywhere.\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\n\n# Pick the metadata TSV: prefer your personal copy if you have one,\n# otherwise fall back to the shared (read-only) master. To make a\n# personal copy you can edit, run ONCE in a terminal:\n# cp /mnt/data/projects/cupido/all_video_info_merged.tsv ~/cupido_metadata.tsv\nSHARED_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nPERSONAL_TSV = Path.home() / \"cupido_metadata.tsv\"\ntsv_path = PERSONAL_TSV if PERSONAL_TSV.exists() else SHARED_TSV\n\n# Read the project's metadata TSV (Tab-Separated Values).\ndf = pd.read_csv(tsv_path, sep=\"\\t\")\n\n# How big is it?\nprint(f\"Reading from: {tsv_path}\")\nprint(f\"Rows: {len(df)}\")\nprint(f\"Columns: {df.shape[1]}\")\n"
},
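{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the two objects side by side without depending on the project\n",
"file, here is a toy table built by hand. This is only a sketch: the name\n",
"`toy` and its values are invented, and only the column names mirror the\n",
"project's metadata.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented rows, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"species\": [\"Melanogaster/CS\", \"Sechellia\", \"Sechellia\"],\n",
"    \"roi\": [1, 2, 3],\n",
"})\n",
"\n",
"print(type(toy))             # the whole table is a DataFrame\n",
"print(type(toy[\"species\"]))  # a single column is a Series\n",
"print(toy[\"species\"].name)   # the Series remembers its column name\n"
]
},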
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Looking at the table\n",
"\n",
"`.head()` shows the first 5 rows and `.tail()` the last 5; pass a number\n",
"to change that, as in `.head(3)`. `.columns` lists the column names and\n",
"`.dtypes` shows the type of each column.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"print(\"Column names:\")\n",
"for c in df.columns:\n",
"    print(f\" {c}\")\n"
]
},
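{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.tail()` and `.dtypes` are not used on `df` above, so here is a minimal\n",
"sketch of both on a hand-made table. The name `toy`, the `distance_px`\n",
"column, and all values are assumptions made up for this example.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented rows, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"machine_name\": [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"],\n",
"    \"roi\": [1, 2, 3],\n",
"    \"distance_px\": [41.5, 180.0, 260.2],\n",
"})\n",
"\n",
"print(toy.tail(2))  # the last 2 rows\n",
"print(toy.dtypes)   # one type per column: text, whole numbers, decimals\n",
"print(toy.shape)    # (number of rows, number of columns)\n"
]
},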
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Selecting columns\n",
"\n",
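"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name is a valid Python identifier (no spaces or\n",
"punctuation) and doesn't clash with an existing DataFrame attribute such\n",
"as `size` or `count`.\n"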
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[\"species\"].head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.species.value_counts() # how many rows per species\n"
]
},
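{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of where attribute access breaks down. The table `toy`\n",
"and its `age days` and `size` columns are invented for this example; they\n",
"are not columns of the project's metadata.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    \"species\": [\"Sechellia\", \"Sechellia\"],\n",
"    \"age days\": [5, 6],  # made-up column with a space in its name\n",
"    \"size\": [1.2, 1.4],  # made-up column that shares its name with DataFrame.size\n",
"})\n",
"\n",
"print(toy.species.tolist())      # attribute access works for a simple name\n",
"print(toy[\"age days\"].tolist())  # a name with a space needs brackets\n",
"print(toy[\"size\"].tolist())      # brackets return the column\n",
"print(toy.size)                  # NOT the column: the number of cells in the table\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When in doubt, use brackets: they always mean \"the column with this name\".\n"
]
},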
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Selecting multiple columns\n",
"\n",
"Pass a **list** of names inside the brackets:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filtering rows\n",
"\n",
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
"Pandas keeps the rows where it's `True`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained = df[df[\"male\"] == \"trained\"]\n",
"print(f\"trained rows: {len(trained)}\")\n",
"\n",
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) | (or) — and wrap each part in parentheses.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]
},
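{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can help to look at the condition on its own before using it as a\n",
"filter. A minimal sketch on a hand-made table (the name `toy`, the name\n",
"`mask`, and the value `\"naive\"` are assumptions for illustration):\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented rows, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"species\": [\"Melanogaster/CS\", \"Sechellia\", \"Melanogaster/CS\"],\n",
"    \"male\": [\"trained\", \"trained\", \"naive\"],\n",
"})\n",
"\n",
"mask = toy[\"male\"] == \"trained\"  # one True/False per row\n",
"print(mask.tolist())  # [True, True, False]\n",
"print(mask.sum())     # True counts as 1, so this is the number of matching rows\n",
"print(toy[mask])      # only the rows where the mask is True\n"
]
},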
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Grouping and counting\n",
"\n",
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
"splits the table by the values in that column and computes something per\n",
"group.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How many ROIs per (species, training condition)?\n",
"df.groupby([\"species\", \"male\"]).size()\n"
]
},
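{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only counts rows. As a sketch of a different aggregator,\n",
"here is `.mean()` on a hand-made table. The name `toy`, the `distance_px`\n",
"column, and its values are made up for this example.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Invented rows, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"species\": [\"Melanogaster/CS\", \"Melanogaster/CS\", \"Sechellia\", \"Sechellia\"],\n",
"    \"distance_px\": [40.0, 60.0, 100.0, 300.0],\n",
"})\n",
"\n",
"counts = toy.groupby(\"species\").size()                # rows per species\n",
"means = toy.groupby(\"species\")[\"distance_px\"].mean()  # mean distance per species\n",
"\n",
"print(counts)\n",
"print(means)\n"
]
},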
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Quick plots\n",
"\n",
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# How many rows per machine?\n",
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
"plt.ylabel(\"rows\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Exercises\n",
"\n",
"Don't skip these. They're how you find out what you actually understood.\n",
"\n",
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
"   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 1 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 2 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 3 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 4 here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cheat sheet\n",
"\n",
"```python\n",
"import pandas as pd\n",
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n",
"df.head(); df.tail(); df.shape; df.columns # peek\n",
"df[\"col\"]; df[[\"a\", \"b\"]] # select\n",
"df[df[\"col\"] == \"value\"] # filter\n",
"df.groupby(\"col\").size() # count per group\n",
"df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n",
"df[\"col\"].value_counts() # quick counts\n",
"df[\"col\"].unique() # unique values\n",
"df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n",
"df.sort_values(\"col\", ascending=False) # sort\n",
"df.plot(...) # quick plot\n",
"```\n",
"\n",
"Keep this list open when reading other people's code. Most of pandas is\n",
"just combinations of these primitives. When you need more, the official\n",
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
"is excellent.\n"
]
}
]
}