Personal copy of all_video_info_merged.tsv now lives at ~/cupido/data/metadata/all_video_info_merged.tsv (gitignored) instead of ~/cupido_metadata.tsv. That sits next to the other small metadata CSVs (barrier_opening, etc.) — the natural home for it. Updated all five notebooks and processed/README accordingly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 · Python and pandas — just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
"\n",
"If you've never programmed before, don't try to memorize the syntax.\n",
"Just run each cell, read what it does, and come back when you're stuck on\n",
"something specific. The cheat sheet at the end is the only thing worth\n",
"keeping handy.\n",
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- 🦘 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- 🐍 [Official Python tutorial — chapters 3–5](https://docs.python.org/3/tutorial/introduction.html)\n",
"- 🐼 [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- 📚 [Python for Data Analysis (the book)](https://wesmckinney.com/book/) — free online\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Variables\n",
"\n",
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
"not equality. Read it as \"make `name` refer to `value`\".\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"x = 5\n",
"y = 3\n",
"total = x + y\n",
"print(total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
"answer. Try it.\n",
"\n",
"Variable names can contain letters, digits, and underscores, but they\n",
"can't start with a digit. Convention is `snake_case`: `mean_distance`, not\n",
"`meanDistance` or `MeanDistance`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Strings and numbers\n",
"\n",
"A **string** is text in quotes. You can join strings with `+`. You can\n",
"turn a number into a string with `str()`, and vice versa with `int()` /\n",
"`float()`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"species = \"Drosophila melanogaster\"\n",
"n_flies = 12\n",
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings — f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Lists\n",
"\n",
"A list is an ordered collection of things. Square brackets, items\n",
"separated by commas. You can mix types (but usually shouldn't).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0]) # first item — Python counts from 0!\n",
"print(machines[-1]) # last item\n",
"print(len(machines)) # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"]) # concatenate (returns a new list)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dictionaries\n",
"\n",
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
"pairs.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
"print(fly[\"species\"])\n",
"print(fly[\"age_days\"])\n",
"fly[\"alive\"] = False # add a new key\n",
"print(fly)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conditions: if / elif / else\n",
"\n",
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
"Combine with `and`, `or`, `not`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"\n",
"if distance_px < 50:\n",
"    label = \"close\"\n",
"elif distance_px < 200:\n",
"    label = \"medium\"\n",
"else:\n",
"    label = \"far\"\n",
"\n",
"print(label)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Loops\n",
"\n",
"`for x in collection:` runs the indented block once per item.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"for m in machines:\n",
"    print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
"    print(f\"{i}: {m}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Functions\n",
"\n",
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
"sends a value back to whoever called it.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"def fly_age_in_weeks(days):\n",
"    \"\"\"Return age in weeks given age in days.\"\"\"\n",
"    return days / 7\n",
"\n",
"print(fly_age_in_weeks(14)) # 2.0\n",
"print(fly_age_in_weeks(5)) # 0.714…\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Importing libraries\n",
"\n",
"A library is somebody else's code. We use `import` to pull it into our\n",
"notebook.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import math\n",
"print(math.sqrt(16)) # 4.0\n",
"print(math.pi)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number — it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** — a single column with a name.\n",
"- **`DataFrame`** — a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"from pathlib import Path\n",
"\n",
"# Two locations to know about:\n",
"# - DATA_DIR : where the project's bulky data lives (mounted read-only)\n",
"# - REPO_ROOT : where the code repo is checked out (inside your home directory)\n",
"DATA_DIR = Path(\"/mnt/data/projects/cupido\")\n",
"REPO_ROOT = Path.home() / \"cupido\"\n",
"\n",
"# Pick the metadata TSV: prefer your personal copy (in the repo's\n",
"# data/metadata/ folder, gitignored) if you have one, otherwise fall\n",
"# back to the shared (read-only) master on the data volume. To make a\n",
"# personal copy you can edit, run ONCE in a terminal:\n",
"#     cp /mnt/data/projects/cupido/all_video_info_merged.tsv ~/cupido/data/metadata/\n",
"SHARED_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\n",
"PERSONAL_TSV = REPO_ROOT / \"data\" / \"metadata\" / \"all_video_info_merged.tsv\"\n",
"tsv_path = PERSONAL_TSV if PERSONAL_TSV.exists() else SHARED_TSV\n",
"\n",
"# Read the project's metadata TSV (Tab-Separated Values).\n",
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"\n",
"# How big is it?\n",
"print(f\"Reading from: {tsv_path}\")\n",
"print(f\"Rows: {len(df)}\")\n",
"print(f\"Columns: {df.shape[1]}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Looking at the table\n",
"\n",
"`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n",
"column names. `.dtypes` shows the type of each column.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"print(\"Column names:\")\n",
"for c in df.columns:\n",
"    print(f\" {c}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Selecting columns\n",
"\n",
"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name is a valid Python identifier (no spaces or\n",
"punctuation) and doesn't clash with an existing DataFrame attribute such\n",
"as `df.size`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[\"species\"].head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.species.value_counts() # how many rows per species\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Selecting multiple columns\n",
"\n",
"Pass a **list** of names inside the brackets:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filtering rows\n",
"\n",
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
"Pandas keeps the rows where it's `True`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained = df[df[\"male\"] == \"trained\"]\n",
"print(f\"trained rows: {len(trained)}\")\n",
"\n",
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) or | (or). Wrap each part in parentheses,\n",
"# because & and | bind more tightly than ==.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Grouping and counting\n",
"\n",
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
"splits the table by the values in that column and computes something per\n",
"group.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How many ROIs per (species, training condition)?\n",
"df.groupby([\"species\", \"male\"]).size()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Quick plots\n",
"\n",
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# How many rows per machine?\n",
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
"plt.ylabel(\"rows\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Exercises\n",
"\n",
"Don't skip these. They're how you find out what you actually understood.\n",
"\n",
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
"   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 1 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 2 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 3 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 4 here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cheat sheet\n",
"\n",
"```python\n",
"import pandas as pd\n",
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\") # read\n",
"df.head(); df.tail(); df.shape; df.columns # peek\n",
"df[\"col\"]; df[[\"a\", \"b\"]] # select\n",
"df[df[\"col\"] == \"value\"] # filter\n",
"df.groupby(\"col\").size() # count per group\n",
"df.groupby(\"col\")[\"x\"].mean() # mean of x per group\n",
"df[\"col\"].value_counts() # quick counts\n",
"df[\"col\"].unique() # unique values\n",
"df[\"new_col\"] = df[\"w\"] * df[\"h\"] # derived column\n",
"df.sort_values(\"col\", ascending=False) # sort\n",
"df.plot(...) # quick plot\n",
"```\n",
"\n",
"Keep this list open when reading other people's code. Most of pandas is\n",
"just combinations of these primitives. When you need more, the official\n",
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
"is excellent.\n"
]
}
]
}