Four guided notebooks under notebooks/getting_started/ aimed at someone new to Python and data science. The series progresses: project orientation → Python/pandas crash course → exploring one tracking DB → first trained-vs-naive comparison using load_roi_data + Mann-Whitney U. Each notebook leans heavily on markdown explanations, includes exercises with empty cells, and links out to canonical references (JupyterLab, official Python tutorial, pandas 10-min guide, Wikipedia for stats concepts). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
500 lines
13 KiB
Text
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
"\n",
"This notebook teaches the **minimum** Python and `pandas` you need to read\n",
"the rest of the project's code and write your own analyses.\n",
"\n",
"If you've never programmed before, don't try to memorize the syntax.\n",
"Just run each cell, read what it does, and come back when you're stuck on\n",
"something specific. The cheat sheet at the end is the only thing worth\n",
"keeping handy.\n",
"\n",
"External resources, in order of how much time they take:\n",
"\n",
"- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
"- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
"- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Variables\n",
"\n",
"A variable is a named box you put a value into. The `=` is **assignment**,\n",
"not equality. Read it as \"make `name` refer to `value`\".\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"x = 5\n",
"y = 3\n",
"total = x + y\n",
"print(total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
"answer. Try it.\n",
"\n",
"Variable names may contain letters, digits, and underscores, but they can't\n",
"start with a digit. Convention is lowercase `snake_case`: `mean_distance`, not\n",
"`meanDistance` or `MeanDistance`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Strings and numbers\n",
"\n",
"A **string** is text in quotes. You can join strings with `+`. You can\n",
"turn a number into a string with `str()`, and vice versa with `int()` /\n",
"`float()`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"species = \"Drosophila melanogaster\"\n",
"n_flies = 12\n",
"message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
"print(message)\n",
"\n",
"# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
"print(f\"We tracked {n_flies} {species} males.\")\n"
]
},
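{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversions in the other direction, from text to numbers, can be\n",
"sketched like this. The names and values (`n_text`, `weight_text`) are made\n",
"up for illustration and are not taken from the project's data.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Hypothetical values, for illustration only.\n",
"n_text = \"12\"                        # a string that happens to look like a number\n",
"n_flies_from_text = int(n_text)      # string -> whole number\n",
"print(n_flies_from_text + 1)         # 13\n",
"\n",
"weight_text = \"0.85\"\n",
"weight_mg = float(weight_text)       # string -> decimal number\n",
"print(weight_mg * 2)                 # 1.7\n",
"\n",
"# Joining text and a number with + needs str() first.\n",
"print(\"n = \" + str(n_flies_from_text))  # n = 12\n"
]
},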
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Lists\n",
"\n",
"A list is an ordered collection of things. Square brackets, items\n",
"separated by commas. You can mix types (but usually shouldn't).\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
"print(machines[0])  # first item \u2014 Python counts from 0!\n",
"print(machines[-1])  # last item\n",
"print(len(machines))  # how many items\n",
"print(machines + [\"ETHOSCOPE_140\"])  # concatenate (returns a new list)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dictionaries\n",
"\n",
"A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
"pairs.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
"print(fly[\"species\"])\n",
"print(fly[\"age_days\"])\n",
"fly[\"alive\"] = False  # add a new key\n",
"print(fly)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conditions: if / elif / else\n",
"\n",
"Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
"Combine with `and`, `or`, `not`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"distance_px = 42\n",
"\n",
"if distance_px < 50:\n",
"    label = \"close\"\n",
"elif distance_px < 200:\n",
"    label = \"medium\"\n",
"else:\n",
"    label = \"far\"\n",
"\n",
"print(label)\n"
]
},
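{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only uses `<`. A minimal sketch of combining comparisons\n",
"with `and`, `or` and `not` could look like this. The `is_trained` flag and\n",
"the thresholds are hypothetical, chosen only to show the syntax.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Hypothetical values, for illustration only.\n",
"distance_px = 42\n",
"is_trained = True\n",
"\n",
"# `and` needs both sides to be true.\n",
"if distance_px < 50 and is_trained:\n",
"    print(\"close, trained fly\")\n",
"\n",
"# `or` needs at least one side to be true.\n",
"if distance_px < 10 or distance_px > 200:\n",
"    print(\"very close or far\")\n",
"else:\n",
"    print(\"somewhere in between\")\n",
"\n",
"# `not` flips True to False and back.\n",
"print(not is_trained)  # False\n"
]
},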
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Loops\n",
"\n",
"`for x in collection:` runs the indented block once per item.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"for m in machines:\n",
"    print(f\"Looking at machine {m}\")\n",
"\n",
"# Looping with an index, when you need it:\n",
"for i, m in enumerate(machines):\n",
"    print(f\"{i}: {m}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Functions\n",
"\n",
"A function is a named, reusable chunk of code. `def` declares it. `return`\n",
"sends a value back to whoever called it.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"def fly_age_in_weeks(days):\n",
"    \"\"\"Return age in weeks given age in days.\"\"\"\n",
"    return days / 7\n",
"\n",
"print(fly_age_in_weeks(14))  # 2.0\n",
"print(fly_age_in_weeks(5))  # 0.714\u2026\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Importing libraries\n",
"\n",
"A library is somebody else's code. We use `import` to pull it into our\n",
"notebook.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import math\n",
"print(math.sqrt(16))  # 4.0\n",
"print(math.pi)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Meet pandas\n",
"\n",
"Real data is rarely a single number \u2014 it's a **table** with rows and\n",
"columns (think Excel). `pandas` is the library that handles tables in\n",
"Python. The two main objects are:\n",
"\n",
"- **`Series`** \u2014 a single column with a name.\n",
"- **`DataFrame`** \u2014 a whole table.\n",
"\n",
"By convention we import pandas as `pd`. Always.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Read the project's metadata TSV (Tab-Separated Values).\n",
"tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
"df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
"\n",
"# How big is it?\n",
"print(f\"Rows: {len(df)}\")\n",
"print(f\"Columns: {df.shape[1]}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Looking at the table\n",
"\n",
"`.head()` shows the first 5 rows. `.tail()` the last 5. `.columns` lists\n",
"column names. `.dtypes` shows the type of each column.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.head(3)\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"print(\"Column names:\")\n",
"for c in df.columns:\n",
"    print(f\" {c}\")\n"
]
},
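{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.tail()` and `.dtypes` are not run above, so here is a small sketch on a\n",
"toy table. The `toy` DataFrame and its values are invented for this example;\n",
"the real `df` has different columns and rows.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy table with made-up values, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"machine_name\": [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"],\n",
"    \"roi\": [1, 2, 3],\n",
"    \"distance_px\": [42.0, 130.5, 250.0],\n",
"})\n",
"\n",
"print(toy.tail(2))  # last 2 rows\n",
"print(toy.dtypes)   # one type per column (text, whole numbers, decimals)\n",
"print(toy.shape)    # (3, 3): rows, columns\n"
]
},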
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Selecting columns\n",
"\n",
"Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
"attribute access (`df.name`). The first works for any column name; the\n",
"second only works if the name is a valid Python identifier (no spaces or\n",
"weird characters) and doesn't clash with an existing DataFrame attribute\n",
"or method such as `size` or `mean`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[\"species\"].head()\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df.species.value_counts()  # how many rows per species\n"
]
},
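{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of why bracket-indexing is the safer default. The `toy`\n",
"table and its column names are invented for this example: one name contains\n",
"a space, the other matches an attribute that every DataFrame already has.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy table with made-up column names, for illustration only.\n",
"toy = pd.DataFrame({\"age days\": [5, 6, 7], \"size\": [1.1, 1.3, 1.2]})\n",
"\n",
"print(toy[\"age days\"])  # brackets cope with the space in the name\n",
"print(toy[\"size\"])      # brackets return the column called \"size\"\n",
"print(toy.size)          # attribute access finds DataFrame.size instead: 6\n"
]
},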
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Selecting multiple columns\n",
"\n",
"Pass a **list** of names inside the brackets:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filtering rows\n",
"\n",
"The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
"Pandas keeps the rows where it's `True`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"trained = df[df[\"male\"] == \"trained\"]\n",
"print(f\"trained rows: {len(trained)}\")\n",
"\n",
"mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
"print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
"\n",
"# Combine conditions with & (and) | (or) \u2014 and wrap each part in parentheses.\n",
"trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
"print(f\"trained Mel rows: {len(trained_mel)}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Grouping and counting\n",
"\n",
"`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
"splits the table by the values in that column and computes something per\n",
"group.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# How many ROIs per (species, training condition)?\n",
"df.groupby([\"species\", \"male\"]).size()\n"
]
},
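{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above counts rows with `.size()`. A sketch of the `.mean()`\n",
"aggregator on a toy table follows. The `toy` DataFrame, its species labels\n",
"and its `distance_px` values are made up for illustration and do not come\n",
"from the project's data.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy table with made-up values, for illustration only.\n",
"toy = pd.DataFrame({\n",
"    \"species\": [\"A\", \"A\", \"B\", \"B\"],\n",
"    \"distance_px\": [10.0, 30.0, 100.0, 300.0],\n",
"})\n",
"\n",
"print(toy.groupby(\"species\").size())                  # rows per species: A 2, B 2\n",
"print(toy.groupby(\"species\")[\"distance_px\"].mean())  # mean per species: A 20.0, B 200.0\n"
]
},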
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Quick plots\n",
"\n",
"DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# How many rows per machine?\n",
"df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
"plt.title(\"Number of fly-rows per ethoscope machine\")\n",
"plt.ylabel(\"rows\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Exercises\n",
"\n",
"Don't skip these. They're how you find out what you actually understood.\n",
"\n",
"1. How many rows does `df` have where `age` equals `'5-7'`?\n",
"2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
"3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
"   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape`.)\n",
"4. Make a bar plot of `species` counts. Which species has the most rows?\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 1 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 2 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 3 here\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# Try exercise 4 here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cheat sheet\n",
"\n",
"```python\n",
"import pandas as pd\n",
"df = pd.read_csv(\"file.tsv\", sep=\"\\t\")  # read\n",
"df.head(); df.tail(); df.shape; df.columns  # peek\n",
"df[\"col\"]; df[[\"a\", \"b\"]]  # select\n",
"df[df[\"col\"] == \"value\"]  # filter\n",
"df.groupby(\"col\").size()  # count per group\n",
"df.groupby(\"col\")[\"x\"].mean()  # mean of x per group\n",
"df[\"col\"].value_counts()  # quick counts\n",
"df[\"col\"].unique()  # unique values\n",
"df[\"new_col\"] = df[\"w\"] * df[\"h\"]  # derived column\n",
"df.sort_values(\"col\", ascending=False)  # sort\n",
"df.plot(...)  # quick plot\n",
"```\n",
"\n",
"Keep this list open when reading other people's code. Most of pandas is\n",
"just combinations of these primitives. When you need more, the official\n",
"[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
"is excellent.\n"
]
}
]
}