cupido/notebooks/getting_started/01_python_pandas_basics.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 \u00b7 Python and pandas \u2014 just enough to be dangerous\n",
    "\n",
    "This notebook teaches the **minimum** Python and `pandas` you need to read\n",
    "the rest of the project's code and write your own analyses.\n",
    "\n",
    "If you've never programmed before, don't try to memorize the syntax.\n",
    "Just run each cell, read what it does, and come back when you're stuck on\n",
    "something specific. The cheat sheet at the end is the only thing worth\n",
    "keeping handy.\n",
    "\n",
    "External resources, in order of how much time they take:\n",
    "\n",
    "- \ud83e\udd98 [Python in 10 minutes (very condensed)](https://www.stavros.io/tutorials/python/)\n",
    "- \ud83d\udc0d [Official Python tutorial \u2014 chapters 3\u20135](https://docs.python.org/3/tutorial/introduction.html)\n",
    "- \ud83d\udc3c [pandas in 10 minutes (official)](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
    "- \ud83d\udcda [Python for Data Analysis (the book)](https://wesmckinney.com/book/) \u2014 free online\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.  Variables\n",
    "\n",
    "A variable is a named box you put a value into. The `=` is **assignment**,\n",
    "not equality. Read it as \"make `name` refer to `value`\".\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "x = 5\n",
    "y = 3\n",
    "total = x + y\n",
    "print(total)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Re-running the cell after changing `x = 5` to `x = 50` gives a different\n",
    "answer. Try it.\n",
    "\n",
    "Variable names can contain letters, digits, and underscores, but can't\n",
    "start with a digit. Names are case-sensitive: `total` and `Total` are\n",
    "different variables. Convention is lowercase `snake_case`:\n",
    "`mean_distance`, not `meanDistance` or `MeanDistance`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.  Strings and numbers\n",
    "\n",
    "A **string** is text in quotes. You can join strings with `+`. You can\n",
    "turn a number into a string with `str()`, and vice-versa with `int()` /\n",
    "`float()`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "species = \"Drosophila melanogaster\"\n",
    "n_flies = 12\n",
    "message = \"We tracked \" + str(n_flies) + \" \" + species + \" males.\"\n",
    "print(message)\n",
    "\n",
    "# A nicer way to build strings \u2014 f-strings (note the leading 'f'):\n",
    "print(f\"We tracked {n_flies} {species} males.\")\n"
   ]
  },
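  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above only shows `str()`. Here is a minimal sketch of going the\n",
    "other way with `int()` and `float()`. The names `count_text`, `count`, and\n",
    "`speed` are made up for illustration and are not part of the project.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "count_text = \"12\"           # text that happens to look like a number\n",
    "count = int(count_text)      # now a whole number\n",
    "print(count + 1)             # 13\n",
    "\n",
    "speed = float(\"0.75\")        # text to a decimal number\n",
    "print(speed * 2)             # 1.5\n",
    "\n",
    "print(str(count) + \" flies\")  # number back to text: 12 flies\n",
    "\n",
    "# int() only accepts whole-number text: int(\"3.5\") raises a ValueError.\n",
    "# Use float(\"3.5\") for text with a decimal point.\n"
   ]
  },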
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.  Lists\n",
    "\n",
    "A list is an ordered collection of things. Square brackets, items\n",
    "separated by commas. You can mix types (but usually shouldn't).\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "machines = [\"ETHOSCOPE_076\", \"ETHOSCOPE_082\", \"ETHOSCOPE_086\"]\n",
    "print(machines[0])         # first item \u2014 Python counts from 0!\n",
    "print(machines[-1])        # last item\n",
    "print(len(machines))       # how many items\n",
    "print(machines + [\"ETHOSCOPE_140\"])  # concatenate (returns a new list)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.  Dictionaries\n",
    "\n",
    "A dictionary maps **keys** to **values**. Curly braces, `key: value`\n",
    "pairs.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "fly = {\"species\": \"Sechellia\", \"trained\": True, \"age_days\": 5}\n",
    "print(fly[\"species\"])\n",
    "print(fly[\"age_days\"])\n",
    "fly[\"alive\"] = False         # add a new key\n",
    "print(fly)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.  Conditions: if / elif / else\n",
    "\n",
    "Compare with `==` (equal), `!=` (not equal), `<`, `>`, `<=`, `>=`.\n",
    "Combine with `and`, `or`, `not`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "distance_px = 42\n",
    "\n",
    "if distance_px < 50:\n",
    "    label = \"close\"\n",
    "elif distance_px < 200:\n",
    "    label = \"medium\"\n",
    "else:\n",
    "    label = \"far\"\n",
    "\n",
    "print(label)\n"
   ]
  },
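  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example above uses one comparison per branch. A small sketch of\n",
    "combining comparisons with `and`, `or`, and `not` follows. The variable\n",
    "`is_trained` is made up for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "distance_px = 42\n",
    "is_trained = True    # hypothetical flag, for illustration only\n",
    "\n",
    "# 'and' needs both sides to be True.\n",
    "if distance_px < 50 and is_trained:\n",
    "    print(\"close, trained male\")\n",
    "\n",
    "# 'or' needs at least one side to be True.\n",
    "if distance_px < 10 or distance_px > 200:\n",
    "    print(\"very close or far\")\n",
    "else:\n",
    "    print(\"somewhere in between\")\n",
    "\n",
    "# 'not' flips True to False and vice versa.\n",
    "print(not is_trained)    # False\n"
   ]
  },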
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.  Loops\n",
    "\n",
    "`for x in collection:` runs the indented block once per item.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "for m in machines:\n",
    "    print(f\"Looking at machine {m}\")\n",
    "\n",
    "# Looping with an index, when you need it:\n",
    "for i, m in enumerate(machines):\n",
    "    print(f\"{i}: {m}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.  Functions\n",
    "\n",
    "A function is a named, reusable chunk of code. `def` declares it. `return`\n",
    "sends a value back to whoever called it.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def fly_age_in_weeks(days):\n",
    "    \"\"\"Return age in weeks given age in days.\"\"\"\n",
    "    return days / 7\n",
    "\n",
    "print(fly_age_in_weeks(14))    # 2.0\n",
    "print(fly_age_in_weeks(5))     # 0.714\u2026\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.  Importing libraries\n",
    "\n",
    "A library is somebody else's code. We use `import` to pull it into our\n",
    "notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import math\n",
    "print(math.sqrt(16))   # 4.0\n",
    "print(math.pi)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9.  Meet pandas\n",
    "\n",
    "Real data is rarely a single number \u2014 it's a **table** with rows and\n",
    "columns (think Excel). `pandas` is the library that handles tables in\n",
    "Python. The two main objects are:\n",
    "\n",
    "- **`Series`** \u2014 a single column with a name.\n",
    "- **`DataFrame`** \u2014 a whole table.\n",
    "\n",
    "By convention we import pandas as `pd`. Always.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read the project's metadata TSV (Tab-Separated Values).\n",
    "# This path is specific to one machine; change it to wherever your copy lives.\n",
    "tsv_path = \"/home/gg/ownCloud/Work/Projects/coding/cupido/all_video_info_merged.tsv\"\n",
    "df = pd.read_csv(tsv_path, sep=\"\\t\")\n",
    "\n",
    "# How big is it?\n",
    "print(f\"Rows: {len(df)}\")\n",
    "print(f\"Columns: {df.shape[1]}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10.  Looking at the table\n",
    "\n",
    "`.head()` shows the first 5 rows and `.tail()` the last 5; pass a number\n",
    "to change that, e.g. `df.head(3)`. `.columns` lists the column names and\n",
    "`.dtypes` shows the type of each column.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.head(3)\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(\"Column names:\")\n",
    "for c in df.columns:\n",
    "    print(f\"  {c}\")\n"
   ]
  },
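  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To try `.tail()`, `.shape`, and `.dtypes` without depending on the project\n",
    "file, here is a sketch on a tiny made-up table. The name `toy` and all of\n",
    "its values are invented for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up values, for illustration only.\n",
    "toy = pd.DataFrame({\n",
    "    \"species\": [\"A\", \"B\", \"A\", \"B\", \"A\", \"B\"],\n",
    "    \"roi\": [1, 2, 3, 4, 5, 6],\n",
    "    \"distance_px\": [10.5, 42.0, 7.25, 88.0, 13.0, 55.5],\n",
    "})\n",
    "\n",
    "print(toy.tail(2))   # the last 2 rows (roi 5 and 6)\n",
    "print(toy.shape)     # (6, 3): 6 rows, 3 columns\n",
    "print(toy.dtypes)    # one type per column: text, whole numbers, decimals\n"
   ]
  },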
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11.  Selecting columns\n",
    "\n",
    "Two main ways to get one column: bracket-indexing (`df[\"name\"]`) or\n",
    "attribute access (`df.name`). The first works for any column name; the\n",
    "second only works if the name is a valid Python identifier (no spaces or\n",
    "punctuation) and doesn't clash with something a DataFrame already has,\n",
    "such as `df.size` or `df.count`. When in doubt, use brackets.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df[\"species\"].head()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.species.value_counts()    # how many rows per species\n"
   ]
  },
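  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attribute access can quietly give you the wrong thing. A sketch with a\n",
    "made-up table `toy` (invented values): one column is named `size`, which\n",
    "is also a built-in DataFrame attribute, and one column has a space in its\n",
    "name.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up values, for illustration only.\n",
    "toy = pd.DataFrame({\"size\": [3, 5, 8], \"age days\": [4, 5, 6]})\n",
    "\n",
    "print(toy[\"size\"])       # the column named \"size\"\n",
    "print(toy.size)          # NOT the column: the number of cells, 3 rows x 2 columns = 6\n",
    "print(toy[\"age days\"])   # a name with a space only works with brackets\n"
   ]
  },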
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.  Selecting multiple columns\n",
    "\n",
    "Pass a **list** of names inside the brackets:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df[[\"machine_name\", \"roi\", \"species\", \"male\"]].head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.  Filtering rows\n",
    "\n",
    "The pattern is `df[condition]`. The condition is a Series of `True`/`False`.\n",
    "Pandas keeps the rows where it's `True`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "trained = df[df[\"male\"] == \"trained\"]\n",
    "print(f\"trained rows: {len(trained)}\")\n",
    "\n",
    "mel_only = df[df[\"species\"] == \"Melanogaster/CS\"]\n",
    "print(f\"Melanogaster/CS rows: {len(mel_only)}\")\n",
    "\n",
    "# Combine conditions with & (and) or | (or), not the words `and` / `or`,\n",
    "# and wrap each condition in parentheses.\n",
    "trained_mel = df[(df[\"male\"] == \"trained\") & (df[\"species\"] == \"Melanogaster/CS\")]\n",
    "print(f\"trained Mel rows: {len(trained_mel)}\")\n"
   ]
  },
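  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above only uses `&`. Here is a sketch of `|` (or) and of `~`,\n",
    "which pandas uses for element-wise \"not\". The table `toy` and its values,\n",
    "including the label `naive`, are invented for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up values, for illustration only.\n",
    "toy = pd.DataFrame({\n",
    "    \"species\": [\"A\", \"B\", \"C\", \"A\"],\n",
    "    \"male\": [\"trained\", \"naive\", \"trained\", \"naive\"],\n",
    "})\n",
    "\n",
    "# | keeps rows where at least one condition is True.\n",
    "a_or_b = toy[(toy[\"species\"] == \"A\") | (toy[\"species\"] == \"B\")]\n",
    "print(len(a_or_b))         # 3\n",
    "\n",
    "# ~ flips every True/False in the condition.\n",
    "not_trained = toy[~(toy[\"male\"] == \"trained\")]\n",
    "print(len(not_trained))    # 2\n"
   ]
  },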
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 14.  Grouping and counting\n",
    "\n",
    "`.groupby(\"col\")` followed by an aggregator like `.size()` or `.mean()`\n",
    "splits the table by the values in that column and computes something per\n",
    "group.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# How many ROIs per (species, training condition)?\n",
    "df.groupby([\"species\", \"male\"]).size()\n"
   ]
  },
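  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`.size()` counts rows, while `.mean()` averages a numeric column within\n",
    "each group. A sketch on a made-up table (`toy` and its numbers are\n",
    "invented for illustration):\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up values, for illustration only.\n",
    "toy = pd.DataFrame({\n",
    "    \"species\": [\"A\", \"A\", \"A\", \"B\", \"B\", \"B\"],\n",
    "    \"distance_px\": [10.0, 20.0, 30.0, 40.0, 50.0, 90.0],\n",
    "})\n",
    "\n",
    "print(toy.groupby(\"species\").size())                  # 3 rows in each group\n",
    "print(toy.groupby(\"species\")[\"distance_px\"].mean())   # A: 20.0, B: 60.0\n"
   ]
  },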
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 15.  Quick plots\n",
    "\n",
    "DataFrames know how to draw themselves. Under the hood it's `matplotlib`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# How many rows per machine?\n",
    "df[\"machine_name\"].value_counts().plot(kind=\"bar\", figsize=(10, 4))\n",
    "plt.title(\"Number of fly-rows per ethoscope machine\")\n",
    "plt.ylabel(\"rows\")\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.  Exercises\n",
    "\n",
    "Don't skip these. They're how you find out what you actually understood.\n",
    "\n",
    "1. How many rows does `df` have where `age` equals `'5-7'`?\n",
    "2. Print the **unique values** of the `memory` column. (Hint: `df[\"memory\"].unique()`)\n",
    "3. How many distinct `(date, machine_name)` pairs are in the dataset?\n",
    "   (Hint: `df.groupby([\"date\", \"machine_name\"]).size().shape` returns a\n",
    "   one-element tuple; the number inside it is the count.)\n",
    "4. Make a bar plot of `species` counts. Which species has the most rows?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 1 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 2 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 3 here\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Try exercise 4 here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cheat sheet\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "df = pd.read_csv(\"file.tsv\", sep=\"\\t\")     # read\n",
    "df.head(); df.tail(); df.shape; df.columns  # peek\n",
    "df[\"col\"]; df[[\"a\", \"b\"]]                    # select\n",
    "df[df[\"col\"] == \"value\"]                     # filter\n",
    "df.groupby(\"col\").size()                     # count per group\n",
    "df.groupby(\"col\")[\"x\"].mean()                # mean of x per group\n",
    "df[\"col\"].value_counts()                     # quick counts\n",
    "df[\"col\"].unique()                           # unique values\n",
    "df[\"new_col\"] = df[\"w\"] * df[\"h\"]            # derived column\n",
    "df.sort_values(\"col\", ascending=False)       # sort\n",
    "df.plot(...)                                 # quick plot\n",
    "```\n",
    "\n",
    "Keep this list open when reading other people's code. Most of pandas is\n",
    "just combinations of these primitives. When you need more, the official\n",
    "[pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)\n",
    "is excellent.\n"
   ]
  }
 ]
}