cupido/notebooks/flies_analysis_simple.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "# Trained vs naïve flies — distance & velocity pipeline\n\nThis notebook is the canonical pipeline the previous student built to ask\n\"do trained males behave differently from naïve males?\" using two\ncomplementary metrics:\n\n1. **Inter-fly distance over time** — does the male stay close to the\n   female (courting) or drift away (giving up)?\n2. **Per-fly maximum velocity** — when each fly is identified across\n   frames, how fast does it move?\n\nIt runs in roughly this order:\n\n| step | what it does |\n|---|---|\n| Load   | pull every (fly, session) trace via `load_roi_data` |\n| Align  | shift each track so `t = 0` is the moment the barrier opened |\n| Area   | compute a baseline body-area to spot frames where the tracker merged two flies into one blob |\n| Distance | per-frame Euclidean distance, with the merged-blob heuristic |\n| Track   | re-identify \"fly 1\" vs \"fly 2\" frame-to-frame (Hungarian assignment) |\n| Velocity | per-fly velocity from those tracked identities |\n| Plot   | trained-vs-naïve mean curves with smoothing |\n| Stats  | t-test + Cohen's d, pre- and post-barrier-opening |\n\n**Important caveat.** Step 2 (Align) needs the **barrier-opening time\nper machine**, which we currently only have for the 2025-07-15 batch\n(`data/metadata/2025_07_15_barrier_opening.csv`). Running this notebook\non the full dataset will silently drop every machine that doesn't\nappear in that file. Annotating the 2024 batch is on the todo list —\nsee `tasks/todo.md`.\n\nThe notebook caches expensive intermediate results to\n`data/processed/*.csv` so re-running is cheap. Set\n`recalculate_distances = True` (or `recalculate_tracking = True`) to\nforce a fresh computation.\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR  = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# ─── The metadata TSV — shared master vs. your personal copy ─────────────\n# DATA_DIR is mounted read-only inside the container, so the shared TSV\n# at SHARED_TSV cannot be edited. Fine for read-only analysis. But if\n# you want to flip `include` flags (or otherwise customize the metadata\n# for your own analysis), copy it to your repo's data/metadata/ ONCE:\n#\n#     $ cp /mnt/data/projects/cupido/all_video_info_merged.tsv ~/cupido/data/metadata/\n#\n# That location is gitignored, so your edits won't pollute the repo and\n# other users are unaffected. The auto-select line below picks up your\n# personal copy automatically once it's there.\nSHARED_TSV   = DATA_DIR / \"all_video_info_merged.tsv\"\nPERSONAL_TSV = REPO_ROOT / \"data\" / \"metadata\" / \"all_video_info_merged.tsv\"\nMETADATA_TSV = PERSONAL_TSV if PERSONAL_TSV.exists() else SHARED_TSV\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(),  f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_METADATA, DATA_PROCESSED, FIGURES)\n# from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Data directory: {DATA_DIR}\")\nprint(f\"Repo root:      {REPO_ROOT}\")\nprint(f\"Metadata TSV:   {METADATA_TSV}  ({'personal' if METADATA_TSV == PERSONAL_TSV else 'shared (read-only)'})\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version:  {np.__version__}\")\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 1. Load the tracking data\n\n`load_roi_data` opens every tracking DB referenced by the merged TSV\nand returns one big DataFrame stamped with experimental metadata\n(species, male/naïve, age, …). The TSV has a boolean `include` column\n(default `True`); the loader skips rows where it's `False`. To customize\nwhich rows you want included, see the personal-copy comment block in\nthe setup cell above — you edit your **own** copy of the TSV, not the\nshared one.\n\nIf you only want a subset, pre-filter `meta` before passing it in\n(e.g. `load_roi_data(meta[meta.species == 'Melanogaster/CS'])`).\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Load the metadata explicitly from METADATA_TSV (no hidden defaults), then\n# hand it to load_roi_data which opens each referenced tracking DB.\nfrom load_roi_data import load_roi_data\n\nmeta = pd.read_csv(METADATA_TSV, sep=\"\\t\")\nprint(f\"metadata rows: {len(meta)}\")\n\ndata = load_roi_data(meta)\ntrained_data   = data[data['male'] == 'trained'].copy()\nuntrained_data = data[data['male'] == 'naive'].copy()\n\n# Reason: a few helper functions further down (e.g. calculate_distances_with_area,\n# the identity-tracking step) expect a 'group' column to label rows. Derive\n# it here so we don't have to thread male/naive through manually everywhere.\ntrained_data['group']   = 'trained'\nuntrained_data['group'] = 'naive'\n\nprint(f\"all data shape:    {data.shape}\")\nprint(f\"Trained data:      {trained_data.shape}\")\nprint(f\"Naive data:        {untrained_data.shape}\")\nprint(f\"Columns:           {list(trained_data.columns)}\")\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 2. Align tracks to the barrier-opening time\n\nEvery video starts at its own arbitrary moment. What matters is the\n**barrier opening** — the experimenter physically lifts a divider, the\nsexes meet, and the courtship clock starts. We define `aligned_time = 0`\nas that moment, so curves from different machines can be averaged.\n\nPer-machine opening times are hand-annotated in\n`2025_07_15_barrier_opening.csv` (one row per ethoscope machine,\n`opening_time` in seconds from the start of the video). Any machine\nnot in that file is silently dropped from the aligned data — its rows\nget `aligned_time = NaN`.\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Load barrier opening data\nbarrier_data = pd.read_csv(DATA_METADATA / '2025_07_15_barrier_opening.csv')\nbarrier_data['opening_time_ms'] = barrier_data['opening_time'] * 1000\n\n# Create a dictionary mapping machine_name to opening time\nopening_times = dict(zip(barrier_data['machine'], barrier_data['opening_time_ms']))\nprint(\"Barrier opening times:\")\nprint(barrier_data)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "def align_to_opening_time(df, opening_times):\n    \"\"\"Shift each machine's track so t=0 is its barrier-opening moment.\n\n    Returns a DataFrame with an extra 'aligned_time' column. Rows whose\n    machine is not in `opening_times` are dropped (their aligned_time\n    would be NaN — meaningless).\n    \"\"\"\n    df_aligned = df.copy()\n    df_aligned['aligned_time'] = np.nan\n\n    for machine in df['machine_name'].unique():\n        if machine in opening_times:\n            opening_time = opening_times[machine]\n            mask = df['machine_name'] == machine\n            df_aligned.loc[mask, 'aligned_time'] = df.loc[mask, 't'] - opening_time\n\n    df_aligned = df_aligned.dropna(subset=['aligned_time'])\n    return df_aligned\n\n\ntrained_aligned   = align_to_opening_time(trained_data,   opening_times)\nuntrained_aligned = align_to_opening_time(untrained_data, opening_times)\n\nprint(f\"Trained aligned data shape:   {trained_aligned.shape}\")\nprint(f\"Untrained aligned data shape: {untrained_aligned.shape}\")\n\n# Reason: if NO machines in the loaded data have barrier-opening\n# annotations, every downstream step quietly produces empty DataFrames\n# and we crash with a confusing KeyError 5 cells later. Fail loudly here\n# with the actionable message instead.\nloaded_machines  = set(data['machine_name'].unique())\nknown_machines   = set(opening_times)\nmissing_machines = sorted(loaded_machines - known_machines)\nif trained_aligned.empty and untrained_aligned.empty:\n    raise ValueError(\n        \"Alignment produced no rows: none of the loaded machines have an entry \"\n        f\"in 2025_07_15_barrier_opening.csv.\\n\"\n        f\"  loaded machines: {sorted(loaded_machines)}\\n\"\n        f\"  machines with barrier_opening: {sorted(known_machines)}\\n\"\n        \"Either filter `meta` to a subset that overlaps with the barrier-opening \"\n        \"CSV, or annotate barrier-opening times for the machines you want to analyze.\"\n    )\nif missing_machines:\n    print(\n        f\"⚠ {len(missing_machines)} loaded machine(s) have no barrier_opening \"\n        f\"entry and were dropped: {missing_machines}\"\n    )\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 3. Body-area baseline (used as a \"two flies merged\" detector)\n\nAt any given time the tracker may detect one fly or two flies in a\nROI. The interesting case is \"we expected two but only see one\" —\nwhich usually means the two flies are touching/overlapping and the\ntracker fused them into a single, **larger** bounding box.\n\nTo distinguish \"real single-fly frames\" (one fly hidden, true distance\nunknown) from \"merged-blob frames\" (flies effectively touching, distance\n≈ 0), we need a baseline: how big is one fly's bounding box on average?\nWe estimate that from frames where we *do* see two flies — there the\nboxes are guaranteed to be one-fly-each. The median of those areas is\nour reference value.\n"
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Median area size for time points with two flies: 1749.00\n"
     ]
    }
   ],
   "source": [
    "def calculate_areas_with_two_flies(df):\n",
    "    \"\"\"Calculate median area size for time points with two flies\"\"\"\n",
    "    # Calculate area for each row\n",
    "    df['area'] = df['w'] * df['h']\n",
    "    \n",
    "    # Group by machine_name, ROI, and time to count flies per time point\n",
    "    fly_counts = df.groupby(['machine_name', 'ROI', 't']).size().reset_index(name='fly_count')\n",
    "    \n",
    "    # Filter for time points with exactly 2 flies\n",
    "    two_fly_times = fly_counts[fly_counts['fly_count'] == 2]\n",
    "    \n",
    "    # Merge back with original data to get areas for these time points\n",
    "    two_fly_data = pd.merge(df, two_fly_times[['machine_name', 'ROI', 't']], \n",
    "                           on=['machine_name', 'ROI', 't'])\n",
    "    \n",
    "    # Calculate median area\n",
    "    median_area = two_fly_data['area'].median()\n",
    "    \n",
    "    return median_area, two_fly_data\n",
    "\n",
    "# Combine trained and untrained data for area calculation\n",
    "combined_data = pd.concat([trained_aligned, untrained_aligned], ignore_index=True)\n",
    "\n",
    "# Calculate median area for time points with two flies\n",
    "median_area, two_fly_data = calculate_areas_with_two_flies(combined_data)\n",
    "print(f\"Median area size for time points with two flies: {median_area:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 4. Per-frame inter-fly distance\n\nFor each `(machine, ROI, aligned_time)`:\n\n- **2 detections** → Euclidean distance between them (the obvious case).\n- **1 detection, large box** (area > 1.5× the median) → treat as\n  distance 0. The flies are touching, the tracker fused them.\n- **1 detection, normal box** → `NaN`. One fly is genuinely lost\n  (occluded, off-frame); we don't know the distance and shouldn't pretend.\n\nThis is computed once per group (trained/naïve) and cached to CSV. Set\n`recalculate_distances = True` to force a re-computation after changing\nany of the inputs.\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "recalculate_distances = False  # Set to True to force a fresh computation\n\ntrained_dist_file   = DATA_PROCESSED / 'trained_distances_aligned.csv'\nuntrained_dist_file = DATA_PROCESSED / 'untrained_distances_aligned.csv'\n\n# Reason: a previous run that produced empty results saved 1-byte CSVs\n# here (just a newline). Treat any cache that doesn't have the expected\n# schema as invalid and recompute, instead of crashing later.\nEXPECTED_COLS = {'machine_name', 'ROI', 'aligned_time', 'distance', 'n_flies'}\n\ndef _read_cache(path):\n    try:\n        df = pd.read_csv(path)\n    except (pd.errors.EmptyDataError, FileNotFoundError):\n        return None\n    if not EXPECTED_COLS.issubset(df.columns) or df.empty:\n        return None\n    return df\n\ncached_trained   = _read_cache(trained_dist_file)   if not recalculate_distances else None\ncached_untrained = _read_cache(untrained_dist_file) if not recalculate_distances else None\n\nif cached_trained is not None and cached_untrained is not None:\n    print(\"Loading pre-calculated distance data from CSV files...\")\n    trained_distances   = cached_trained\n    untrained_distances = cached_untrained\n    print(f\"Trained distances shape:   {trained_distances.shape}\")\n    print(f\"Untrained distances shape: {untrained_distances.shape}\")\nelse:\n    print(\"Calculating distances from scratch...\")\n    def calculate_distances_with_area(df, median_area_threshold):\n        \"\"\"Distance between the two flies, with merged-blob heuristic.\"\"\"\n        df['area'] = df['w'] * df['h']\n        results = []\n\n        for (machine_name, roi, t), group in df.groupby(['machine_name', 'ROI', 'aligned_time']):\n            group = group.sort_values('id').reset_index(drop=True)\n\n            if len(group) >= 2:\n                fly1 = group.iloc[0]\n                fly2 = group.iloc[1]\n                distance = euclidean([fly1['x'], fly1['y']], [fly2['x'], fly2['y']])\n                results.append({\n                    'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n                    'distance': distance, 'n_flies': len(group),\n                    'area_fly1': fly1['area'], 'area_fly2': fly2['area'],\n                    'group': fly1['group'],\n                })\n            elif len(group) == 1:\n                fly = group.iloc[0]\n                area = fly['area']\n                distance = 0.0 if area > 1.5 * median_area_threshold else np.nan\n                results.append({\n                    'machine_name': machine_name, 'ROI': roi, 'aligned_time': t,\n                    'distance': distance, 'n_flies': 1,\n                    'area_fly1': area, 'area_fly2': np.nan,\n                    'group': fly['group'],\n                })\n\n        return pd.DataFrame(results)\n\n    trained_distances   = calculate_distances_with_area(trained_aligned,   median_area)\n    untrained_distances = calculate_distances_with_area(untrained_aligned, median_area)\n\n    print(f\"Trained distances shape:   {trained_distances.shape}\")\n    print(f\"Untrained distances shape: {untrained_distances.shape}\")\n\n    # Reason: only persist cache if we actually have data — saving an\n    # empty DataFrame writes a 1-byte file that bricks the next run.\n    if not trained_distances.empty and not untrained_distances.empty:\n        DATA_PROCESSED.mkdir(parents=True, exist_ok=True)\n        trained_distances.to_csv(trained_dist_file, index=False)\n        untrained_distances.to_csv(untrained_dist_file, index=False)\n        print(\"Distance data saved to CSV files\")\n    else:\n        print(\"⚠ skipping cache save — one of the result DataFrames is empty\")\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 5. Trained vs naïve — average distance over the whole session\n\nFor each `aligned_time`, average the inter-fly distance across all\n(machine, ROI) tracks in the group, then smooth with a rolling mean\n(50-point window) to dampen frame-to-frame noise.\n\nA clearly-shifted curve = the groups behave differently. The vertical\ndashed line marks `t = 0` (barrier opening).\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Remove NaN distances for plotting\ntrained_clean = trained_distances.dropna(subset=['distance'])\nuntrained_clean = untrained_distances.dropna(subset=['distance'])\n\n# Calculate average distance over time for each group\ntrained_avg = trained_clean.groupby('aligned_time')['distance'].mean()\nuntrained_avg = untrained_clean.groupby('aligned_time')['distance'].mean()\n\n# Apply smoothing using a rolling average\nwindow_size = 50\ntrained_smooth = trained_avg.rolling(window=window_size, center=True).mean()\nuntrained_smooth = untrained_avg.rolling(window=window_size, center=True).mean()\n\n# Create the plot\nplt.figure(figsize=(15, 8))\n\nplt.plot(trained_smooth.index/1000, trained_smooth.values, \n         label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth.index/1000, untrained_smooth.values, \n         label='Untrained (smoothed)', color='red', linewidth=2)\n\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies Over Entire Experiment')\nplt.legend()\nplt.grid(True, alpha=0.3)\n\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_entire_experiment.png', dpi=300, bbox_inches='tight')\nplt.show()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 6. Same plot, zoomed to the first 5 minutes after opening\n\nThe interesting behavioural signal is concentrated in the moments\n**right after** the barrier lifts (the male's first reaction to the\nfemale). Re-plot with `xlim` cropped to ±150 s around opening and\nending at +300 s.\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Filter data to +300 seconds\ntrained_filtered = trained_clean[trained_clean['aligned_time'] <= 300000]\nuntrained_filtered = untrained_clean[untrained_clean['aligned_time'] <= 300000]\n\n# Calculate average distance over time for each group\ntrained_avg_300 = trained_filtered.groupby('aligned_time')['distance'].mean()\nuntrained_avg_300 = untrained_filtered.groupby('aligned_time')['distance'].mean()\n\n# Apply smoothing using a rolling average\ntrained_smooth_300 = trained_avg_300.rolling(window=window_size, center=True).mean()\nuntrained_smooth_300 = untrained_avg_300.rolling(window=window_size, center=True).mean()\n\n# Create the plot\nplt.figure(figsize=(15, 8))\n\nplt.plot(trained_smooth_300.index/1000, trained_smooth_300.values, \n         label='Trained (smoothed)', color='blue', linewidth=2)\nplt.plot(untrained_smooth_300.index/1000, untrained_smooth_300.values, \n         label='Untrained (smoothed)', color='red', linewidth=2)\n\nplt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\nplt.xlabel('Time (seconds relative to barrier opening)')\nplt.ylabel('Average Distance')\nplt.title('Average Distance Between Flies (First 300 Seconds Post-Opening)')\nplt.legend()\nplt.grid(True, alpha=0.3)\nplt.xlim(-150, 300)\n\nplt.tight_layout()\nplt.savefig(FIGURES / 'avg_distance_300_seconds.png', dpi=300, bbox_inches='tight')\nplt.show()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 7. Identity tracking → per-fly velocity\n\nInter-fly distance is *symmetric* — it doesn't matter which detection\nis \"fly 1\" vs \"fly 2\". **Velocity is not.** For velocity to mean\nanything we need to follow the same fly across consecutive frames.\n\nThe tracker only labels detections as \"id 1, id 2\" within a single\nframe; those ids can swap between consecutive frames. To stitch them\ntogether we use the **Hungarian algorithm** (`scipy.optimize.linear_sum_assignment`):\nat each `t → t+1` step, pair detections so the total displacement is\nminimised. That's the standard light-touch approach for short, simple\nmulti-object tracking — see the\n[Wikipedia entry](https://en.wikipedia.org/wiki/Hungarian_algorithm)\nfor the maths.\n\nOnce identities are stable across time, velocity is just `Δposition / Δt`\nper fly. We then compute the **maximum velocity within a 10-second\nsliding window** — that's a coarse \"is this fly active right now?\" signal.\n"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "def track_fly_identities_vectorized(df):\n    \"\"\"Efficiently track fly identities across frames using vectorized operations\"\"\"\n    from scipy.optimize import linear_sum_assignment\n    \n    df = df.sort_values(['machine_name', 'ROI', 'aligned_time']).reset_index(drop=True)\n    tracked_flies = []\n    \n    for (machine_name, roi), group in df.groupby(['machine_name', 'ROI']):\n        group = group.sort_values('aligned_time').reset_index(drop=True)\n        time_points = sorted(group['aligned_time'].unique())\n        \n        if len(time_points) < 2:\n            continue\n            \n        fly_id_counter = 0\n        fly_identities = {}\n        \n        first_time_flies = group[group['aligned_time'] == time_points[0]]\n        for i in range(len(first_time_flies)):\n            fly_identities[(time_points[0], i)] = fly_id_counter\n            fly_id_counter += 1\n        \n        for t_idx in range(len(time_points) - 1):\n            t1, t2 = time_points[t_idx], time_points[t_idx + 1]\n            flies_t1 = group[group['aligned_time'] == t1].reset_index(drop=True)\n            flies_t2 = group[group['aligned_time'] == t2].reset_index(drop=True)\n            \n            if len(flies_t1) == 0 or len(flies_t2) == 0:\n                continue\n                \n            if len(flies_t1) == len(flies_t2):\n                pos_t1 = flies_t1[['x', 'y']].values\n                pos_t2 = flies_t2[['x', 'y']].values\n                distances = np.sqrt(np.sum((pos_t1[:, np.newaxis] - pos_t2[np.newaxis, :])**2, axis=2))\n                \n                row_ind, col_ind = linear_sum_assignment(distances)\n                \n                for i, j in zip(row_ind, col_ind):\n                    prev_id = fly_identities[(t1, i)]\n                    fly_identities[(t2, j)] = prev_id\n            else:\n                min_flies = min(len(flies_t1), len(flies_t2))\n                max_flies = max(len(flies_t1), len(flies_t2))\n                \n                for i in range(min_flies):\n                    prev_id = fly_identities[(t1, i)]\n                    fly_identities[(t2, i)] = prev_id\n                \n                for i in range(min_flies, max_flies):\n                    fly_identities[(t2, i)] = fly_id_counter\n                    fly_id_counter += 1\n        \n        for t in time_points:\n            flies_at_t = group[group['aligned_time'] == t].reset_index(drop=True)\n            for i, (idx, fly) in enumerate(flies_at_t.iterrows()):\n                if (t, i) in fly_identities:\n                    fly_copy = fly.copy()\n                    fly_copy['fly_id'] = fly_identities[(t, i)]\n                    tracked_flies.append(fly_copy)\n    \n    return pd.DataFrame(tracked_flies)\n\n# Check if pre-calculated tracked identity files exist\nrecalculate_tracking = False  # Set to True if you want to recalculate\n\ntrained_tracked_file = DATA_PROCESSED / 'trained_tracked.csv'\nuntrained_tracked_file = DATA_PROCESSED / 'untrained_tracked.csv'\n\nif not recalculate_tracking and trained_tracked_file.exists() and untrained_tracked_file.exists():\n    print(\"Loading pre-calculated tracked identity data from CSV files...\")\n    trained_tracked = pd.read_csv(trained_tracked_file)\n    untrained_tracked = pd.read_csv(untrained_tracked_file)\n    print(f\"Trained tracked data shape: {trained_tracked.shape}\")\n    print(f\"Untrained tracked data shape: {untrained_tracked.shape}\")\nelse:\n    print(\"Tracking fly identities (this may take a while...)\")\n    trained_tracked = track_fly_identities_vectorized(trained_aligned)\n    untrained_tracked = track_fly_identities_vectorized(untrained_aligned)\n    \n    trained_tracked.to_csv(trained_tracked_file, index=False)\n    untrained_tracked.to_csv(untrained_tracked_file, index=False)\n    print(\"Tracked identity data saved to CSV files\")\n    \n    print(f\"Trained tracked data shape: {trained_tracked.shape}\")\n    print(f\"Untrained tracked data shape: {untrained_tracked.shape}\")"
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating velocities based on tracked identities...\n",
      "Trained velocity data shape: (1160573, 15)\n",
      "Untrained velocity data shape: (1125288, 15)\n"
     ]
    }
   ],
   "source": [
    "def calculate_velocity_with_identities_vectorized(df):\n",
    "    \"\"\"Efficiently calculate velocity for each fly based on tracked identities\"\"\"\n",
    "    if df.empty:\n",
    "        return pd.DataFrame()\n",
    "    \n",
    "    # Sort by fly_id and time\n",
    "    df = df.sort_values(['machine_name', 'ROI', 'fly_id', 'aligned_time']).reset_index(drop=True)\n",
    "    \n",
    "    # Calculate velocities using groupby and vectorized operations\n",
    "    def calculate_group_velocity(group):\n",
    "        if len(group) < 2:\n",
    "            return pd.DataFrame()\n",
    "            \n",
    "        # Sort by time\n",
    "        group = group.sort_values('aligned_time').reset_index(drop=True)\n",
    "        \n",
    "        # Calculate differences using vectorized operations\n",
    "        time_diff = group['aligned_time'].diff() / 1000.0  # Convert ms to seconds\n",
    "        x_diff = group['x'].diff()\n",
    "        y_diff = group['y'].diff()\n",
    "        \n",
    "        # Calculate Euclidean distance (in pixels)\n",
    "        distance = np.sqrt(x_diff**2 + y_diff**2)\n",
    "        \n",
    "        # Calculate velocity (pixels per second)\n",
    "        velocity = distance / time_diff\n",
    "        \n",
    "        # Add to results (skip first row which has NaN velocity)\n",
    "        result = group.iloc[1:].copy()\n",
    "        result['velocity'] = velocity.iloc[1:].values\n",
    "        \n",
    "        return result\n",
    "    \n",
    "    # Apply the function to each group\n",
    "    velocity_groups = []\n",
    "    for (machine_name, roi, fly_id), group in df.groupby(['machine_name', 'ROI', 'fly_id']):\n",
    "        velocity_group = calculate_group_velocity(group)\n",
    "        if not velocity_group.empty:\n",
    "            velocity_groups.append(velocity_group)\n",
    "    \n",
    "    if velocity_groups:\n",
    "        return pd.concat(velocity_groups, ignore_index=True)\n",
    "    else:\n",
    "        return pd.DataFrame()\n",
    "\n",
    "# Calculate velocities for both groups\n",
    "print(\"Calculating velocities based on tracked identities...\")\n",
    "trained_velocity = calculate_velocity_with_identities_vectorized(trained_tracked)\n",
    "untrained_velocity = calculate_velocity_with_identities_vectorized(untrained_tracked)\n",
    "\n",
    "print(f\"Trained velocity data shape: {trained_velocity.shape}\")\n",
    "print(f\"Untrained velocity data shape: {untrained_velocity.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Analyze maximal velocity over a moving window of 10 seconds\ndef calculate_max_velocity_window_vectorized(df, window_size_seconds=10):\n    \"\"\"Efficiently calculate maximal velocity over a moving window\"\"\"\n    df_clean = df.dropna(subset=['velocity'])\n    \n    if df_clean.empty:\n        return pd.DataFrame()\n        \n    window_size_ms = window_size_seconds * 1000\n    results = []\n    \n    for (machine_name, roi), group in df_clean.groupby(['machine_name', 'ROI']):\n        group = group.sort_values('aligned_time').reset_index(drop=True)\n        \n        if len(group) == 0:\n            continue\n            \n        times = group['aligned_time'].values\n        velocities = group['velocity'].values\n        \n        for i, current_time in enumerate(times):\n            window_end = current_time + window_size_ms\n            window_mask = (times >= current_time) & (times < window_end)\n            window_velocities = velocities[window_mask]\n            \n            if len(window_velocities) > 0:\n                max_velocity = np.max(window_velocities)\n                results.append({\n                    'machine_name': machine_name,\n                    'ROI': roi,\n                    'aligned_time': current_time,\n                    'max_velocity': max_velocity,\n                    'group': group.iloc[i]['group']\n                })\n    \n    return pd.DataFrame(results)\n\n# Calculate max velocity over 10-second windows\ntrained_max_velocity = calculate_max_velocity_window_vectorized(trained_velocity, 10)\nuntrained_max_velocity = calculate_max_velocity_window_vectorized(untrained_velocity, 10)\n\nprint(f\"Trained max velocity data shape: {trained_max_velocity.shape}\")\nprint(f\"Untrained max velocity data shape: {untrained_max_velocity.shape}\")\n\n# Save velocity data to CSV\ntrained_max_velocity.to_csv(DATA_PROCESSED / 'trained_max_velocity.csv', index=False)\nuntrained_max_velocity.to_csv(DATA_PROCESSED / 'untrained_max_velocity.csv', index=False)\nprint(\"Max velocity data saved to CSV files\")"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Plot averaged max velocity over time for each group\n# Focus on the time period between 50 and 200 seconds where differences are largest\ntrained_velocity_filtered = trained_max_velocity[\n    (trained_max_velocity['aligned_time'] >= 50000) & \n    (trained_max_velocity['aligned_time'] <= 200000)\n]\nuntrained_velocity_filtered = untrained_max_velocity[\n    (untrained_max_velocity['aligned_time'] >= 50000) & \n    (untrained_max_velocity['aligned_time'] <= 200000)\n]\n\nif not trained_velocity_filtered.empty and not untrained_velocity_filtered.empty:\n    trained_velocity_avg = trained_velocity_filtered.groupby('aligned_time')['max_velocity'].mean()\n    untrained_velocity_avg = untrained_velocity_filtered.groupby('aligned_time')['max_velocity'].mean()\n\n    velocity_window_size = 30\n    trained_velocity_smooth = trained_velocity_avg.rolling(window=velocity_window_size, center=True).mean()\n    untrained_velocity_smooth = untrained_velocity_avg.rolling(window=velocity_window_size, center=True).mean()\n\n    plt.figure(figsize=(15, 8))\n\n    plt.plot(trained_velocity_smooth.index/1000, trained_velocity_smooth.values, \n             label='Trained (smoothed)', color='blue', linewidth=2)\n    plt.plot(untrained_velocity_smooth.index/1000, untrained_velocity_smooth.values, \n             label='Untrained (smoothed)', color='red', linewidth=2)\n\n    plt.axvline(x=0, color='black', linestyle='--', alpha=0.7, label='Barrier Opening')\n\n    plt.xlabel('Time (seconds relative to barrier opening)')\n    plt.ylabel('Average Max Velocity (pixels/second)')\n    plt.title('Average Max Velocity Over 10-Second Windows (50-200 Seconds)')\n    plt.legend()\n    plt.grid(True, alpha=0.3)\n    plt.xlim(50, 200)\n\n    plt.tight_layout()\n    plt.savefig(FIGURES / 'avg_max_velocity_50_200_seconds.png', dpi=300, bbox_inches='tight')\n    plt.show()\nelse:\n    print(\"Not enough data to plot velocity differences\")"
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== MAX VELOCITY STATISTICS (50-200 seconds) ===\n",
      "Trained flies:\n",
      "  Mean max velocity: 6120.19 pixels/second\n",
      "  Std max velocity: 2913.85 pixels/second\n",
      "  Median max velocity: 7426.51 pixels/second\n",
      "\n",
      "Untrained flies:\n",
      "  Mean max velocity: 5710.33 pixels/second\n",
      "  Std max velocity: 2784.93 pixels/second\n",
      "  Median max velocity: 6876.77 pixels/second\n",
      "\n",
      "Statistical comparison (trained vs untrained):\n",
      "  T-statistic: 43.8194\n",
      "  P-value: 0.00e+00\n",
      "  Cohen's d: 0.1438\n"
     ]
    }
   ],
   "source": [
    "# Summary statistics for max velocity in the 50-200 second window\n",
    "if not trained_velocity_filtered.empty and not untrained_velocity_filtered.empty:\n",
    "    print(\"=== MAX VELOCITY STATISTICS (50-200 seconds) ===\")\n",
    "    trained_velocity_50_200 = trained_velocity_filtered['max_velocity']\n",
    "    untrained_velocity_50_200 = untrained_velocity_filtered['max_velocity']\n",
    "\n",
    "    print(f\"Trained flies:\")\n",
    "    print(f\"  Mean max velocity: {trained_velocity_50_200.mean():.2f} pixels/second\")\n",
    "    print(f\"  Std max velocity: {trained_velocity_50_200.std():.2f} pixels/second\")\n",
    "    print(f\"  Median max velocity: {trained_velocity_50_200.median():.2f} pixels/second\")\n",
    "\n",
    "    print(f\"\\nUntrained flies:\")\n",
    "    print(f\"  Mean max velocity: {untrained_velocity_50_200.mean():.2f} pixels/second\")\n",
    "    print(f\"  Std max velocity: {untrained_velocity_50_200.std():.2f} pixels/second\")\n",
    "    print(f\"  Median max velocity: {untrained_velocity_50_200.median():.2f} pixels/second\")\n",
    "\n",
    "    # Statistical test\n",
    "    if len(trained_velocity_50_200) > 1 and len(untrained_velocity_50_200) > 1:\n",
    "        t_stat_vel, p_val_vel = stats.ttest_ind(trained_velocity_50_200, untrained_velocity_50_200)\n",
    "        cohens_d_vel = (trained_velocity_50_200.mean() - untrained_velocity_50_200.mean()) / \\\n",
    "            np.sqrt(((len(trained_velocity_50_200)-1)*trained_velocity_50_200.var() + \\\n",
    "                     (len(untrained_velocity_50_200)-1)*untrained_velocity_50_200.var()) / \\\n",
    "                    (len(trained_velocity_50_200) + len(untrained_velocity_50_200) - 2))\n",
    "\n",
    "        print(f\"\\nStatistical comparison (trained vs untrained):\")\n",
    "        print(f\"  T-statistic: {t_stat_vel:.4f}\")\n",
    "        print(f\"  P-value: {p_val_vel:.2e}\")\n",
    "        print(f\"  Cohen's d: {cohens_d_vel:.4f}\")\n",
    "    else:\n",
    "        print(\"\\nNot enough data for statistical test\")\n",
    "else:\n",
    "    print(\"Not enough data for velocity statistics\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Summary statistics\n\nIndependent t-test on **post-opening** distances, plus\n[Cohen's d](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) for\neffect size. P-values from large samples can be tiny even when the\neffect is small — always read p-value and Cohen's d together.\n"
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== SUMMARY STATISTICS ===\n",
      "Median area size for two-fly detections: 1749.00\n",
      "\n",
      "Pre-opening period (t < 0):\n",
      "  Trained mean distance: 156.05\n",
      "  Untrained mean distance: 147.69\n",
      "\n",
      "Post-opening period (t > 0):\n",
      "  Trained mean distance: 72.60\n",
      "  Untrained mean distance: 64.00\n",
      "\n",
      "Post-opening comparison (trained vs untrained):\n",
      "  T-statistic: 30.2455\n",
      "  P-value: 9.57e-201\n",
      "  Cohen's d: 0.0908\n"
     ]
    }
   ],
   "source": [
    "print(\"=== SUMMARY STATISTICS ===\")\n",
    "print(f\"Median area size for two-fly detections: {median_area:.2f}\")\n",
    "\n",
    "print(\"\\nPre-opening period (t < 0):\")\n",
    "trained_pre = trained_clean[trained_clean['aligned_time'] < 0]['distance']\n",
    "untrained_pre = untrained_clean[untrained_clean['aligned_time'] < 0]['distance']\n",
    "print(f\"  Trained mean distance: {trained_pre.mean():.2f}\")\n",
    "print(f\"  Untrained mean distance: {untrained_pre.mean():.2f}\")\n",
    "\n",
    "print(\"\\nPost-opening period (t > 0):\")\n",
    "trained_post = trained_clean[trained_clean['aligned_time'] > 0]['distance']\n",
    "untrained_post = untrained_clean[untrained_clean['aligned_time'] > 0]['distance']\n",
    "print(f\"  Trained mean distance: {trained_post.mean():.2f}\")\n",
    "print(f\"  Untrained mean distance: {untrained_post.mean():.2f}\")\n",
    "\n",
    "# Statistical test\n",
    "t_stat, p_val = stats.ttest_ind(trained_post, untrained_post)\n",
    "cohens_d = (trained_post.mean() - untrained_post.mean()) / np.sqrt(((len(trained_post)-1)*trained_post.var() + (len(untrained_post)-1)*untrained_post.var()) / (len(trained_post) + len(untrained_post) - 2))\n",
    "\n",
    "print(f\"\\nPost-opening comparison (trained vs untrained):\")\n",
    "print(f\"  T-statistic: {t_stat:.4f}\")\n",
    "print(f\"  P-value: {p_val:.2e}\")\n",
    "print(f\"  Cohen's d: {cohens_d:.4f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}