
The Bimodal Hypothesis: A Guide to the Next Phase of Analysis

Project: Cupido -- Drosophila Social Interaction Tracking
Date: 2026-03-05
Purpose: Tutorial-level guide for a student taking over the analysis pipeline


Table of Contents

  1. Background: The Assay
  2. Why the Current Aggregate Analysis Fails
  3. The Bimodal Hypothesis
  4. The Methodology Shift: Per-ROI Summary Statistics
  5. Step-by-Step Analysis Code
  6. Expected Results and Interpretation
  7. Statistical Considerations
  8. References

1. Background: The Assay

This project uses ethoscope-based tracking to study social behavior in Drosophila melanogaster. The experimental setup is a barrier-opening assay:

  • Each recording chamber (ROI) contains two flies separated by a physical barrier.
  • At a defined time point, the barrier is removed, allowing the flies to interact freely.
  • We measure the distance between the two flies over time, along with velocity and other behavioral features.

The core experimental question: Do trained flies (those that have undergone a conditioning protocol) behave differently from untrained flies after the barrier is removed?

Data Overview

  • Machines: 3 ethoscopes (076, 145, 268), each running multiple recording sessions
  • Recording sessions: 5 total (076 ran twice, 145 ran twice, 268 ran once)
  • ROIs per session: 6 (containing pairs of flies)
  • Total ROIs: 30 across all sessions. However, Machine 139 has metadata entries for 6 ROIs with no corresponding tracking data (no tracking DB, no barrier_opening entry), leaving 24 usable ROIs -- approximately 12 trained and 12 untrained, split across machines and sessions. The exact split per session varies, so check the metadata.
  • Note on true N: The metadata file (data/metadata/2025_07_15_metadata_fixed.csv) lists every ROI and its group assignment, and the barrier opening file (data/metadata/2025_07_15_barrier_opening.csv) lists 5 sessions across 3 machines. After loading the aligned data, count unique (machine_name, ROI) combinations per group to determine your actual N. The existing project documentation refers to N=18 per group, which is in tension with the 24-usable-ROI count above -- verify the number directly from your data before running any test.

The primary data files are:

| File | Description |
| --- | --- |
| data/processed/trained_distances_aligned.csv | Distance data for trained ROIs, aligned to barrier opening |
| data/processed/untrained_distances_aligned.csv | Distance data for untrained ROIs, aligned to barrier opening |
| data/processed/trained_max_velocity.csv | Max velocity data for trained ROIs |
| data/processed/untrained_max_velocity.csv | Max velocity data for untrained ROIs |
Each CSV has columns: machine_name, ROI, aligned_time, distance (or max_velocity), n_flies, area_fly1, area_fly2, group.

The aligned_time column is in milliseconds, with the barrier opening at t = 0. Negative values are pre-opening; positive values are post-opening.
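The column layout and the millisecond convention can be exercised on a tiny synthetic table before touching the real CSVs. A sketch with made-up values -- only the column names and the t = 0 convention come from the files described above; the machine name is a placeholder:

```python
import pandas as pd

# Made-up rows following the documented schema; the real data lives in
# data/processed/*_distances_aligned.csv.
df = pd.DataFrame({
    "machine_name": ["076"] * 4,         # placeholder machine name
    "ROI": [1, 1, 1, 1],
    "aligned_time": [-80, -40, 40, 80],  # ms; barrier opens at t = 0
    "distance": [120.0, 118.5, 95.2, 60.1],
    "group": ["trained"] * 4,
})

# Negative aligned_time = pre-opening, positive = post-opening
pre = df[df["aligned_time"] < 0]
post = df[df["aligned_time"] > 0]
print(len(pre), len(post))  # -> 2 2
```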


2. Why the Current Aggregate Analysis Fails

2.1 The Numbers

The current analysis (see docs/analysis_summary.md and scripts/statistical_tests.py) pools all time-point observations from all ROIs into one giant distribution per group and runs a t-test:

| Comparison | Cohen's d | p-value | What it means |
| --- | --- | --- | --- |
| Post-opening: trained vs untrained | 0.09 | ~1e-203 | Tiny effect, astronomically "significant" |
| Pre-opening: trained vs untrained | 0.07 | ~1e-5 | Tiny effect before the barrier even opens |
| Velocity differences | ~0.14 | very small | Still small |

This is a textbook case of statistical significance without practical significance.

2.2 What Cohen's d = 0.09 Actually Means

Cohen's d measures the standardized difference between two group means. To put d = 0.09 in perspective:

  • d = 0.2 is conventionally considered a "small" effect.
  • d = 0.09 is less than half of "small".
  • At d = 0.09, the two distributions overlap by approximately 96%. If you randomly picked one fly from the trained group and one from the untrained group, you would correctly guess which was which only about 52.5% of the time (barely better than a coin flip).
  • The reason the p-value is so extreme (p < 1e-200) is purely a function of sample size: with ~230,000 data points per group, even a trivially small mean difference becomes "statistically significant."

Imagine two bell curves that are nearly perfectly superimposed, with one shifted a hair to the right. That is what d = 0.09 looks like. No biologist would claim these groups are meaningfully different based on that overlap.
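The overlap and guessing-accuracy figures above follow directly from the normal model. A quick check, assuming two equal-variance normal distributions whose means differ by d:

```python
import math

d = 0.09

# Standard normal CDF
phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Overlapping coefficient of two unit-variance normals whose means differ by d
overlap = 2 * phi(-d / 2)

# Common-language effect size: probability that a random draw from the shifted
# group beats a random draw from the other -- i.e., the correct-guess rate
superiority = phi(d / math.sqrt(2))

print(f"overlap = {overlap:.1%}, correct-guess rate = {superiority:.1%}")
# -> overlap = 96.4%, correct-guess rate = 52.5%
```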

2.3 The Pseudoreplication Problem

This is the fatal flaw in the current analysis. Here is the logic:

  1. We have N = 18 trained ROIs and N = 18 untrained ROIs (the true biological replicates).
  2. Each ROI contributes thousands of time-point measurements (one distance value every ~40ms).
  3. The current code concatenates all time points from all ROIs into a single array per group, yielding ~230,000 "observations" per group.
  4. A t-test is run on these 230,000 vs 230,000 values.

The problem: time points within a single ROI are not independent observations. The distance at time t is highly correlated with the distance at time t+40ms (the flies do not teleport between frames). Furthermore, all time points from one ROI reflect the behavior of the same pair of flies.

This is called pseudoreplication -- treating non-independent measurements as independent samples. It inflates your degrees of freedom from ~34 (the true df for 18 vs 18) to ~460,000, making your standard error artificially tiny and your p-value absurdly small.

The correct N for any between-group test is the number of ROIs (biological replicates), not the number of time points.
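The inflation is easy to reproduce in simulation. The sketch below uses purely synthetic data with no group difference anywhere, and compares the pooled (pseudoreplicated) test against the per-ROI test; all parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_group(n_rois=18, n_points=3000):
    """Each ROI: a random baseline plus autocorrelated frame-to-frame noise."""
    rois = []
    for _ in range(n_rois):
        baseline = rng.normal(100, 15)                        # between-ROI variation
        drift = np.cumsum(rng.normal(0, 1, n_points)) * 0.05  # correlated frames
        rois.append(baseline + drift)
    return rois

a, b = simulate_group(), simulate_group()  # two groups, identical by construction

# Pooled (pseudoreplicated): every frame treated as independent, df ~ 100,000
_, p_pooled = stats.ttest_ind(np.concatenate(a), np.concatenate(b))

# Correct: one summary value per biological replicate, df = 34
_, p_roi = stats.ttest_ind([r.mean() for r in a], [r.mean() for r in b])

print(f"pooled p = {p_pooled:.3g}, per-ROI p = {p_roi:.3g}")
```

With no true difference, the per-ROI p-value behaves like a proper null p-value, while the pooled p-value is routinely tiny: between-ROI baseline variation masquerades as a group effect once every frame counts as an observation.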

2.4 The Signal Dilution Problem

Even if we fix the pseudoreplication (by computing per-ROI summary statistics), the aggregate approach still has a deeper issue. It assumes that the training effect is uniform -- that every trained fly learns equally. But what if only a fraction of trained flies actually acquire the conditioned response?

If 6 out of 18 trained ROIs contain flies that genuinely learned (showing reduced distance or faster approach), while the other 12 trained ROIs contain flies that failed to learn (behaving identically to untrained flies), then:

  • The mean of the trained group is pulled toward the untrained mean by the 12 non-learners.
  • The effect size is diluted by a factor proportional to the non-learner fraction.
  • No amount of data accumulation will fix this -- you are averaging signal with noise.

This is why we need the bimodal hypothesis.
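The dilution can be quantified. If a fraction f of trained ROIs shift their mean by d_true standard deviations while the rest do not move, the observed group-level effect is the mean shift f·d_true divided by the inflated mixture SD. A small sketch of that arithmetic:

```python
import math

def diluted_d(f, d_true):
    """Observed Cohen's d when only a fraction f of one group shifts by d_true SDs."""
    mean_shift = f * d_true
    # Mixture variance = within-component variance (1) + between-component spread
    mixture_var = 1 + f * (1 - f) * d_true ** 2
    return mean_shift / math.sqrt(mixture_var)

# 6 learners out of 18, each with a genuinely large effect (d_true = 1.0):
print(round(diluted_d(6 / 18, 1.0), 2))  # -> 0.3
```

So even a large true effect confined to one third of the animals surfaces as d ≈ 0.30 at the group level; smaller learner fractions or effects push the observed value toward the 0.09 actually measured.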


3. The Bimodal Hypothesis

3.1 The Core Prediction

The hypothesis is simple:

The trained group is not one population -- it is a mixture of two subpopulations: true learners and non-learners.

Concretely, when you compute a single summary statistic for each ROI (e.g., the mean distance in the first 300 seconds after barrier opening), you should see:

  • Untrained group: A single cluster (unimodal distribution). All untrained flies behave similarly because none of them received conditioning.
  • Trained group: Two clusters (bimodal distribution).
    • Cluster 1 (non-learners): These flies did not acquire the conditioned response. Their summary statistics overlap with the untrained distribution.
    • Cluster 2 (true learners): These flies did learn. Their summary statistics are shifted -- e.g., lower mean distance (they approach the other fly), higher fraction of time at close proximity, or different velocity profiles.

3.2 Why This Matters

If the bimodal hypothesis is correct, the appropriate analysis is not "trained vs untrained" as a whole, but rather:

  1. Identify which trained ROIs are learners vs non-learners.
  2. Compare learners vs untrained (this should show a large effect size).
  3. Report the learning rate (what fraction of trained flies actually learned).

This transforms a "disappointing d = 0.09" result into a potentially strong finding: "X% of trained flies show a robust behavioral shift (d = ?) while the remainder do not learn."

3.3 Visual Intuition

Imagine plotting a histogram of "mean post-opening distance" for each ROI:

Untrained (N=18):          Trained (N=18):

  Count                      Count
  |                          |
  |   ****                   |   ***         **
  |  ******                  |  *****       ****
  | ********                 | *******     ******
  |**********                |*********   ********
  +-----------> distance     +-----------> distance
     (unimodal)                 (bimodal)

The untrained histogram has one peak. The trained histogram has two peaks: one overlapping with the untrained distribution (non-learners) and one shifted to the left (learners who maintain closer proximity).


4. The Methodology Shift: Per-ROI Summary Statistics

4.1 The Key Idea

Instead of feeding raw time-series data into group comparisons, collapse each ROI into a single summary value (or a small set of features). This:

  1. Eliminates pseudoreplication (each ROI contributes exactly one data point).
  2. Gives you the true sample size (N per group = number of ROIs).
  3. Enables distribution-level analyses (bimodality testing, mixture modeling).

4.2 Which Summary Statistics to Compute

For each unique (machine_name, ROI) combination, compute the following over the post-opening window (aligned_time > 0, typically 0 to 300,000 ms, i.e., 0 to 300 seconds):

| Feature | Description | Rationale |
| --- | --- | --- |
| mean_distance | Mean distance across all post-opening time points | Overall proximity measure |
| median_distance | Median distance | Robust to outliers (e.g., tracking artifacts) |
| frac_close_50 | Fraction of time points where distance < 50 px | Captures how often flies are in close contact |
| frac_close_25 | Fraction of time points where distance < 25 px | Stricter proximity threshold |
| distance_slope | Slope of a linear fit to distance vs time (post-opening) | Captures whether flies approach over time |
| mean_velocity | Mean of max_velocity in the 50-200 s window | Activity / approach speed (from velocity data) |
| distance_cv | Coefficient of variation of distance | Behavioral variability within a trial |

You may also want a pre-opening baseline for each ROI (e.g., mean distance for aligned_time < 0) to use as a covariate or to compute a change score (post - pre) that controls for baseline chamber differences.
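A change score of this kind can be sketched as follows; the rows here are synthetic stand-ins for the combined DataFrame of Phase 1 (same column names, placeholder machine name):

```python
import pandas as pd

df = pd.DataFrame({
    "machine_name": ["m1"] * 8,              # placeholder machine name
    "ROI": [1, 1, 1, 1, 2, 2, 2, 2],
    "aligned_time": [-40, -20, 20, 40] * 2,  # ms around barrier opening
    "distance": [100.0, 110.0, 60.0, 70.0, 90.0, 95.0, 92.0, 93.0],
    "group": ["trained"] * 8,
})

keys = ["machine_name", "ROI"]
pre_mean = df[df["aligned_time"] < 0].groupby(keys)["distance"].mean()
post_mean = df[df["aligned_time"] > 0].groupby(keys)["distance"].mean()

# Negative change = flies moved closer after the barrier opened
change = (post_mean - pre_mean).rename("distance_change")
print(change)
```

ROI 1 here drops from a pre-opening mean of 105 to a post-opening mean of 65 (change = -40), while ROI 2 is unchanged -- exactly the kind of baseline-corrected contrast the covariate approach needs.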

4.3 Why 0-300 Seconds?

The first 300 seconds (5 minutes) after barrier opening is the critical window where social approach behavior is most apparent. Beyond this, flies may habituate or begin other behaviors. The existing plots (avg_distance_300_seconds.png) focus on this window for the same reason.

You can and should also try other windows (e.g., 0-120s for early approach, 60-300s to skip the initial startle) to check robustness.
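Window robustness is easiest to check with a single parameterized helper. A sketch, reusing the Phase 1 column names on synthetic rows:

```python
import pandas as pd

def window_mean_distance(df, start_ms, end_ms):
    """Per-ROI mean distance over aligned_time in (start_ms, end_ms]."""
    win = df[(df["aligned_time"] > start_ms) & (df["aligned_time"] <= end_ms)]
    return win.groupby(["machine_name", "ROI"])["distance"].mean()

# The candidate windows from the text, in milliseconds
WINDOWS = {
    "0-300s": (0, 300_000),
    "0-120s": (0, 120_000),        # early approach
    "60-300s": (60_000, 300_000),  # skip the initial startle
}

# Tiny synthetic example (one ROI, three post-opening frames)
df = pd.DataFrame({
    "machine_name": ["m1"] * 3,
    "ROI": [1, 1, 1],
    "aligned_time": [30_000, 90_000, 200_000],
    "distance": [90.0, 60.0, 30.0],
})

for name, (lo, hi) in WINDOWS.items():
    print(name, window_mean_distance(df, lo, hi).round(1).to_dict())
```

If the bimodal split (or its absence) survives across all three windows, it is unlikely to be an artifact of the window choice.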


5. Step-by-Step Analysis Code

All code below uses Python with pandas, numpy, scipy, matplotlib, and scikit-learn. Run everything within the project virtual environment.

Phase 1: Per-ROI Feature Extraction

"""
Phase 1: Compute per-ROI summary statistics from the aligned distance data.

This is the foundation of the bimodal analysis. Each ROI becomes a single
data point instead of thousands of time-series observations.
"""
import pandas as pd
import numpy as np
from scipy import stats as sp_stats
from pathlib import Path

# ------------------------------------------------------------------
# Load data
# ------------------------------------------------------------------
DATA_PROCESSED = Path("data/processed")

trained = pd.read_csv(DATA_PROCESSED / "trained_distances_aligned.csv")
untrained = pd.read_csv(DATA_PROCESSED / "untrained_distances_aligned.csv")

# Combine into one DataFrame for uniform processing
df = pd.concat([trained, untrained], ignore_index=True)

# Verify the true sample sizes
roi_counts = (
    df.drop_duplicates(subset=["group", "machine_name", "ROI"])
    .groupby("group")
    .size()
)
print("ROIs per group:")
print(roi_counts)
# Expected: trained ~18, untrained ~18

# ------------------------------------------------------------------
# Define the post-opening analysis window
# aligned_time is in milliseconds; 300 seconds = 300,000 ms
# ------------------------------------------------------------------
POST_START_MS = 0
POST_END_MS = 300_000

post = df[(df["aligned_time"] > POST_START_MS) & (df["aligned_time"] <= POST_END_MS)].copy()

# Drop rows where distance is NaN (frames with only one small fly detected)
post = post.dropna(subset=["distance"])

# ------------------------------------------------------------------
# Compute per-ROI features using groupby
# ------------------------------------------------------------------
def compute_roi_features(group_df):
    """Compute summary features for a single ROI's post-opening data.

    Args:
        group_df (pd.DataFrame): All post-opening rows for one ROI.

    Returns:
        pd.Series: Summary features.
    """
    d = group_df["distance"]
    return pd.Series({
        "mean_distance": d.mean(),
        "median_distance": d.median(),
        "std_distance": d.std(),
        "frac_close_50": (d < 50).mean(),     # fraction of time at < 50 px
        "frac_close_25": (d < 25).mean(),     # fraction of time at < 25 px
        "distance_cv": d.std() / d.mean() if d.mean() > 0 else np.nan,
        "n_timepoints": len(d),
        "group": group_df["group"].iloc[0],
    })

roi_features = (
    post.groupby(["machine_name", "ROI"])
    .apply(compute_roi_features)
    .reset_index()
)

print(f"\nPer-ROI features computed: {len(roi_features)} ROIs")
print(roi_features.head(10))

# ------------------------------------------------------------------
# Split by group for downstream analysis
# ------------------------------------------------------------------
trained_features = roi_features[roi_features["group"] == "trained"].copy()
untrained_features = roi_features[roi_features["group"] == "untrained"].copy()

print(f"\nTrained ROIs: {len(trained_features)}")
print(f"Untrained ROIs: {len(untrained_features)}")

# Save for later use
roi_features.to_csv(DATA_PROCESSED / "roi_summary_features.csv", index=False)
print("Saved roi_summary_features.csv")

What this gives you: A DataFrame with one row per ROI and columns for each summary statistic. This is your true dataset for all subsequent analyses.

Phase 2: Visualization -- Histograms and KDE Plots

"""
Phase 2: Visualize the per-ROI feature distributions.

The key question: does the trained group look bimodal while the
untrained group looks unimodal?
"""
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

Path("figures").mkdir(exist_ok=True)  # plt.savefig below fails if figures/ is missing

# ------------------------------------------------------------------
# Histogram + KDE for each feature, side by side
# ------------------------------------------------------------------
features_to_plot = ["mean_distance", "median_distance", "frac_close_50"]

fig, axes = plt.subplots(len(features_to_plot), 2, figsize=(12, 4 * len(features_to_plot)))

for i, feature in enumerate(features_to_plot):
    # Left panel: Untrained
    ax_u = axes[i, 0]
    vals_u = untrained_features[feature].dropna().values
    ax_u.hist(vals_u, bins=8, density=True, alpha=0.6, color="steelblue", edgecolor="black")
    if len(vals_u) > 2:
        sns.kdeplot(vals_u, ax=ax_u, color="navy", linewidth=2)
    ax_u.set_title(f"Untrained -- {feature}")
    ax_u.set_xlabel(feature)
    ax_u.set_ylabel("Density")

    # Right panel: Trained
    ax_t = axes[i, 1]
    vals_t = trained_features[feature].dropna().values
    ax_t.hist(vals_t, bins=8, density=True, alpha=0.6, color="salmon", edgecolor="black")
    if len(vals_t) > 2:
        sns.kdeplot(vals_t, ax=ax_t, color="darkred", linewidth=2)
    ax_t.set_title(f"Trained -- {feature}")
    ax_t.set_xlabel(feature)
    ax_t.set_ylabel("Density")

plt.tight_layout()
plt.savefig("figures/bimodal_roi_distributions.png", dpi=200, bbox_inches="tight")
plt.show()

# ------------------------------------------------------------------
# Strip/swarm plot overlay: both groups on one axis
# ------------------------------------------------------------------
fig, axes = plt.subplots(1, len(features_to_plot), figsize=(5 * len(features_to_plot), 5))

for i, feature in enumerate(features_to_plot):
    ax = axes[i]
    sns.stripplot(
        data=roi_features, x="group", y=feature, hue="group", ax=ax,
        jitter=True, alpha=0.7, size=8,
        palette={"trained": "salmon", "untrained": "steelblue"},
    )
    if ax.get_legend() is not None:
        ax.get_legend().remove()  # hue mirrors the x axis; the legend is redundant
    ax.set_title(feature)

plt.tight_layout()
plt.savefig("figures/bimodal_stripplots.png", dpi=200, bbox_inches="tight")
plt.show()

What to look for:

  • The untrained histograms should show a single cluster of points.
  • The trained histograms should show two clusters -- some ROIs overlapping with the untrained values (non-learners) and some ROIs clearly separated (learners).
  • In the strip plots, look for a gap or split among the trained dots that does not appear in the untrained dots.

Phase 3: Hartigan's Dip Test for Bimodality

The dip test is a formal statistical test of unimodality. A significant p-value means the distribution is unlikely to be unimodal.

"""
Phase 3: Hartigan's dip test for bimodality.

Install the package first:
    pip install diptest
"""
import diptest

# ------------------------------------------------------------------
# Run dip test on each group for each feature
# ------------------------------------------------------------------
features_to_test = ["mean_distance", "median_distance", "frac_close_50"]

print("=== HARTIGAN'S DIP TEST ===\n")

for feature in features_to_test:
    print(f"Feature: {feature}")

    for group_name, group_df in [("trained", trained_features), ("untrained", untrained_features)]:
        values = group_df[feature].dropna().values

        if len(values) < 4:
            print(f"  {group_name}: Not enough data points (N={len(values)})")
            continue

        dip_stat, p_value = diptest.diptest(values)
        print(f"  {group_name}: dip statistic = {dip_stat:.4f}, p-value = {p_value:.4f}")

        if p_value < 0.05:
            print(f"    --> SIGNIFICANT: Evidence against unimodality (p < 0.05)")
        else:
            print(f"    --> Not significant: Consistent with unimodality")

    print()

Interpreting the dip test:

  • Null hypothesis: The distribution is unimodal.
  • Alternative: The distribution has more than one mode (bimodal, multimodal).
  • If the trained group yields p < 0.05 and the untrained group does not, this is direct evidence for the bimodal hypothesis.
  • Caution: With N=18, the dip test has limited power. A non-significant result does not prove unimodality -- it may simply mean the sample is too small to detect bimodality. This is why we also use Gaussian Mixture Models (next phase).

Phase 4: Gaussian Mixture Model (GMM) Fitting

A more informative approach than the dip test: fit 1-component and 2-component Gaussian mixture models and compare them using the Bayesian Information Criterion (BIC).

"""
Phase 4: Fit Gaussian Mixture Models to the per-ROI feature distributions.

Compare 1-component (unimodal) vs 2-component (bimodal) models using BIC.
Lower BIC = better model fit, penalizing for complexity.
"""
from sklearn.mixture import GaussianMixture
import numpy as np

# ------------------------------------------------------------------
# Fit GMMs and compare BIC
# ------------------------------------------------------------------
features_to_fit = ["mean_distance", "median_distance", "frac_close_50"]

print("=== GAUSSIAN MIXTURE MODEL COMPARISON ===\n")

gmm_results = {}

for feature in features_to_fit:
    print(f"Feature: {feature}")

    for group_name, group_df in [("trained", trained_features), ("untrained", untrained_features)]:
        values = group_df[feature].dropna().values.reshape(-1, 1)

        if len(values) < 4:
            print(f"  {group_name}: Not enough data points")
            continue

        # Fit 1-component GMM (unimodal)
        gmm1 = GaussianMixture(n_components=1, random_state=42)
        gmm1.fit(values)
        bic1 = gmm1.bic(values)

        # Fit 2-component GMM (bimodal)
        gmm2 = GaussianMixture(n_components=2, random_state=42)
        gmm2.fit(values)
        bic2 = gmm2.bic(values)

        delta_bic = bic1 - bic2  # positive = 2-component model is better

        print(f"  {group_name}:")
        print(f"    BIC (1 component): {bic1:.2f}")
        print(f"    BIC (2 components): {bic2:.2f}")
        print(f"    Delta BIC (1 - 2): {delta_bic:.2f}")

        if delta_bic > 10:
            print(f"    --> STRONG evidence for 2 components (Delta BIC > 10)")
        elif delta_bic > 2:
            print(f"    --> Moderate evidence for 2 components")
        elif delta_bic > 0:
            print(f"    --> Weak evidence for 2 components")
        else:
            print(f"    --> 1 component (unimodal) preferred")

        if group_name == "trained":
            gmm_results[feature] = {
                "gmm1": gmm1,
                "gmm2": gmm2,
                "bic1": bic1,
                "bic2": bic2,
                "delta_bic": delta_bic,
            }

    print()

Interpreting Delta BIC:

| Delta BIC (BIC_1comp - BIC_2comp) | Interpretation |
| --- | --- |
| > 10 | Strong evidence that 2 components fit better |
| 2-10 | Moderate evidence for 2 components |
| 0-2 | Weak / negligible evidence |
| < 0 | 1-component model preferred (unimodal) |

Expected result if hypothesis is correct:

  • Trained group: Delta BIC >> 0 (2 components preferred)
  • Untrained group: Delta BIC <= 0 (1 component preferred)

Phase 5: Subgroup Classification

If the 2-component GMM is preferred for the trained group, use it to classify each trained ROI as a "learner" or "non-learner."

"""
Phase 5: Classify trained ROIs into learners vs non-learners
using GMM posterior probabilities.
"""
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture  # needed: this phase refits the GMM

# ------------------------------------------------------------------
# Choose the feature with the best bimodal evidence
# ------------------------------------------------------------------
# (Adjust this based on Phase 4 results -- use the feature with
#  the highest Delta BIC for the trained group)
BEST_FEATURE = "mean_distance"

values_trained = trained_features[BEST_FEATURE].dropna().values.reshape(-1, 1)

# Fit 2-component GMM
gmm2 = GaussianMixture(n_components=2, random_state=42)
gmm2.fit(values_trained)

# Get posterior probabilities: P(component | data point)
probs = gmm2.predict_proba(values_trained)  # shape: (N, 2)
labels = gmm2.predict(values_trained)         # hard assignment

# Identify which component is the "learner" component
# Learners should have LOWER mean distance (approach behavior)
component_means = gmm2.means_.flatten()
learner_component = np.argmin(component_means)  # lower distance = learner
non_learner_component = 1 - learner_component

print(f"Component means: {component_means}")
print(f"Learner component: {learner_component} (mean = {component_means[learner_component]:.2f})")
print(f"Non-learner component: {non_learner_component} (mean = {component_means[non_learner_component]:.2f})")

# ------------------------------------------------------------------
# Assign labels to trained ROIs
# ------------------------------------------------------------------
trained_features_classified = trained_features.dropna(subset=[BEST_FEATURE]).copy()
trained_features_classified["learner_prob"] = probs[:, learner_component]
trained_features_classified["is_learner"] = labels == learner_component

n_learners = trained_features_classified["is_learner"].sum()
n_total = len(trained_features_classified)
print(f"\nClassification results:")
print(f"  Learners: {n_learners} / {n_total} ({100 * n_learners / n_total:.1f}%)")
print(f"  Non-learners: {n_total - n_learners} / {n_total} ({100 * (n_total - n_learners) / n_total:.1f}%)")

# ------------------------------------------------------------------
# Visualization: GMM fit with classification
# ------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8, 5))

x_range = np.linspace(values_trained.min() - 10, values_trained.max() + 10, 300).reshape(-1, 1)

# Plot overall GMM density
log_dens = gmm2.score_samples(x_range)
ax.plot(x_range, np.exp(log_dens), color="black", linewidth=2, label="GMM fit (2 components)")

# Plot individual component densities
weights = gmm2.weights_
means = gmm2.means_.flatten()
covariances = gmm2.covariances_.flatten()

for k in range(2):
    component_dens = (
        weights[k]
        * (1 / np.sqrt(2 * np.pi * covariances[k]))
        * np.exp(-0.5 * (x_range.flatten() - means[k]) ** 2 / covariances[k])
    )
    lbl = "Learners" if k == learner_component else "Non-learners"
    clr = "green" if k == learner_component else "gray"
    ax.plot(x_range, component_dens, color=clr, linewidth=1.5, linestyle="--", label=lbl)

# Scatter the actual ROI values
for _, row in trained_features_classified.iterrows():
    color = "green" if row["is_learner"] else "gray"
    ax.axvline(row[BEST_FEATURE], color=color, alpha=0.4, linewidth=1)

ax.set_xlabel(BEST_FEATURE)
ax.set_ylabel("Density")
ax.set_title("GMM Classification of Trained ROIs")
ax.legend()
plt.tight_layout()
plt.savefig("figures/gmm_classification.png", dpi=200, bbox_inches="tight")
plt.show()

# ------------------------------------------------------------------
# Print per-ROI classification table
# ------------------------------------------------------------------
print("\nPer-ROI Classification:")
print(trained_features_classified[
    ["machine_name", "ROI", BEST_FEATURE, "learner_prob", "is_learner"]
].to_string(index=False))

Important note about the classification threshold: The GMM predict() method assigns each point to the component with the highest posterior probability. For borderline cases (learner_prob near 0.5), you may want to be conservative and only label ROIs as learners if learner_prob > 0.8. Report your threshold and sensitivity analysis.
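A sensitivity check over cutoffs is only a few lines. The posterior values below are made up for illustration; in practice, use the learner_prob column computed in the classification step above:

```python
import numpy as np

# Hypothetical posterior probabilities of belonging to the learner component
learner_prob = np.array([0.99, 0.95, 0.91, 0.78, 0.55, 0.48, 0.12, 0.05])

for cutoff in (0.5, 0.8, 0.9):
    n_learners = int((learner_prob > cutoff).sum())
    print(f"cutoff {cutoff}: {n_learners} learners")
```

If the learner count is stable across cutoffs, the classification is robust; if it swings, report the range of counts rather than a single split.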

Phase 6: Effect Size Re-estimation

Now compare the learner subgroup against untrained flies, using appropriate small-sample statistics.

"""
Phase 6: Re-estimate effect sizes after subgroup identification.

With small N, use Mann-Whitney U (non-parametric) and bootstrap confidence intervals.
"""
from scipy.stats import mannwhitneyu
import numpy as np

# ------------------------------------------------------------------
# Setup: get the feature values for each group
# ------------------------------------------------------------------
FEATURE = "mean_distance"

learner_vals = trained_features_classified.loc[
    trained_features_classified["is_learner"], FEATURE
].values

non_learner_vals = trained_features_classified.loc[
    ~trained_features_classified["is_learner"], FEATURE
].values

untrained_vals = untrained_features[FEATURE].dropna().values

print(f"Group sizes:")
print(f"  Learners: N = {len(learner_vals)}")
print(f"  Non-learners: N = {len(non_learner_vals)}")
print(f"  Untrained: N = {len(untrained_vals)}")

# ------------------------------------------------------------------
# Mann-Whitney U test: Learners vs Untrained
# ------------------------------------------------------------------
if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
    u_stat, p_value = mannwhitneyu(learner_vals, untrained_vals, alternative="two-sided")
    print(f"\nMann-Whitney U test (Learners vs Untrained):")
    print(f"  U = {u_stat:.1f}, p = {p_value:.4f}")

    # Rank-biserial correlation as non-parametric effect size
    n1, n2 = len(learner_vals), len(untrained_vals)
    rank_biserial = 1 - (2 * u_stat) / (n1 * n2)
    print(f"  Rank-biserial correlation (effect size): r = {rank_biserial:.3f}")
else:
    print("\nNot enough data points for Mann-Whitney U test.")

# ------------------------------------------------------------------
# Mann-Whitney U test: Non-learners vs Untrained
# (These should NOT differ if classification is correct)
# ------------------------------------------------------------------
if len(non_learner_vals) >= 3 and len(untrained_vals) >= 3:
    u_stat_nl, p_value_nl = mannwhitneyu(non_learner_vals, untrained_vals, alternative="two-sided")
    print(f"\nMann-Whitney U test (Non-learners vs Untrained):")
    print(f"  U = {u_stat_nl:.1f}, p = {p_value_nl:.4f}")
    print(f"  (Expected: NOT significant -- non-learners should resemble untrained)")

# ------------------------------------------------------------------
# Bootstrap confidence interval for the mean difference
# ------------------------------------------------------------------
def bootstrap_mean_diff(group_a, group_b, n_bootstrap=10000, ci=95, seed=42):
    """Compute bootstrap CI for difference in means (A - B).

    Args:
        group_a (np.ndarray): Values from group A.
        group_b (np.ndarray): Values from group B.
        n_bootstrap (int): Number of bootstrap iterations.
        ci (float): Confidence interval percentage.
        seed (int): Random seed for reproducibility.

    Returns:
        tuple: (observed_diff, ci_lower, ci_upper).
    """
    rng = np.random.default_rng(seed)
    observed_diff = np.mean(group_a) - np.mean(group_b)

    boot_diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        boot_a = rng.choice(group_a, size=len(group_a), replace=True)
        boot_b = rng.choice(group_b, size=len(group_b), replace=True)
        boot_diffs[i] = np.mean(boot_a) - np.mean(boot_b)

    alpha = (100 - ci) / 2
    ci_lower = np.percentile(boot_diffs, alpha)
    ci_upper = np.percentile(boot_diffs, 100 - alpha)

    return observed_diff, ci_lower, ci_upper

if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
    diff, ci_lo, ci_hi = bootstrap_mean_diff(learner_vals, untrained_vals)
    print(f"\nBootstrap 95% CI for mean difference (Learners - Untrained):")
    print(f"  Observed difference: {diff:.2f}")
    print(f"  95% CI: [{ci_lo:.2f}, {ci_hi:.2f}]")

    if ci_lo > 0 or ci_hi < 0:
        print(f"  --> CI does not include zero: significant difference")
    else:
        print(f"  --> CI includes zero: difference not significant at 95% level")

# ------------------------------------------------------------------
# Cohen's d with Hedges' g correction (for small N)
# ------------------------------------------------------------------
def hedges_g(group_a, group_b):
    """Compute Hedges' g (bias-corrected Cohen's d for small samples).

    Args:
        group_a (np.ndarray): Values from group A.
        group_b (np.ndarray): Values from group B.

    Returns:
        float: Hedges' g effect size.
    """
    n1, n2 = len(group_a), len(group_b)
    mean_diff = np.mean(group_a) - np.mean(group_b)
    pooled_std = np.sqrt(
        ((n1 - 1) * np.var(group_a, ddof=1) + (n2 - 1) * np.var(group_b, ddof=1))
        / (n1 + n2 - 2)
    )
    d = mean_diff / pooled_std if pooled_std > 0 else np.nan

    # Hedges' correction factor for small-sample bias
    correction = 1 - (3 / (4 * (n1 + n2) - 9))
    return d * correction

if len(learner_vals) >= 3 and len(untrained_vals) >= 3:
    g = hedges_g(learner_vals, untrained_vals)
    print(f"\nHedges' g (Learners vs Untrained): {g:.3f}")
    print(f"  (Compare to the original aggregate Cohen's d = 0.09)")

What to expect: If the bimodal hypothesis is correct, the Hedges' g for learners vs untrained should be substantially larger than the original d = 0.09, potentially in the "medium" (0.5) or "large" (0.8) range. The non-learners vs untrained comparison should be non-significant with a near-zero effect size.


6. Expected Results and Interpretation

6.1 If the Bimodal Hypothesis Is Correct

You should see:

  1. Histograms/KDE: The trained group shows two visible clusters in at least one feature (e.g., mean_distance). The untrained group has a single peak.

  2. Dip test: The trained group yields a significant (or trending) dip test result; the untrained group does not. (But remember: with N=18, power is limited.)

  3. GMM comparison: The 2-component model has a substantially lower BIC than the 1-component model for the trained group. For the untrained group, the 1-component model is preferred (or the difference is negligible).

  4. Classification: The GMM identifies a clear split -- some trained ROIs cluster with untrained values, others are distinctly separated.

  5. Effect sizes: Learners vs untrained shows a medium-to-large effect size (Hedges' g > 0.5). Non-learners vs untrained shows g near zero.

  6. Bootstrap CIs: The confidence interval for the learner-vs-untrained mean difference does not include zero.

This would be the ideal outcome. It means the training works, but only on a subset of animals -- a biologically plausible result, since learning rates in Drosophila conditioning experiments are typically well below 100%.
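As a concrete sketch of prediction 3, here is how the 1- vs 2-component BIC comparison might look on synthetic per-ROI values (sklearn's GaussianMixture is assumed as the GMM implementation; all numbers are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic per-ROI mean distances: 2-cluster "trained", 1-cluster "untrained"
trained = np.concatenate([rng.normal(100, 12, 6),
                          rng.normal(160, 12, 12)]).reshape(-1, 1)
untrained = rng.normal(155, 15, 18).reshape(-1, 1)

for name, vals in [("trained", trained), ("untrained", untrained)]:
    # Fit 1- and 2-component mixtures and compare BIC (lower is better)
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(vals).bic(vals)
           for k in (1, 2)}
    print(f"{name}: BIC(1)={bic[1]:.1f}  BIC(2)={bic[2]:.1f}  "
          f"Delta(1-2)={bic[1] - bic[2]:.1f}")
```

Under the bimodal hypothesis you would expect a large positive Delta(1-2) for the trained group and a negative (or negligible) Delta for the untrained group.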

6.2 If the Bimodal Hypothesis Is NOT Supported

If the distributions look unimodal for both groups (dip test non-significant, 1-component GMM preferred), there are several possible conclusions:

  1. The training genuinely has no effect (or an effect too small to detect with this sample size). The aggregate d = 0.09 is the real signal -- negligible.

  2. The training has a uniform but tiny effect on all flies, rather than a large effect on a subset. Both distributions shift slightly. This is hard to distinguish from "no effect" with N=18.

  3. The bimodal split exists but on a different feature than the ones you tested. Try additional features: time-to-first-contact, latency to approach within 25px, peak velocity in the first 60 seconds, etc.

  4. The sample size is too small to detect bimodality. With N=18, the dip test and GMM have limited power, especially if the learner fraction is small (e.g., 3 out of 18 learners). Consider whether additional data can be collected.

  5. The tracking data quality masks the signal. Review individual ROI time series plots to check for artifacts (lost tracking, stuck detections, distance = 0 for extended periods). If some ROIs have unreliable data, they should be flagged or excluded.

Be honest about null results. A clear negative finding ("training does not produce bimodal behavior in this assay") is still a valuable scientific result. Do not p-hack or cherry-pick features until something is significant.

6.3 Sanity Check: Individual ROI Time Series

Before trusting the aggregate statistics, always look at the raw data. Plot the distance time series for each ROI individually:

"""
Sanity check: Plot individual ROI distance traces.
"""
import matplotlib.pyplot as plt

fig, axes = plt.subplots(6, 6, figsize=(24, 24), sharex=True, sharey=True)
axes = axes.flatten()

# Plot every ROI (both groups); color encodes training condition.
# Size the grid to cover all ROIs -- the break below silently drops
# any that do not fit.
for idx, ((machine, roi), roi_data) in enumerate(
    post.groupby(["machine_name", "ROI"])
):
    if idx >= len(axes):
        break
    ax = axes[idx]
    group = roi_data["group"].iloc[0]
    color = "salmon" if group == "trained" else "steelblue"

    ax.plot(
        roi_data["aligned_time"] / 1000,  # convert to seconds
        roi_data["distance"],
        color=color, alpha=0.5, linewidth=0.5,
    )
    ax.set_title(f"M{machine} ROI{roi} ({group})", fontsize=9)
    ax.set_ylim(0, 300)

for ax in axes:
    ax.set_xlabel("Time post-opening (s)")
    ax.set_ylabel("Distance (px)")

plt.tight_layout()
plt.savefig("figures/individual_roi_traces.png", dpi=150, bbox_inches="tight")
plt.show()

This is invaluable for spotting tracking artifacts and getting intuition for whether there really is a subgroup of trained ROIs that behave differently.


7. Statistical Considerations

7.1 Small Sample Size (N=18 per Group)

With only 18 ROIs per group, you are working with a very small sample. This has several implications:

  • Parametric tests (t-tests) are unreliable unless the data are approximately normally distributed. Always check with a Shapiro-Wilk test (scipy.stats.shapiro(values)) and use non-parametric alternatives (Mann-Whitney U) when normality is violated.

  • The dip test has low power at N=18. A non-significant result is ambiguous -- it could mean the data are truly unimodal, or it could mean N is too small to detect bimodality. The GMM + BIC approach is somewhat more sensitive but still limited.

  • Confidence intervals are wide. This is expected and honest. Report CIs alongside point estimates so readers understand the uncertainty.

  • Do not use the t-test on per-ROI summary statistics and call it a "t-test on 230K observations." The N is 18, not 230K. This is the whole point of the methodology shift.
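A minimal sketch of the check-normality-then-choose-test workflow described above, on toy skewed data standing in for the per-ROI summary values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Skewed toy data (lognormal) standing in for per-ROI summary values
group_a = rng.lognormal(mean=4.5, sigma=0.4, size=18)
group_b = rng.lognormal(mean=4.7, sigma=0.4, size=18)

# Check normality of each group before reaching for a t-test
normal = all(stats.shapiro(g)[1] > 0.05 for g in (group_a, group_b))

if normal:
    stat, p = stats.ttest_ind(group_a, group_b)
    test_name = "t-test"
else:
    # Non-parametric fallback when normality is violated
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test_name = "Mann-Whitney U"

print(f"{test_name}: stat = {stat:.2f}, p = {p:.4f}")
```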

7.2 Why Bootstrap Confidence Intervals

With N=18, the Central Limit Theorem provides only a rough approximation. Bootstrap CIs:

  • Make no distributional assumptions.
  • Work well even for non-normal, skewed, or heavy-tailed distributions.
  • Give a direct estimate of the sampling variability of your effect size.

Use at least 10,000 bootstrap iterations. For publication, consider 100,000.

7.3 Recording Session as a Random Effect

ROIs within the same recording session share the same machine, lighting, temperature, and barrier-opening event. This creates a nested structure:

Group (trained/untrained)
  -> Session (machine x date: 076-session1, 076-session2, 145-session1, ...)
    -> ROI (1-6 within each session)

Ideally, you would account for this using a mixed-effects model with session as a random intercept:

"""
Mixed-effects model (optional, for advanced analysis).

Install: pip install statsmodels
"""
import statsmodels.formula.api as smf

# Add a session identifier.
# WARNING: this placeholder assigns a different ID to every ROI within a
# machine (cumcount counts rows, not recording sessions), so it does NOT
# actually group ROIs by session. Replace it with a real session ID built
# from the metadata (machine name + recording date/HHMMSS).
roi_features["session"] = (
    roi_features["machine_name"].astype(str) + "_" +
    roi_features.groupby("machine_name").cumcount().astype(str)
)

# Fit mixed model: feature ~ group, random intercept for session
model = smf.mixedlm(
    "mean_distance ~ group",
    data=roi_features,
    groups=roi_features["session"],
)
result = model.fit()
print(result.summary())

However, with only 5 sessions total (and some sessions contributing to only one group), a mixed model may be overparameterized. In practice, for this dataset:

  • At minimum, verify that results hold when you average across sessions first (i.e., compute session-level means, then compare groups with N=5 rather than N=18).
  • If session-level analysis gives the same qualitative result as ROI-level analysis, the session clustering is not driving your findings.
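The session-level sanity check can be sketched as follows, using a toy stand-in for the per-ROI feature table (column names assumed to match the roi_features table from the mixed-model example; values are invented):

```python
import pandas as pd

# Toy per-ROI feature table: 3 sessions x 6 ROIs, mixed groups per session
roi_features = pd.DataFrame({
    "session": ["076_1"] * 6 + ["076_2"] * 6 + ["145_1"] * 6,
    "group":   (["trained"] * 3 + ["untrained"] * 3) * 3,
    "mean_distance": [120, 130,  95, 150, 155, 148,
                      125,  90, 100, 152, 149, 151,
                      140, 110, 105, 147, 150, 153],
})

# Collapse ROIs to one value per session x group, then compare groups
# at the session level (N = number of sessions, not number of ROIs)
session_means = (
    roi_features
    .groupby(["session", "group"], as_index=False)["mean_distance"]
    .mean()
)
print(session_means.pivot(index="session", columns="group",
                          values="mean_distance"))
```

If the trained-vs-untrained difference survives this coarser aggregation, session clustering is unlikely to be driving it.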

7.4 Multiple Comparisons

If you test bimodality across multiple features (mean_distance, median_distance, frac_close_50, velocity, etc.), you are performing multiple comparisons. Apply a correction:

  • Bonferroni: Divide your alpha (0.05) by the number of tests. Conservative but simple.
  • Benjamini-Hochberg (FDR): Controls the false discovery rate. Less conservative, appropriate for exploratory analysis.

# Requires SciPy >= 1.11, where false_discovery_control was added
from scipy.stats import false_discovery_control

# Collect all p-values from dip tests
p_values = [0.03, 0.12, 0.45, 0.01]  # example values
adjusted = false_discovery_control(p_values, method="bh")
print("BH-adjusted p-values:", adjusted)

Be transparent about how many features you tested and which correction you applied.

7.5 Circular Analysis Warning

There is a subtle danger if you use the same data to both (a) identify learners and (b) estimate the learner-vs-untrained effect size. This is called circular analysis or double dipping -- the effect size will be inflated because the GMM "optimizes" the split to maximize the difference.

To mitigate this:

  • Use one feature for classification and a different feature for effect size estimation. For example, classify based on mean_distance but estimate the effect size on frac_close_50 or velocity.
  • Or use cross-validation: Split the trained ROIs into two halves, fit the GMM on one half, classify the other half, and estimate the effect size on the classified half only.
  • Report the classification feature and the evaluation features separately so reviewers can assess potential circularity.
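The split-half mitigation can be sketched on synthetic trained-group values (illustrative numbers; sklearn's GaussianMixture assumed as the GMM implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Synthetic trained-group per-ROI values (illustrative numbers only)
trained_vals = np.concatenate([rng.normal(100, 12, 6), rng.normal(160, 12, 12)])

# Split-half: fit the GMM on one half, classify only the held-out half
perm = rng.permutation(len(trained_vals))
fit_half = trained_vals[perm[:9]].reshape(-1, 1)
held_out = trained_vals[perm[9:]].reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(fit_half)

# Call the lower-mean component the "learner" cluster (closer distances)
learner_comp = int(np.argmin(gmm.means_.ravel()))
is_learner = gmm.predict(held_out) == learner_comp
print(f"Held-out ROIs classified as learners: {is_learner.sum()} / {len(held_out)}")
```

Because the held-out ROIs played no part in fitting the mixture, effect sizes computed on them are not inflated by the GMM optimizing the split.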

8. References

Software and Packages

  • NumPy / SciPy: array handling and statistics used throughout (scipy.stats.shapiro, mannwhitneyu, false_discovery_control).
  • scikit-learn: Gaussian mixture models for the GMM + BIC analysis (sklearn.mixture.GaussianMixture).
  • statsmodels: mixed-effects models (statsmodels.formula.api.mixedlm).
  • matplotlib: figure generation.

Statistical Concepts

  • Pseudoreplication: Hurlbert, S.H. (1984). "Pseudoreplication and the design of ecological field experiments." Ecological Monographs, 54(2), 187-211. The original paper defining the concept. Every experimental biologist should read this.

  • Hartigan's Dip Test: Hartigan, J.A. & Hartigan, P.M. (1985). "The Dip Test of Unimodality." Annals of Statistics, 13(1), 70-84.

  • BIC for Model Selection: Schwarz, G. (1978). "Estimating the Dimension of a Model." Annals of Statistics, 6(2), 461-464. Lower BIC = better model, with a penalty for additional parameters. A Delta BIC > 10 is conventionally considered "very strong" evidence.

  • Cohen's d interpretation: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). The standard reference for effect size conventions (small = 0.2, medium = 0.5, large = 0.8).

  • Hedges' g: Hedges, L.V. (1981). "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." Journal of Educational Statistics, 6(2), 107-128. Bias-corrected version of Cohen's d for small samples.

  • Bootstrap methods: Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. The definitive reference for bootstrap confidence intervals.


Summary: The Analysis Roadmap

Current state                          Goal
-------------------------------------------------------------------
230K observations per group    --->    18 per-ROI summary values per group
Cohen's d = 0.09 (96% overlap) --->    Separate learners (strong effect)
                                       from non-learners (no effect)
One giant t-test               --->    Bimodality test + GMM + subgroup comparison
Pseudoreplication              --->    Correct N, honest uncertainty

Steps in order:

  1. Compute per-ROI features (Phase 1) -- this is the prerequisite for everything else.
  2. Visualize (Phase 2) -- look before you test.
  3. Test bimodality formally (Phases 3 and 4).
  4. If bimodal: classify and re-estimate (Phases 5 and 6).
  5. If not bimodal: report honestly, consider alternative features or additional data.
  6. Regardless of outcome: check individual ROI traces, account for session effects, and apply multiple comparison corrections.

Good luck with the analysis. The bimodal hypothesis is worth testing rigorously -- either it reveals a meaningful subgroup structure in the trained flies, or it confirms that the training effect in this assay is genuinely negligible. Both are informative results.