Reorganized flat 41-file directory into structured layout with:
- scripts/ for Python analysis code with shared config.py
- notebooks/ for Jupyter analysis notebooks
- data/ split into raw/, metadata/, processed/
- docs/ with analysis summary, experimental design, and bimodal hypothesis tutorial
- tasks/ with todo checklist and lessons learned
- Comprehensive README, PLANNING.md, and .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
38 lines
1.9 KiB
Markdown
# Lessons Learned

## Pseudoreplication Pitfall

**The most important lesson in this project.**

The raw data has ~230K data points per group, but the true independent samples are ROIs (N=18 per group). Each ROI contributes thousands of correlated time points. Running t-tests on all data points inflates significance massively (p < 1e-200) while the actual effect size is negligible (Cohen's d = 0.09).

**Rule**: Always compute per-ROI summary statistics first, then compare groups at the ROI level.
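A minimal sketch of the rule, using only the standard library. The record layout (`group`, `roi_id`, `value`), the function names, and the toy numbers are assumptions for illustration, not the project's actual schema or data.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_roi_means(records):
    """Collapse (group, roi_id, value) records to one mean per ROI."""
    by_roi = defaultdict(list)
    for group, roi_id, value in records:
        by_roi[(group, roi_id)].append(value)

    by_group = defaultdict(list)
    for (group, _roi_id), values in sorted(by_roi.items()):
        by_group[group].append(mean(values))
    return dict(by_group)


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (needs >= 2 values per group)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Toy data (hypothetical): 2 groups x 3 ROIs x 3 time points per ROI.
records = []
roi_levels = {"trained": [1.0, 2.0, 3.0], "untrained": [1.5, 2.5, 3.5]}
for group, levels in roi_levels.items():
    for roi_id, level in enumerate(levels):
        for offset in (-0.1, 0.0, 0.1):
            records.append((group, roi_id, level + offset))

roi_means = per_roi_means(records)

# The sample size is now the number of ROIs per group, not the number of time points.
n_trained = len(roi_means["trained"])
d = cohens_d(roi_means["trained"], roi_means["untrained"])
print(f"N per group = {n_trained}, Cohen's d = {d:.2f}")
```

The per-ROI values in `roi_means` are what should be passed to whichever two-sample test is used for the group comparison.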

## Significance vs Effect Size

A tiny p-value does NOT mean a meaningful difference. With N=230K, even a Cohen's d of 0.09 (96% overlap between distributions) gives p < 1e-200. Always report and interpret effect sizes alongside p-values.
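These figures can be checked with a short standard-library sketch. It assumes two normal distributions with equal variance and equal group sizes, and uses the large-sample approximation z ≈ d·sqrt(n/2); the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def overlap_from_d(d):
    """Overlapping coefficient of two unit-variance normals whose means differ by d."""
    return NormalDist(0.0, 1.0).overlap(NormalDist(d, 1.0))


def approx_z(d, n_per_group):
    """Approximate test statistic for a two-group comparison with equal group sizes."""
    return d * sqrt(n_per_group / 2)


d = 0.09
print(f"overlap:                  {overlap_from_d(d):.1%}")          # 96.4%
print(f"z at n=230,000 per group: {approx_z(d, 230_000):.1f}")       # 30.5
print(f"z at n=18 per group:      {approx_z(d, 18):.2f}")            # 0.27
```

The same effect size yields a test statistic above 30 when every time point is counted, but only about 0.27 at the ROI-level sample size of 18 per group, which is nowhere near significance.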
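
## Data Type Mismatches

Machine names are stored as integers in the metadata (76, 145, 268) but as strings in some other contexts; `barrier_opening.csv`, for example, uses the zero-padded "076" format. Always convert both sides to strings with `.astype(str)` before matching, and zero-pad where needed, because a plain string conversion turns 76 into "76", which does not equal "076".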
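A small sketch of one way to normalize the IDs before matching. The helper name `normalize_machine_id` and the fixed width of 3 (inferred from the "076" example) are assumptions.

```python
def normalize_machine_id(value, width=3):
    """Return a zero-padded string ID, e.g. 76 -> "076" and "076" -> "076"."""
    return str(int(value)).zfill(width)


metadata_ids = [76, 145, 268]      # integers, as stored in the metadata
barrier_id = "076"                 # string format used in the barrier opening CSV

# A plain string conversion is not enough: "76" != "076".
print(str(metadata_ids[0]) == barrier_id)                    # False
print(normalize_machine_id(metadata_ids[0]) == barrier_id)   # True

normalized = [normalize_machine_id(m) for m in metadata_ids]
print(normalized)                                            # ['076', '145', '268']
```

In pandas, the equivalent on a column would be `.astype(str).str.zfill(3)`. Reading the CSV with `dtype=str` should also keep the leading zeros from being parsed away.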

## Time Unit Mismatches

- SQLite databases: time `t` is in **milliseconds**
- `2025_07_15_barrier_opening.csv`: `opening_time` is in **seconds**
- Multiply barrier opening times by 1000 to convert them to milliseconds before aligning
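The conversion above can be sketched as follows. The column names `t` and `opening_time` come from the list; the helper names and the sample values are hypothetical.

```python
MS_PER_SECOND = 1000


def opening_time_ms(opening_time_s):
    """Convert a barrier opening time from seconds to milliseconds."""
    return opening_time_s * MS_PER_SECOND


def time_since_opening_ms(t_ms, opening_time_s):
    """Tracking time `t` (ms) expressed relative to the barrier opening."""
    return t_ms - opening_time_ms(opening_time_s)


# Hypothetical values: tracking timestamps in ms, opening time in seconds.
t_ms = [59_000, 60_000, 61_500]
opening_s = 60

aligned = [time_since_opening_ms(t, opening_s) for t in t_ms]
print(aligned)  # [-1000, 0, 1500]
```

Keeping the unit in the variable name (`_ms`, `_s`) makes a forgotten conversion easier to spot in review.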

## Missing Data

Machine 139 has 6 ROIs in the metadata (3 trained, 3 untrained) but:

- No tracking database file exists
- No entry in `barrier_opening.csv`

This reduces the effective N from 18 to 15 per group.
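A check like the following could surface such gaps before any analysis runs. It is only a sketch: the function name is hypothetical, and the ID sets are a small illustrative subset built from machine numbers mentioned in this document, not the full machine list.

```python
def find_incomplete_machines(metadata_ids, database_ids, barrier_ids):
    """Map each machine listed in the metadata to the inputs it is missing."""
    missing = {}
    for machine in sorted(set(metadata_ids)):
        gaps = []
        if machine not in database_ids:
            gaps.append("tracking database")
        if machine not in barrier_ids:
            gaps.append("barrier opening entry")
        if gaps:
            missing[machine] = gaps
    return missing


# Illustrative subset of machine IDs, already normalized to strings.
metadata_ids = {"076", "139", "145", "268"}
database_ids = {"076", "145", "268"}
barrier_ids = {"076", "145", "268"}

incomplete = find_incomplete_machines(metadata_ids, database_ids, barrier_ids)
print(incomplete)  # {'139': ['tracking database', 'barrier opening entry']}
```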

## Single-Fly Detection Handling

When only one fly is detected (instead of two), the tracker reports a single bounding box. If the area of that box is large (>1.5x median two-fly area), it likely means the flies are overlapping (distance ~0). If the area is small, one fly is probably out of frame (distance = NaN, excluded from analysis).
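The decision rule can be sketched as a small function. The 1.5 ratio comes from the text; the function name, the argument names, and the choice to encode "distance ~0" as exactly `0.0` are assumptions.

```python
import math


def single_detection_distance(box_area, median_two_fly_area, ratio=1.5):
    """Distance to assign when the tracker reports one bounding box instead of two.

    Returns 0.0 when the box is large enough to suggest overlapping flies,
    and NaN when it is small (one fly probably out of frame).
    """
    if box_area > ratio * median_two_fly_area:
        return 0.0
    return math.nan


# Hypothetical areas in arbitrary units, with a reference area of 100.
distances = [single_detection_distance(area, 100.0) for area in (200.0, 120.0)]

# NaN entries are dropped before analysis.
usable = [d for d in distances if not math.isnan(d)]
print(usable)  # [0.0]
```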

## Path Management

All scripts use `from config import DATA_PROCESSED, FIGURES, ...` for consistent paths. Notebooks use `Path("..")` relative to the `notebooks/` directory. Never use hardcoded absolute paths.
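A minimal sketch of what such a shared `config.py` could look like, assuming it lives in `scripts/` and that the project root is one level above it. `DATA_PROCESSED` and `FIGURES` are the names imported above; `PROJECT_ROOT`, `DATA_RAW`, `DATA_METADATA`, and the `figures/` location are assumptions.

```python
from pathlib import Path

try:
    # Running as scripts/config.py: the project root is one level above scripts/.
    PROJECT_ROOT = Path(__file__).resolve().parent.parent
except NameError:
    # No __file__ (e.g. inside a notebook): assume the working directory is notebooks/.
    PROJECT_ROOT = Path("..").resolve()

# Every path is derived from PROJECT_ROOT, so nothing is hardcoded per machine.
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures"  # assumed location
```

Because the root is computed from the location of the file rather than the current working directory, scripts behave the same no matter where they are launched from.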