108 of the 508 rows in all_video_info_merged.xlsx were duplicates left
over from merging multiple source spreadsheets: the same (date, machine,
ROI) appeared under two source_date values with otherwise identical
data. The `male` column also mixed 'trained' with several spellings of
naive ('naïve', 'niave', 'naive'); the canonical 'naive' was a minority
at 12/200.
scripts/cleanup_xlsx.py
Idempotent one-off: backs up the xlsx, dedupes preferring the row
whose source_date matches the experiment date, normalises `male`
spellings, strips whitespace from string columns. Re-running on a
clean file is a no-op.
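For reviewers, a minimal sketch of the dedupe and normalise rules described above. This is not the script itself. It works on rows as plain dicts, skips the backup step, and the names `clean_rows`, `MALE_CANONICAL`, and `KEY`, the exact column names, and the tie-break of keeping the first row seen are assumptions made for illustration.

```python
import unicodedata

# Assumed spelling map: only the variants named in this message are listed.
MALE_CANONICAL = {
    "na\u00efve": "naive",  # 'naïve' with a precomposed i-diaeresis
    "niave": "naive",
    "naive": "naive",
    "trained": "trained",
}
KEY = ("date", "machine", "ROI")  # hypothetical column names


def clean_rows(rows):
    """Return deduplicated, normalised copies of rows (a list of dicts)."""
    best = {}
    for row in rows:
        # Strip whitespace from every string cell.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Normalise the `male` spelling; unknown values pass through unchanged.
        male = row.get("male")
        if isinstance(male, str):
            folded = unicodedata.normalize("NFC", male).lower()
            row["male"] = MALE_CANONICAL.get(folded, male)
        key = tuple(row[c] for c in KEY)
        kept = best.get(key)
        # Keep the first row seen, unless a later duplicate is the one whose
        # source_date matches the experiment date.
        if kept is None or (
            kept["source_date"] != kept["date"]
            and row["source_date"] == row["date"]
        ):
            best[key] = row
    return list(best.values())
```

Because the output contains one row per key with canonical values, feeding it back into `clean_rows` returns the same rows, which is the idempotence property claimed above.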
scripts/export_video_db_index.py
New _validate_xlsx() runs first thing in main() and aborts the
export with an actionable error if duplicates or non-canonical
male values are present. Prevents silent regressions when the
xlsx is edited or re-merged in the future.
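A rough sketch of the kind of check the validation performs, again on rows as plain dicts. The function name `validate_rows`, the column names, the choice of ValueError, and the wording of the error message are illustrative assumptions, not the actual body of _validate_xlsx().

```python
from collections import Counter

CANONICAL_MALE = {"naive", "trained"}
KEY = ("date", "machine", "ROI")  # hypothetical column names


def validate_rows(rows):
    """Raise ValueError if rows contain duplicate keys or bad `male` values."""
    counts = Counter(tuple(row[c] for c in KEY) for row in rows)
    dupes = sorted((k for k, n in counts.items() if n > 1), key=str)
    bad = sorted({row["male"] for row in rows} - CANONICAL_MALE, key=str)

    problems = []
    if dupes:
        problems.append(
            f"{len(dupes)} duplicated (date, machine, ROI) keys, e.g. {dupes[0]}"
        )
    if bad:
        problems.append(f"non-canonical male values: {bad}")
    if problems:
        # Point at the fix so the error is actionable (assumed wording).
        raise ValueError(
            "; ".join(problems) + ". Run scripts/cleanup_xlsx.py first."
        )
```

Collecting both kinds of problem before raising means a single failed export reports everything that needs fixing.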
Result: TSV is now 400 rows (was 508), exactly 200 trained / 200
naive, no duplicates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>