cupido

lab/cupido

Commit graph

Author	SHA1	Message	Date

Giorgio Gilestro

53b45e373b

Dedupe + canonicalise the merged xlsx, then guard the export

108 of 508 rows in all_video_info_merged.xlsx were duplicates left over
from merging multiple source spreadsheets — same (date, machine, ROI)
appearing under two source_date values, identical data otherwise. The
`male` column was also using a mix of variants ('naïve', 'niave',
'naive', 'trained') with the canonical 'naive' a minority of 12/200.

scripts/cleanup_xlsx.py
    Idempotent one-off: backs up the xlsx, dedupes preferring the row
    whose source_date matches the experiment date, normalises `male`
    spellings, strips whitespace from string columns. Re-running on a
    clean file is a no-op.
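
The cleanup described above could be sketched roughly as follows. This is not the contents of scripts/cleanup_xlsx.py: it works on rows already loaded as plain dicts, skips the backup and the xlsx I/O, and assumes the column names `date`, `machine`, `roi` and `source_date` (inferred from the message). The helper names and the spelling map are hypothetical.

```python
import unicodedata

# Assumed spelling map: only the variants named in the commit message.
MALE_CANONICAL = {
    "naive": "naive",
    "na\u00efve": "naive",  # 'naïve' with a precomposed i-diaeresis
    "niave": "naive",
    "trained": "trained",
}


def strip_strings(row):
    """Return a copy of the row with whitespace stripped from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def normalise_male(value):
    """Map known spelling variants to the canonical label; leave others as-is."""
    if not isinstance(value, str):
        return value
    key = unicodedata.normalize("NFC", value).strip().lower()
    return MALE_CANONICAL.get(key, value)


def dedupe(rows, key_cols=("date", "machine", "roi")):
    """Keep one row per key, preferring source_date == experiment date."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        kept = best.get(key)
        new_matches = row["source_date"] == row["date"]
        if kept is None or (new_matches and kept["source_date"] != kept["date"]):
            best[key] = row
    return list(best.values())


def cleanup(rows):
    """Strip, normalise, then dedupe. Running it on clean rows changes nothing."""
    cleaned = [strip_strings(r) for r in rows]
    for r in cleaned:
        r["male"] = normalise_male(r["male"])
    return dedupe(cleaned)
```

Because each step maps already-clean input to itself, `cleanup(cleanup(rows)) == cleanup(rows)`, which is the idempotence property the message refers to.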

scripts/export_video_db_index.py
    New _validate_xlsx() runs first thing in main() and aborts the
    export with an actionable error if duplicates or non-canonical
    male values are present. Prevents silent regressions when the
    xlsx is edited or re-merged in the future.
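
A guard of this kind might look like the sketch below. It is an illustration under assumptions, not the actual `_validate_xlsx()`: the function name `validate_rows`, the dict-based row format, the key columns and the wording of the error (including the hint to re-run the cleanup script) are all hypothetical.

```python
from collections import Counter

# Assumed canonical labels, taken from the commit message.
CANONICAL_MALE = {"naive", "trained"}


def validate_rows(rows, key_cols=("date", "machine", "roi")):
    """Raise ValueError if rows contain duplicate keys or odd `male` values."""
    problems = []

    # Count how often each (date, machine, roi) key occurs.
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        problems.append(
            f"{len(dupes)} duplicated {key_cols} key(s), e.g. {dupes[0]}"
        )

    # Collect any `male` value outside the canonical set.
    bad = sorted({row["male"] for row in rows} - CANONICAL_MALE, key=str)
    if bad:
        problems.append(f"non-canonical male value(s): {bad}")

    if problems:
        # Hypothetical hint; the real message may differ.
        raise ValueError(
            "; ".join(problems) + "; run scripts/cleanup_xlsx.py and retry"
        )
```

Calling such a check before any output is written means a bad spreadsheet stops the export instead of producing a TSV with silently wrong row counts.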

Result: TSV is now 400 rows (was 508), exactly 200 trained / 200
naive, no duplicates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-01 13:39:57 +01:00

1 commit