cupido/scripts
Giorgio Gilestro 53b45e373b Dedupe + canonicalise the merged xlsx, then guard the export
108 of 508 rows in all_video_info_merged.xlsx were duplicates left over
from merging multiple source spreadsheets — same (date, machine, ROI)
appearing under two source_date values, identical data otherwise. The
`male` column also mixed several spellings ('naïve', 'niave', 'naive')
alongside 'trained', with the canonical 'naive' a minority at 12/200.

scripts/cleanup_xlsx.py
    Idempotent one-off: backs up the xlsx, dedupes preferring the row
    whose source_date matches the experiment date, normalises `male`
    spellings, strips whitespace from string columns. Re-running on a
    clean file is a no-op.
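
    A minimal sketch of the dedupe-and-normalise step described above,
    not the actual contents of cleanup_xlsx.py. It works on plain dicts
    rather than the xlsx itself, and the column names (`date`,
    `machine`, `roi`, `source_date`), the alias map, and the helper
    names are assumptions made for illustration.

```python
# Sketch only: column names, the alias map and both helpers are
# hypothetical and are not taken from scripts/cleanup_xlsx.py.
import unicodedata

KEY_COLUMNS = ("date", "machine", "roi")  # assumed identity of one row
MALE_ALIASES = {"naive": "naive", "niave": "naive", "trained": "trained"}


def normalise_male(value):
    """Strip accents and whitespace, lower-case, then map known misspellings."""
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    return MALE_ALIASES.get(text, text)


def clean_rows(rows):
    """Return deduplicated, normalised copies of `rows` (a list of dicts)."""
    best = {}  # key -> cleaned row; dicts keep first-insertion order
    for row in rows:
        # Strip whitespace from every string cell, leave other types alone.
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        cleaned["male"] = normalise_male(cleaned["male"])
        key = tuple(cleaned[c] for c in KEY_COLUMNS)
        kept = best.get(key)
        # Keep the first row seen for a key, unless a later duplicate has a
        # source_date equal to the experiment date and the kept row does not.
        if kept is None or (
            cleaned["source_date"] == cleaned["date"]
            and kept["source_date"] != kept["date"]
        ):
            best[key] = cleaned
    return list(best.values())
```

    Because every step maps clean input to itself, calling clean_rows()
    on its own output returns the same rows, which is the property that
    makes a re-run a no-op.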

scripts/export_video_db_index.py
    New _validate_xlsx() runs first thing in main() and aborts the
    export with an actionable error if duplicates or non-canonical
    male values are present. Prevents silent regressions when the
    xlsx is edited or re-merged in the future.
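
    A guard of this kind could look roughly like the sketch below. The
    real _validate_xlsx() signature and error text are not shown in
    this commit message, so find_problems(), validate_or_abort(), the
    column names, and the message wording are all hypothetical.

```python
# Sketch only: function names, column names and message wording are
# hypothetical and are not taken from scripts/export_video_db_index.py.
from collections import Counter

KEY_COLUMNS = ("date", "machine", "roi")  # assumed identity of one row
CANONICAL_MALE = {"naive", "trained"}


def find_problems(rows):
    """Return a list of human-readable problems; empty means the data is clean."""
    problems = []
    counts = Counter(tuple(row[c] for c in KEY_COLUMNS) for row in rows)
    dupes = sorted((key for key, n in counts.items() if n > 1), key=repr)
    if dupes:
        problems.append(
            f"{len(dupes)} duplicated (date, machine, roi) key(s), e.g. {dupes[0]}"
        )
    bad = sorted({str(row["male"]) for row in rows} - CANONICAL_MALE)
    if bad:
        problems.append(f"non-canonical male values: {bad}")
    return problems


def validate_or_abort(rows):
    """Abort before any export work starts if the rows are not clean."""
    problems = find_problems(rows)
    if problems:
        raise SystemExit(
            "xlsx failed validation; run scripts/cleanup_xlsx.py first:\n  "
            + "\n  ".join(problems)
        )
```

    Collecting every problem before aborting, rather than stopping at
    the first one, lets a single failed export report everything that
    needs fixing.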

Result: TSV is now 400 rows (was 508), exactly 200 trained / 200
naive, no duplicates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 13:39:57 +01:00
barrier_picker_app Show experimental metadata above the video in the picker 2026-05-01 12:54:40 +01:00
__init__.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
analyze_distances.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
auto_detect_targets.py Remove hardcoded /home/gg paths so the project is portable 2026-05-01 08:55:44 +01:00
build_video_inventory.py Add video duration_s to inventory and propagate to merged TSV 2026-05-01 11:13:05 +01:00
calculate_distances.py Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync 2026-04-30 15:20:14 +01:00
cleanup_xlsx.py Dedupe + canonicalise the merged xlsx, then guard the export 2026-05-01 13:39:57 +01:00
config.py Remove data/raw/ entirely — all bulky data now under /mnt/data/projects/cupido/ 2026-05-01 09:20:25 +01:00
detect_barrier_opening.py Merge 2025-07-15 batch into the xlsx; tools to detect & re-track 2026-05-01 10:28:25 +01:00
explore_barrier_signal.py Merge 2025-07-15 batch into the xlsx; tools to detect & re-track 2026-05-01 10:28:25 +01:00
export_video_db_index.py Dedupe + canonicalise the merged xlsx, then guard the export 2026-05-01 13:39:57 +01:00
load_roi_data.py Make load_roi_data progress bar refresh reliably in JupyterLab 2026-05-01 09:43:12 +01:00
merge_2025_07_15_into_xlsx.py Merge 2025-07-15 batch into the xlsx; tools to detect & re-track 2026-05-01 10:28:25 +01:00
ml_classification.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
monitor_tracking.py Unify analysis pipeline around the TSV; move tracked DBs out of cloud sync 2026-04-30 15:20:14 +01:00
pick_barrier.py Force interactive matplotlib backend in pick_barrier 2026-05-01 12:23:15 +01:00
pick_targets.py Merge 2025-07-15 batch into the xlsx; tools to detect & re-track 2026-05-01 10:28:25 +01:00
plot_avg_distance_aligned.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
plot_avg_distance_first_200s.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
plot_avg_distance_over_time.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
plot_distance_over_time.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
statistical_tests.py Initial commit: organized project structure for student handoff 2026-03-05 16:08:36 +00:00
track_videos.py Remove hardcoded /home/gg paths so the project is portable 2026-05-01 08:55:44 +01:00
tracking_geometry.py Add offline tracking pipeline for video backlog 2026-04-27 17:25:26 +01:00