Remove data/raw/ entirely — all bulky data now under /mnt/data/projects/cupido/

Deleted the 5 stale pre-pipeline tracking DBs and the data/raw/ directory.
Dropped DATA_RAW from config.py; build_video_inventory now scans
TRACKING_OUTPUT_DIR for already-tracked sessions. Notebooks no longer
import DATA_RAW. README, PLANNING and todo updated to reflect that the
repo holds only code + small curated metadata, never bulky DBs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Giorgio Gilestro 2026-05-01 09:20:25 +01:00
parent 9f3ee24a23
commit 23050360ea
9 changed files with 37 additions and 70 deletions

.gitignore

@@ -1,9 +1,7 @@
# Large data files (reproducible from raw DBs) # Generated CSVs (regenerable from the tracking DBs + the merged TSV)
data/raw/*.db
data/processed/*.csv data/processed/*.csv
# Offline-tracking outputs (regenerable from videos + target JSONs) # Tracking DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
# DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
data/metadata/video_inventory.csv data/metadata/video_inventory.csv
data/logs/*.log data/logs/*.log


@@ -30,14 +30,19 @@ Drosophila behavioral tracking analysis for the Cupido project. Compares social
``` ```
tracking/ tracking/
├── data/raw/ # SQLite DBs (gitignored) ├── data/metadata/ # Small hand-curated CSVs (tracked in git)
├── data/metadata/ # Small CSVs (tracked)
├── data/processed/ # Large generated CSVs (gitignored) ├── data/processed/ # Large generated CSVs (gitignored)
├── data/logs/ # Tracker logs (gitignored)
├── scripts/ # Python scripts with config.py imports ├── scripts/ # Python scripts with config.py imports
├── notebooks/ # Jupyter analysis notebooks ├── notebooks/ # Jupyter analysis notebooks
├── figures/ # Generated plots (gitignored) ├── figures/ # Generated plots (gitignored)
├── docs/ # Scientific documentation ├── docs/ # Scientific documentation
└── tasks/ # Task tracking └── tasks/ # Task tracking
# All bulky data lives outside the repo at /mnt/data/projects/cupido/:
# tracked/ # SQLite tracking DBs
# targets/ # Target-point JSON sidecars
# all_video_info_merged.{xlsx,tsv} # Metadata spreadsheet
``` ```
## Next Direction ## Next Direction


@@ -14,9 +14,11 @@ python -m venv .venv
source .venv/bin/activate source .venv/bin/activate
pip install -r requirements.txt pip install -r requirements.txt
# Get the data files (not in git - ask lab for copies) # Project data lives outside the repo at /mnt/data/projects/cupido/:
# Place .db files in data/raw/ # tracked/ → SQLite tracking DBs
# Place large .csv files in data/processed/ # targets/ → target-point JSONs
# all_video_info_merged.{xlsx,tsv} → metadata spreadsheet
# Generated CSVs land in data/processed/ (gitignored).
# Run the main analysis notebook # Run the main analysis notebook
jupyter notebook notebooks/flies_analysis_simple.ipynb jupyter notebook notebooks/flies_analysis_simple.ipynb
@@ -66,7 +68,7 @@ python scripts/pick_targets.py --redo # re-pick already-picked videos
# 3) batch tracking (idempotent, can run in background) # 3) batch tracking (idempotent, can run in background)
python scripts/track_videos.py --jobs 4 # parallel python scripts/track_videos.py --jobs 4 # parallel
# output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite, same schema as data/raw/) # output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite)
``` ```
See `tasks/todo.md` "Offline Tracking" section for the full plan, and See `tasks/todo.md` "Offline Tracking" section for the full plan, and
@@ -80,9 +82,9 @@ tracking/
├── PLANNING.md # Architecture & conventions ├── PLANNING.md # Architecture & conventions
├── requirements.txt # Python dependencies ├── requirements.txt # Python dependencies
├── data/ ├── data/
│ ├── raw/ # SQLite tracking databases (gitignored) │ ├── metadata/ # Experiment metadata CSVs (small, hand-curated)
│ ├── metadata/ # Experiment metadata CSVs │ ├── processed/ # Generated analysis CSVs (gitignored)
│ └── processed/ # Generated analysis CSVs (gitignored) │ └── logs/ # Tracker logs (gitignored)
├── scripts/ # Python analysis scripts ├── scripts/ # Python analysis scripts
│ ├── config.py # Shared path constants │ ├── config.py # Shared path constants
│ ├── load_roi_data.py # Extract data from DBs │ ├── load_roi_data.py # Extract data from DBs
@@ -107,13 +109,13 @@ tracking/
## Data Pipeline ## Data Pipeline
``` ```
SQLite DBs (data/raw/) SQLite DBs (/mnt/data/projects/cupido/tracked/) + merged TSV
▼ load_roi_data.py / notebook step 1 scripts/load_roi_data.py
ROI CSVs (data/processed/*_roi_data.csv) single DataFrame stamped with experimental metadata
▼ notebook steps 2-4 ▼ notebooks/flies_analysis_simple.ipynb (steps 2-4)
Aligned Distance CSVs (data/processed/*_distances_aligned.csv) Aligned distance CSVs (data/processed/*_distances_aligned.csv)
├──▶ Plots (figures/) ├──▶ Plots (figures/)
├──▶ Statistical tests ├──▶ Statistical tests


@@ -1,37 +0,0 @@
# Raw Data
SQLite databases containing fly tracking data from ethoscope recordings.
## Files
| File | Machine | Session | Size |
|------|---------|---------|------|
| `2025-07-15_16-03-10_076e...tracking.db` | ETHOSCOPE_076 | 16:03:10 | ~6.5MB |
| `2025-07-15_16-03-27_145b...tracking.db` | ETHOSCOPE_145 | 16:03:27 | ~6.1MB |
| `2025-07-15_16-31-34_076e...tracking.db` | ETHOSCOPE_076 | 16:31:34 | ~6.6MB |
| `2025-07-15_16-31-41_145b...tracking.db` | ETHOSCOPE_145 | 16:31:41 | ~6.6MB |
| `2025-07-15_16-32-05_268...tracking.db` | ETHOSCOPE_268 | 16:32:05 | ~7.0MB |
**Note**: Machine 139 has metadata but no tracking database. See `docs/experimental_design.md`.
## Schema
Each database contains tables `ROI_1` through `ROI_6`:
| Column | Type | Description |
|--------|------|-------------|
| `id` | int | Detection ID within frame |
| `t` | int | Time in **milliseconds** from recording start |
| `x` | float | X position in pixels |
| `y` | float | Y position in pixels |
| `w` | float | Bounding box width in pixels |
| `h` | float | Bounding box height in pixels |
| `phi` | float | Orientation angle |
| `is_inferred` | int | Whether position was inferred (0/1) |
| `has_interacted` | int | Whether interaction detected (0/1) |
## Provenance
Data recorded on 2025-07-15 using ethoscope platform.
Resolution: 1920x1088 @ 25fps, H.264 28q quality.
These files are gitignored (binary, ~33MB total).
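
A minimal sketch of reading one ROI table with the schema documented in this deleted README, assuming the DBs under `/mnt/data/projects/cupido/tracked/` keep the same `ROI_1` to `ROI_6` layout (the earlier README wording described them as sharing it). The helper name `read_roi`, the SQLite column types, and the in-memory demo rows are hypothetical; only the column names and the millisecond unit of `t` come from the table above.

```python
import sqlite3

# Column names follow the schema table above; the SQLite types are an assumption.
ROI_SCHEMA = (
    "(id INTEGER, t INTEGER, x REAL, y REAL, w REAL, h REAL, "
    "phi REAL, is_inferred INTEGER, has_interacted INTEGER)"
)


def read_roi(conn: sqlite3.Connection, roi: int) -> list[dict]:
    """Return the detections of one ROI table, with time converted to seconds."""
    if roi not in range(1, 7):
        raise ValueError(f"ROI index must be 1..6, got {roi}")
    # The table name is built from a validated integer, so the f-string is safe here.
    query = f"SELECT t, x, y, is_inferred FROM ROI_{roi} ORDER BY t, id"
    return [
        {"t_s": t / 1000.0, "x": x, "y": y, "is_inferred": bool(inferred)}
        for t, x, y, inferred in conn.execute(query)
    ]


# Demo on an in-memory DB with made-up rows (two frames at 25 fps, 40 ms apart).
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE ROI_1 {ROI_SCHEMA}")
conn.executemany(
    "INSERT INTO ROI_1 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 0, 10.0, 20.0, 5.0, 3.0, 90.0, 0, 0),
        (1, 40, 11.0, 21.0, 5.0, 3.0, 92.0, 1, 0),
    ],
)
samples = read_roi(conn, 1)
for sample in samples:
    print(sample)
```

For a real file, the same helper would take `sqlite3.connect(path_to_db)` instead of the in-memory connection.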


@@ -14,7 +14,7 @@
"execution_count": null, "execution_count": null,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport sqlite3\nimport glob\nimport re\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_RAW, DATA_METADATA, DATA_PROCESSED,\n# FIGURES) from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_RAW, DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n" "source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport sqlite3\nimport glob\nimport re\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. 
It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_METADATA, DATA_PROCESSED, FIGURES)\n# from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n"
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",


@@ -10,7 +10,7 @@
"execution_count": null, "execution_count": null,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_RAW, DATA_METADATA, DATA_PROCESSED,\n# FIGURES) from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_RAW, DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Data directory: {DATA_DIR}\")\nprint(f\"Repo root: {REPO_ROOT}\")\nprint(f\"Metadata TSV: {METADATA_TSV}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version: {np.__version__}\")\n" "source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything 
bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_METADATA, DATA_PROCESSED, FIGURES)\n# from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Data directory: {DATA_DIR}\")\nprint(f\"Repo root: {REPO_ROOT}\")\nprint(f\"Metadata TSV: {METADATA_TSV}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version: {np.__version__}\")\n"
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",


@@ -16,7 +16,7 @@ from pathlib import Path
import pandas as pd import pandas as pd
from config import DATA_RAW, INVENTORY_CSV, VIDEO_INFO_XLSX, VIDEOS_ROOT from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX, VIDEOS_ROOT
SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$") SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$")
@@ -64,14 +64,14 @@ def scan_videos(videos_root: Path) -> pd.DataFrame:
return pd.DataFrame(rows) return pd.DataFrame(rows)
def already_tracked_set(data_raw: Path) -> set[tuple[str, str]]: def already_tracked_set(tracked_dir: Path) -> set[tuple[str, str]]:
"""Return the set of (date, time) sessions for which a tracking DB exists. """Return the set of (date, time) sessions for which a tracking DB exists.
DBs are named like: DBs are named like:
2025-07-15_16-03-10_<uuid>__1920x1088@25fps-28q_merged_tracking.db 2025-07-15_16-03-10_<uuid>__1920x1088@25fps-28q_merged_tracking.db
""" """
out = set() out = set()
for db in data_raw.glob("*_tracking.db"): for db in tracked_dir.glob("*_tracking.db"):
m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name) m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name)
if m: if m:
out.add((m.group(1), m.group(2))) out.add((m.group(1), m.group(2)))
@@ -99,8 +99,8 @@ def main() -> None:
lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1 lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1
) )
# Mark which already have tracking DBs in data/raw/ # Mark which already have tracking DBs in TRACKING_OUTPUT_DIR
tracked = already_tracked_set(DATA_RAW) tracked = already_tracked_set(TRACKING_OUTPUT_DIR)
videos_df["already_tracked"] = videos_df.apply( videos_df["already_tracked"] = videos_df.apply(
lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1 lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1
) )
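
The renamed helper can be exercised in isolation. The sketch below copies `already_tracked_set` from the hunk above, adds the `return out` that the hunk does not show (an assumption), and runs it against a temporary directory standing in for `TRACKING_OUTPUT_DIR`. The file names are made up for illustration.

```python
import re
import tempfile
from pathlib import Path

# Same prefix pattern as in the hunk: <date>_<time>_ at the start of the DB name.
SESSION_PREFIX_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_")


def already_tracked_set(tracked_dir: Path) -> set[tuple[str, str]]:
    """Return the set of (date, time) sessions for which a tracking DB exists."""
    out = set()
    for db in tracked_dir.glob("*_tracking.db"):
        m = SESSION_PREFIX_RE.match(db.name)
        if m:
            out.add((m.group(1), m.group(2)))
    return out  # assumed: the hunk ends before the return statement


with tempfile.TemporaryDirectory() as tmp:
    tracked_dir = Path(tmp)  # stand-in for TRACKING_OUTPUT_DIR
    # One well-formed DB name, one unrelated file, one DB without a session prefix.
    (tracked_dir / "2025-07-15_16-03-10_abc__1920x1088_merged_tracking.db").touch()
    (tracked_dir / "notes.txt").touch()
    (tracked_dir / "misnamed_tracking.db").touch()
    tracked = already_tracked_set(tracked_dir)

print(tracked)
```

Only the file that both matches the `*_tracking.db` glob and starts with a date and time prefix contributes a session key, so stray files in the tracked directory do not mark a video as already tracked.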


@@ -5,7 +5,6 @@ from pathlib import Path
# Where this code repository lives (the directory containing scripts/, notebooks/, ...). # Where this code repository lives (the directory containing scripts/, notebooks/, ...).
PROJECT_ROOT = Path(__file__).resolve().parent.parent PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_METADATA = PROJECT_ROOT / "data" / "metadata" DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed" DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
FIGURES = PROJECT_ROOT / "figures" FIGURES = PROJECT_ROOT / "figures"


@@ -55,14 +55,14 @@ See `docs/bimodal_hypothesis.md` for detailed methodology.
### Recap ### Recap
Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). The DBs in Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). Those
`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template were re-tracked through the unified pipeline and now live at
`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video). `/mnt/data/projects/cupido/tracked/` (no separate `data/raw/` anymore — the
old pre-pipeline copies were deleted on 2026-05-01).
The metadata file `../all_video_info_merged.xlsx` indexes a different set of The metadata file `/mnt/data/projects/cupido/all_video_info_merged.xlsx`
experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines, indexes a different set of experiments: 7 dates from 2024-09-17 → 2024-10-21,
63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked 16 ethoscope machines, 63 unique (date, machine) sessions = 484 ROI-rows.
sessions are in this xlsx — these are fresh recordings to track.**
Inventory: see `data/metadata/video_inventory.csv` (built by Inventory: see `data/metadata/video_inventory.csv` (built by
`scripts/build_video_inventory.py`). `scripts/build_video_inventory.py`).