Remove data/raw/ entirely — all bulky data now under /mnt/data/projects/cupido/
Deleted the 5 stale pre-pipeline tracking DBs and the data/raw/ directory.
Dropped DATA_RAW from config.py; build_video_inventory now scans
TRACKING_OUTPUT_DIR for already-tracked sessions. Notebooks no longer import
DATA_RAW. README, PLANNING and todo updated to reflect that the repo holds
only code + small curated metadata, never bulky DBs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 9f3ee24a23
commit 23050360ea
9 changed files with 37 additions and 70 deletions
6
.gitignore
vendored
@@ -1,9 +1,7 @@
-# Large data files (reproducible from raw DBs)
-data/raw/*.db
+# Generated CSVs (regenerable from the tracking DBs + the merged TSV)
 data/processed/*.csv
 
-# Offline-tracking outputs (regenerable from videos + target JSONs)
-# DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
+# Tracking DBs and target JSONs live outside the repo at /mnt/data/projects/cupido/
 data/metadata/video_inventory.csv
 data/logs/*.log
 

@@ -30,14 +30,19 @@ Drosophila behavioral tracking analysis for the Cupido project. Compares social
 
 ```
 tracking/
-├── data/raw/ # SQLite DBs (gitignored)
-├── data/metadata/ # Small CSVs (tracked)
+├── data/metadata/ # Small hand-curated CSVs (tracked in git)
 ├── data/processed/ # Large generated CSVs (gitignored)
+├── data/logs/ # Tracker logs (gitignored)
 ├── scripts/ # Python scripts with config.py imports
 ├── notebooks/ # Jupyter analysis notebooks
 ├── figures/ # Generated plots (gitignored)
 ├── docs/ # Scientific documentation
 └── tasks/ # Task tracking
+
+# All bulky data lives outside the repo at /mnt/data/projects/cupido/:
+# tracked/ # SQLite tracking DBs
+# targets/ # Target-point JSON sidecars
+# all_video_info_merged.{xlsx,tsv} # Metadata spreadsheet
 ```
 
 ## Next Direction

26
README.md
@@ -14,9 +14,11 @@ python -m venv .venv
 source .venv/bin/activate
 pip install -r requirements.txt
 
-# Get the data files (not in git - ask lab for copies)
-# Place .db files in data/raw/
-# Place large .csv files in data/processed/
+# Project data lives outside the repo at /mnt/data/projects/cupido/:
+# tracked/ → SQLite tracking DBs
+# targets/ → target-point JSONs
+# all_video_info_merged.{xlsx,tsv} → metadata spreadsheet
+# Generated CSVs land in data/processed/ (gitignored).
 
 # Run the main analysis notebook
 jupyter notebook notebooks/flies_analysis_simple.ipynb
@@ -66,7 +68,7 @@ python scripts/pick_targets.py --redo # re-pick already-picked videos
 
 # 3) batch tracking (idempotent, can run in background)
 python scripts/track_videos.py --jobs 4 # parallel
-# output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite, same schema as data/raw/)
+# output → /mnt/data/projects/cupido/tracked/*_tracking.db (SQLite)
 ```
 
 See `tasks/todo.md` "Offline Tracking" section for the full plan, and
@@ -80,9 +82,9 @@ tracking/
 ├── PLANNING.md # Architecture & conventions
 ├── requirements.txt # Python dependencies
 ├── data/
-│ ├── raw/ # SQLite tracking databases (gitignored)
-│ ├── metadata/ # Experiment metadata CSVs
-│ └── processed/ # Generated analysis CSVs (gitignored)
+│ ├── metadata/ # Experiment metadata CSVs (small, hand-curated)
+│ ├── processed/ # Generated analysis CSVs (gitignored)
+│ └── logs/ # Tracker logs (gitignored)
 ├── scripts/ # Python analysis scripts
 │ ├── config.py # Shared path constants
 │ ├── load_roi_data.py # Extract data from DBs
@@ -107,13 +109,13 @@ tracking/
 ## Data Pipeline
 
 ```
-SQLite DBs (data/raw/)
+SQLite DBs (/mnt/data/projects/cupido/tracked/) + merged TSV
 │
-▼ load_roi_data.py / notebook step 1
-ROI CSVs (data/processed/*_roi_data.csv)
+▼ scripts/load_roi_data.py
+single DataFrame stamped with experimental metadata
 │
-▼ notebook steps 2-4
-Aligned Distance CSVs (data/processed/*_distances_aligned.csv)
+▼ notebooks/flies_analysis_simple.ipynb (steps 2–4)
+Aligned distance CSVs (data/processed/*_distances_aligned.csv)
 │
 ├──▶ Plots (figures/)
 ├──▶ Statistical tests

@@ -1,37 +0,0 @@
-# Raw Data
-
-SQLite databases containing fly tracking data from ethoscope recordings.
-
-## Files
-
-| File | Machine | Session | Size |
-|------|---------|---------|------|
-| `2025-07-15_16-03-10_076e...tracking.db` | ETHOSCOPE_076 | 16:03:10 | ~6.5MB |
-| `2025-07-15_16-03-27_145b...tracking.db` | ETHOSCOPE_145 | 16:03:27 | ~6.1MB |
-| `2025-07-15_16-31-34_076e...tracking.db` | ETHOSCOPE_076 | 16:31:34 | ~6.6MB |
-| `2025-07-15_16-31-41_145b...tracking.db` | ETHOSCOPE_145 | 16:31:41 | ~6.6MB |
-| `2025-07-15_16-32-05_268...tracking.db` | ETHOSCOPE_268 | 16:32:05 | ~7.0MB |
-
-**Note**: Machine 139 has metadata but no tracking database. See `docs/experimental_design.md`.
-
-## Schema
-
-Each database contains tables `ROI_1` through `ROI_6`:
-
-| Column | Type | Description |
-|--------|------|-------------|
-| `id` | int | Detection ID within frame |
-| `t` | int | Time in **milliseconds** from recording start |
-| `x` | float | X position in pixels |
-| `y` | float | Y position in pixels |
-| `w` | float | Bounding box width in pixels |
-| `h` | float | Bounding box height in pixels |
-| `phi` | float | Orientation angle |
-| `is_inferred` | int | Whether position was inferred (0/1) |
-| `has_interacted` | int | Whether interaction detected (0/1) |
-
-## Provenance
-
-Data recorded on 2025-07-15 using ethoscope platform.
-Resolution: 1920x1088 @ 25fps, H.264 28q quality.
-These files are gitignored (binary, ~33MB total).

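The schema documented in the removed README above (tables `ROI_1` through `ROI_6`, with `t` in milliseconds) can be illustrated with a small sketch. It builds an in-memory SQLite database rather than opening a real tracking DB, so the sample rows are made up and the `read_roi` helper is hypothetical; only the table and column names come from the schema table. Whether the DBs under `tracked/` keep exactly this schema is an assumption.

```python
import sqlite3

# Column names follow the schema table; the rows inserted below are made-up samples.
SCHEMA = """
CREATE TABLE ROI_1 (
    id INTEGER, t INTEGER, x REAL, y REAL, w REAL, h REAL,
    phi REAL, is_inferred INTEGER, has_interacted INTEGER
)
"""


def read_roi(conn: sqlite3.Connection, roi: int) -> list[dict]:
    """Return the rows of table ROI_<roi> as dicts, adding t_s (time in seconds)."""
    if not 1 <= roi <= 6:
        raise ValueError("roi must be between 1 and 6")
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict by column name
    rows = []
    for record in conn.execute(f"SELECT * FROM ROI_{roi} ORDER BY t"):
        row = dict(record)
        row["t_s"] = row["t"] / 1000.0  # t is stored in milliseconds
        rows.append(row)
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO ROI_1 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 40, 10.0, 20.0, 5.0, 3.0, 90.0, 0, 0),
        (1, 0, 9.5, 19.5, 5.0, 3.0, 85.0, 0, 0),
    ],
)
rows = read_roi(conn, 1)
print([(row["t"], row["t_s"]) for row in rows])
```

Validating `roi` before formatting it into the table name keeps the f-string query safe, since SQLite cannot bind table names as parameters.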
@@ -14,7 +14,7 @@
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport sqlite3\nimport glob\nimport re\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_RAW, DATA_METADATA, DATA_PROCESSED,\n# FIGURES) from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_RAW, DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n"
+"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport sqlite3\nimport glob\nimport re\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_METADATA, DATA_PROCESSED, FIGURES)\n# from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n"
 },
 {
 "cell_type": "markdown",

@@ -10,7 +10,7 @@
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_RAW, DATA_METADATA, DATA_PROCESSED,\n# FIGURES) from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_RAW, DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Data directory: {DATA_DIR}\")\nprint(f\"Repo root: {REPO_ROOT}\")\nprint(f\"Metadata TSV: {METADATA_TSV}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version: {np.__version__}\")\n"
+"source": "import sys\nfrom pathlib import Path\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.spatial.distance import euclidean\nfrom scipy import stats\n\n# ─── Where the data lives ────────────────────────────────────────────────\n# DATA_DIR holds everything bulky/regenerable: the metadata TSV and the\n# tracking SQLite DBs. It's mounted into the container at this fixed path.\n# REPO_ROOT is your checkout of the cupido repo, in your home directory.\n# Path.home() expands to /home/<your-username>, so this works for any\n# user (no hard-coded usernames).\nDATA_DIR = Path(\"/mnt/data/projects/cupido\")\nREPO_ROOT = Path.home() / \"cupido\"\n\nMETADATA_TSV = DATA_DIR / \"all_video_info_merged.tsv\"\nTRACKED_DBS = DATA_DIR / \"tracked\"\n\n# Sanity-check the data location up front so any failure here points at\n# the obvious thing — rather than crashing inside load_roi_data later.\nassert METADATA_TSV.exists(), f\"Metadata TSV not found at {METADATA_TSV}\"\nassert TRACKED_DBS.is_dir(), f\"Tracked-DB directory not found at {TRACKED_DBS}\"\n\n# Pull the in-repo path constants (DATA_METADATA, DATA_PROCESSED, FIGURES)\n# from scripts/config.py — single source of truth.\nsys.path.insert(0, str(REPO_ROOT / \"scripts\"))\nfrom config import DATA_METADATA, DATA_PROCESSED, FIGURES\n\n# Plotting style\nplt.style.use('seaborn-v0_8')\nsns.set_palette(\"husl\")\n\nprint(f\"Data directory: {DATA_DIR}\")\nprint(f\"Repo root: {REPO_ROOT}\")\nprint(f\"Metadata TSV: {METADATA_TSV}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"NumPy version: {np.__version__}\")\n"
 },
 {
 "cell_type": "markdown",

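Both notebook cells above check the data location with `assert` statements, which Python skips when run with `-O`. As a minimal sketch of the same up-front check written with explicit exceptions, the following may be useful as a starting point. The `check_data_layout` helper is hypothetical and not part of the repository; the file and directory names are taken from the notebook cell, and the demo runs against a throwaway directory rather than `/mnt/data/projects/cupido`.

```python
import tempfile
from pathlib import Path


def check_data_layout(data_dir: Path) -> tuple[Path, Path]:
    """Return (metadata_tsv, tracked_dir), raising if either one is missing."""
    metadata_tsv = data_dir / "all_video_info_merged.tsv"
    tracked_dir = data_dir / "tracked"
    if not metadata_tsv.is_file():
        raise FileNotFoundError(f"Metadata TSV not found at {metadata_tsv}")
    if not tracked_dir.is_dir():
        raise NotADirectoryError(f"Tracked-DB directory not found at {tracked_dir}")
    return metadata_tsv, tracked_dir


# Demo on a temporary directory standing in for the real data mount.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # An empty directory should fail on the missing TSV.
    try:
        check_data_layout(root)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

    # Once the expected layout exists, the check returns both paths.
    (root / "all_video_info_merged.tsv").touch()
    (root / "tracked").mkdir()
    tsv, tracked = check_data_layout(root)
    layout_ok = tsv.name == "all_video_info_merged.tsv" and tracked.name == "tracked"

print(missing_detected, layout_ok)
```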
@@ -16,7 +16,7 @@ from pathlib import Path
 
 import pandas as pd
 
-from config import DATA_RAW, INVENTORY_CSV, VIDEO_INFO_XLSX, VIDEOS_ROOT
+from config import INVENTORY_CSV, TRACKING_OUTPUT_DIR, VIDEO_INFO_XLSX, VIDEOS_ROOT
 
 SESSION_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})$")
 
@@ -64,14 +64,14 @@ def scan_videos(videos_root: Path) -> pd.DataFrame:
     return pd.DataFrame(rows)
 
 
-def already_tracked_set(data_raw: Path) -> set[tuple[str, str]]:
+def already_tracked_set(tracked_dir: Path) -> set[tuple[str, str]]:
     """Return the set of (date, time) sessions for which a tracking DB exists.
 
     DBs are named like:
         2025-07-15_16-03-10_<uuid>__1920x1088@25fps-28q_merged_tracking.db
     """
     out = set()
-    for db in data_raw.glob("*_tracking.db"):
+    for db in tracked_dir.glob("*_tracking.db"):
         m = re.match(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_", db.name)
         if m:
             out.add((m.group(1), m.group(2)))
@@ -99,8 +99,8 @@ def main() -> None:
         lambda r: (r["session_date"], r["machine_name"]) in xlsx_keys, axis=1
     )
 
-    # Mark which already have tracking DBs in data/raw/
-    tracked = already_tracked_set(DATA_RAW)
+    # Mark which already have tracking DBs in TRACKING_OUTPUT_DIR
+    tracked = already_tracked_set(TRACKING_OUTPUT_DIR)
     videos_df["already_tracked"] = videos_df.apply(
         lambda r: (r["session_date"], r["session_time"]) in tracked, axis=1
     )

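The renamed `already_tracked_set` above derives `(date, time)` session keys from DB file names. The hunk does not show the end of the function, so the following is only a self-contained sketch of the same idea, run against a temporary directory instead of `TRACKING_OUTPUT_DIR`. The name `tracked_sessions`, the `abc123` stand-in for the UUID, and the other sample file names are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Same prefix pattern as in the diff: <date>_<time>_ at the start of the file name.
DB_NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_")


def tracked_sessions(tracked_dir: Path) -> set[tuple[str, str]]:
    """Collect (date, time) keys from the *_tracking.db files in tracked_dir."""
    out = set()
    for db in tracked_dir.glob("*_tracking.db"):
        m = DB_NAME_RE.match(db.name)
        if m:
            out.add((m.group(1), m.group(2)))
    return out


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Hypothetical file names; "abc123" stands in for the real UUID.
    (tmp_dir / "2025-07-15_16-03-10_abc123__1920x1088@25fps-28q_merged_tracking.db").touch()
    (tmp_dir / "notes_tracking.db").touch()  # matches the glob but has no session prefix
    (tmp_dir / "2025-07-15_16-31-34_abc123.mp4").touch()  # not a tracking DB
    sessions = tracked_sessions(tmp_dir)

print(sessions)
```

Because only the leading date and time are captured, two DBs recorded by different machines at the same second would collapse into one key; whether that matters depends on how `session_time` is populated in the inventory.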
@@ -5,7 +5,6 @@ from pathlib import Path
 
 # Where this code repository lives (the directory containing scripts/, notebooks/, ...).
 PROJECT_ROOT = Path(__file__).resolve().parent.parent
-DATA_RAW = PROJECT_ROOT / "data" / "raw"
 DATA_METADATA = PROJECT_ROOT / "data" / "metadata"
 DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
 FIGURES = PROJECT_ROOT / "figures"

@@ -55,14 +55,14 @@ See `docs/bimodal_hypothesis.md` for detailed methodology.
 
 ### Recap
 
-Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). The DBs in
-`data/raw/` use tracker `ConstrainedMultiFlyTracker` and template
-`HD_Mating_Arena_6_ROIS.json` (2 flies × 6 ROIs per video).
+Tracked so far (5 sessions, all from 2025-07-15, machines 076/145/268). Those
+were re-tracked through the unified pipeline and now live at
+`/mnt/data/projects/cupido/tracked/` (no separate `data/raw/` anymore — the
+old pre-pipeline copies were deleted on 2026-05-01).
 
-The metadata file `../all_video_info_merged.xlsx` indexes a different set of
-experiments: 7 dates from 2024-09-17 → 2024-10-21, 16 ethoscope machines,
-63 unique (date, machine) sessions = 484 ROI-rows. **None of the already-tracked
-sessions are in this xlsx — these are fresh recordings to track.**
+The metadata file `/mnt/data/projects/cupido/all_video_info_merged.xlsx`
+indexes a different set of experiments: 7 dates from 2024-09-17 → 2024-10-21,
+16 ethoscope machines, 63 unique (date, machine) sessions = 484 ROI-rows.
 
 Inventory: see `data/metadata/video_inventory.csv` (built by
 `scripts/build_video_inventory.py`).