Commit 9f93d2a

ryan-williams and claude committed
Add tests for get_freshness_details; update README with new features
Tests: side-effect/fetch freshness details, weekly schedule, naive timestamps.
README: side-effect stages, fetch schedules, directory deps, git-tracked imports, cron install, fix DVC link.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9c171ac commit 9f93d2a

2 files changed: +203 -2 lines changed

README.md

Lines changed: 89 additions & 2 deletions
@@ -4,7 +4,10 @@ DVX is a lightweight wrapper around [DVC] that provides core data versioning wit
 
 - **Parallel pipeline execution** with per-file provenance tracking
 - **Decentralized workflow definitions** - each `.dvc` file contains its computation, deps, and outputs
+- **Side-effect stages** for deploys, posts, and syncs without local file outputs
+- **Fetch schedules** for periodic re-fetch of external data (daily/hourly/cron)
 - **Enhanced diff** with preprocessing pipelines and directory support
+- **Git-tracked imports** with URL provenance for small files
 - **Cache introspection** commands for examining cached data
 - **Performance optimizations** for large repos (batched git lookups, mtime caching)
 
@@ -176,6 +179,9 @@ pip install dvx
 # With S3 support
 pip install dvx[s3]
 
+# With cron schedule support
+pip install dvx[cron]
+
 # With all remote backends
 pip install dvx[all]
 ```
@@ -289,7 +295,8 @@ with Repo() as repo:
 | `remove` | Stop tracking file(s) |
 | `move` | Move tracked file(s) |
 | `import` | Import from another DVC repo |
-| `import-url` | Import from a URL |
+| `import-url` | Import from a URL (`--git` for git-tracked, `-A` for User-Agent) |
+| `update` | Re-fetch imported data from source |
 | `get` | Download without tracking |
 | `get-url` | Download URL without tracking |
 | `shell-integration` | Output shell aliases |
@@ -298,6 +305,10 @@ with Repo() as repo:
 
 ### Added in DVX
 - `dvx run` - Parallel pipeline execution with per-file provenance
+- Side-effect stages - Deploys/syncs modeled as `.dvc` files with no `outs`
+- Fetch schedules - Periodic re-fetch with daily/hourly/weekly/cron staleness
+- Directory dependencies - Git tree SHA tracking for `git_deps`
+- `dvx import-url --git` - Git-tracked imports with URL provenance
 - `dvx diff` preprocessing - Pipe through commands before diffing (with `{}` placeholder)
 - `dvx cache path/md5` - Cache introspection
 - `dvx cat` - View cached files directly
@@ -335,6 +346,82 @@ When adding outputs with dependencies:
 - **Recursive add**: Use `dvx add -r` to auto-add stale deps first
 - **Accurate recording**: Recorded dep hashes always match what was actually used
 
+## Side-Effect Stages
+
+Not all pipeline stages produce local file outputs. Deploys, database imports, Slack posts — these are side effects. DVX models them as `.dvc` files with `meta.computation` but no `outs`:
+
+```yaml
+# www-deploy.dvc
+meta:
+  computation:
+    cmd: wrangler pages deploy www/dist --project-name my-app
+    deps:
+      www/dist/index.html: a1b2c3d4...
+      www/dist/assets/app.js: e5f6a7b8...
+```
+
+- `dvx status` reports stale when dep hashes change
+- `dvx run` executes the command and updates dep hashes
+- No cache push/pull — the `.dvc` file itself is the receipt
+- Side-effect is inferred from no `outs` + having a `cmd` (optionally explicit via `computation.side_effect: true`)
+
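The inference rule and the dep-hash comparison described in the bullets above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DVX's implementation: `is_side_effect` and `changed_deps` are hypothetical helper names, and the dict layout is what the YAML above would look like after parsing.

```python
def is_side_effect(dvc: dict) -> bool:
    """Hypothetical helper: explicit flag, or no `outs` plus a `cmd`."""
    computation = (dvc.get("meta") or {}).get("computation") or {}
    if computation.get("side_effect") is True:
        return True
    return not dvc.get("outs") and bool(computation.get("cmd"))


def changed_deps(recorded: dict, current: dict) -> list:
    """Hypothetical helper: dep paths whose current hash differs from the recorded one."""
    return sorted(path for path, md5 in recorded.items() if current.get(path) != md5)


# Parsed form of a side-effect `.dvc` file (hashes shortened for illustration).
deploy = {
    "meta": {
        "computation": {
            "cmd": "wrangler pages deploy www/dist --project-name my-app",
            "deps": {"www/dist/index.html": "a1b2c3d4"},
        }
    }
}
```

A stage would count as stale whenever `changed_deps` returns a non-empty list for its recorded `deps`.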
+## Fetch Schedules
+
+External data sources change on their own schedule. DVX can track periodic fetches with a `fetch.schedule`:
+
+```yaml
+# data/live-feed.xml.dvc
+outs:
+- md5: abc123...
+  path: live-feed.xml
+meta:
+  computation:
+    cmd: curl -o live-feed.xml https://api.example.com/feed
+    fetch:
+      schedule: daily  # or "hourly", "weekly", "0 15 * * *", "manual"
+      last_run: 2026-04-07T15:10:00Z
+```
+
+- `dvx status` reports stale when `last_run + interval` has elapsed
+- `dvx run` executes the fetch and updates `last_run`
+- If fetched data is identical (same hash), downstream stages stay fresh
+- `"manual"` schedule is never auto-stale — only runs on `dvx run --force`
+- Cron expressions require the optional `croniter` package: `pip install dvx[cron]`
+
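The `last_run + interval` rule for the named schedules can be sketched as follows. This is a minimal illustration, not the project's `is_fetch_due`: the name `fetch_due`, the `INTERVALS` table, and the inclusive boundary (`>=`) are assumptions, and cron expressions are deliberately left out. Treating a timestamp without a timezone as UTC follows the behavior exercised by the tests in this commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed intervals for the named schedules (cron expressions are not handled here).
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
}


def fetch_due(schedule: str, last_run: str, now: Optional[datetime] = None) -> bool:
    """Hypothetical sketch: True once `last_run + interval` has elapsed."""
    if schedule == "manual":
        return False  # never auto-stale
    if schedule not in INTERVALS:
        raise ValueError(f"unsupported schedule in this sketch: {schedule!r}")
    # Normalize a trailing "Z", which fromisoformat rejects before Python 3.11.
    last = datetime.fromisoformat(last_run.replace("Z", "+00:00"))
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)  # naive timestamps are read as UTC
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= last + INTERVALS[schedule]
```

Passing `now` explicitly keeps the check deterministic, which is also how the tests in this commit pin the clock.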
+## Directory Dependencies
+
+Stages can depend on entire directory trees using `git_deps`. DVX uses git tree SHAs, which change when any file in the directory changes:
+
+```yaml
+# bundle.js.dvc
+outs:
+- md5: def456...
+  path: bundle.js
+meta:
+  computation:
+    cmd: cd www && pnpm build
+    git_deps:
+      www/src: abc123tree...           # tree SHA — any file change invalidates
+      www/package.json: def456blob...  # blob SHA — individual file
+```
+
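To see why a tree SHA changes whenever any file under the directory changes, the hashing that git applies to blobs and trees can be reproduced in pure Python. This sketch is only illustrative: it handles a flat directory of regular files (mode `100644`, no subdirectories), the function names are made up for this example, and a real tool would more likely ask git for the SHA than recompute it.

```python
import hashlib


def git_object_sha(kind: str, body: bytes) -> str:
    """SHA-1 of a git object: '<kind> <size>' header, a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_sha(content: bytes) -> str:
    """Blob SHA of a single file's content."""
    return git_object_sha("blob", content)


def flat_tree_sha(files: dict) -> str:
    """Tree SHA for a directory holding only regular files (name -> content bytes)."""
    body = b""
    for name in sorted(files, key=lambda n: n.encode()):
        # Each entry embeds the raw 20-byte blob SHA, so any content change
        # alters the tree body and therefore the tree SHA.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_sha(files[name]))
    return git_object_sha("tree", body)
```

Because each tree entry embeds the blob SHA of its file, editing one file changes the hash of the whole directory, which is the property `git_deps` relies on.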
+## Git-Tracked Imports
+
+For small files from URLs (configs, metadata), use `--git` to track in Git instead of DVC cache:
+
+```bash
+# Import and commit to Git (not DVC cache)
+dvx import-url --git https://example.com/config.json
+
+# With custom User-Agent (persisted for updates)
+dvx import-url --git -A "MyBot/1.0" https://api.example.com/data.json
+
+# Update: re-checks ETag/Last-Modified, re-downloads if changed
+dvx update config.json.dvc
+```
+
+The `.dvc` file stores URL provenance (ETag, Last-Modified, size, User-Agent) so `dvx update` knows how to re-fetch.
+
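One common way to re-check ETag and Last-Modified is an HTTP conditional request. The sketch below builds the request headers from stored provenance. It is an assumption about how such a check could work, not a description of `dvx update`: the helper name `conditional_headers` and the keys `etag`, `last_modified`, and `user_agent` are hypothetical.

```python
def conditional_headers(provenance: dict) -> dict:
    """Hypothetical helper: build HTTP revalidation headers from stored provenance."""
    headers = {}
    if provenance.get("etag"):
        headers["If-None-Match"] = provenance["etag"]
    if provenance.get("last_modified"):
        headers["If-Modified-Since"] = provenance["last_modified"]
    if provenance.get("user_agent"):
        headers["User-Agent"] = provenance["user_agent"]
    return headers


# Example provenance as it might be read back from a `.dvc` file (keys are assumed).
stored = {
    "etag": '"abc123"',
    "last_modified": "Tue, 07 Apr 2026 15:10:00 GMT",
    "user_agent": "MyBot/1.0",
}
request_headers = conditional_headers(stored)
```

A server that honors these headers answers `304 Not Modified` when the resource is unchanged, so the download can be skipped.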
 ## Performance
 
 DVX is optimized for large repos:
@@ -355,4 +442,4 @@ DVX is optimized for large repos:
 
 Apache 2.0
 
-[DVC]: https://github.com/treeverse/dvc
+[DVC]: https://github.com/iterative/dvc

tests/test_run_dvc_files.py

Lines changed: 114 additions & 0 deletions
@@ -9,6 +9,7 @@
 from dvx.run.dvc_files import (
     DVCFileInfo,
     get_dvc_file_path,
+    get_freshness_details,
     is_output_fresh,
     read_dvc_file,
     write_dvc_file,
@@ -783,3 +784,116 @@ def test_directory_git_dep_freshness(git_repo):
     fresh, reason = is_output_fresh(Path("bundle.js"), use_mtime_cache=False)
     assert fresh is False
     assert "git dep changed: src" == reason
+
+
+# =============================================================================
+# get_freshness_details tests for side-effect and fetch
+# =============================================================================
+
+
+def test_freshness_details_side_effect_fresh(tmp_path):
+    """get_freshness_details returns fresh for side-effect with matching deps."""
+    os.chdir(tmp_path)
+
+    dep_dvc = tmp_path / "dist.dvc"
+    with open(dep_dvc, "w") as f:
+        yaml.dump({"outs": [{"md5": "abc123", "size": 100, "path": "dist"}]}, f)
+
+    se_dvc = tmp_path / "deploy.dvc"
+    with open(se_dvc, "w") as f:
+        yaml.dump({
+            "meta": {"computation": {"cmd": "deploy.sh", "deps": {"dist": "abc123"}}}
+        }, f)
+
+    details = get_freshness_details(Path("deploy"), use_mtime_cache=False)
+    assert details.fresh is True
+    assert details.reason == "up-to-date"
+
+
+def test_freshness_details_side_effect_stale(tmp_path):
+    """get_freshness_details returns stale for side-effect with changed deps."""
+    os.chdir(tmp_path)
+
+    dep_dvc = tmp_path / "dist.dvc"
+    with open(dep_dvc, "w") as f:
+        yaml.dump({"outs": [{"md5": "new_hash", "size": 200, "path": "dist"}]}, f)
+
+    se_dvc = tmp_path / "deploy.dvc"
+    with open(se_dvc, "w") as f:
+        yaml.dump({
+            "meta": {"computation": {"cmd": "deploy.sh", "deps": {"dist": "old_hash"}}}
+        }, f)
+
+    details = get_freshness_details(Path("deploy"), use_mtime_cache=False)
+    assert details.fresh is False
+    assert "dep changed: dist" in details.reason
+    assert details.changed_deps is not None
+    assert "dist" in details.changed_deps
+
+
+def test_freshness_details_fetch_due(tmp_path):
+    """get_freshness_details returns stale when fetch schedule is due."""
+    os.chdir(tmp_path)
+
+    output = tmp_path / "data.xml"
+    output.write_text("<data/>")
+
+    write_dvc_file(
+        output_path=output,
+        md5="abc123",
+        size=7,
+        cmd="fetch-data",
+        fetch_schedule="daily",
+        fetch_last_run="2020-01-01T00:00:00Z",  # Long ago → due
+    )
+
+    details = get_freshness_details(Path("data.xml"), use_mtime_cache=False)
+    assert details.fresh is False
+    assert details.reason == "fetch schedule due"
+
+
+def test_freshness_details_fetch_not_due(tmp_path):
+    """get_freshness_details returns fresh when fetch not due and hash matches."""
+    os.chdir(tmp_path)
+
+    output = tmp_path / "data.xml"
+    output.write_text("<data/>")
+
+    from dvx.run.hash import compute_md5
+    md5 = compute_md5(output)
+
+    write_dvc_file(
+        output_path=output,
+        md5=md5,
+        size=output.stat().st_size,
+        cmd="fetch-data",
+        fetch_schedule="daily",
+        fetch_last_run="2099-01-01T00:00:00Z",
+    )
+
+    details = get_freshness_details(Path("data.xml"), use_mtime_cache=False)
+    assert details.fresh is True
+
+
+def test_is_fetch_due_weekly():
+    """Weekly schedule: due after 7 days, not before."""
+    from datetime import datetime, timezone
+
+    from dvx.run.dvc_files import is_fetch_due
+
+    last = "2026-04-01T12:00:00+00:00"
+    # 6 days later → not due
+    assert is_fetch_due("weekly", last, now=datetime(2026, 4, 7, 12, 0, 0, tzinfo=timezone.utc)) is False
+    # 8 days later → due
+    assert is_fetch_due("weekly", last, now=datetime(2026, 4, 9, 12, 0, 0, tzinfo=timezone.utc)) is True
+
+
+def test_is_fetch_due_naive_last_run():
+    """last_run without timezone is treated as UTC."""
+    from datetime import datetime, timezone
+
+    from dvx.run.dvc_files import is_fetch_due
+
+    last = "2026-04-07T12:00:00"  # No timezone
+    now = datetime(2026, 4, 8, 13, 0, 0, tzinfo=timezone.utc)
+    assert is_fetch_due("daily", last, now=now) is True
