Commit 28c6b9a

ryan-williams and claude committed
Add -A/--user-agent to dvx import-url --git
Persists `user_agent` in `.dvc` dep so `dvx update` reuses it for HEAD/GET requests. Needed for sites with bot protection (e.g. Cloudflare).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8716ca7 commit 28c6b9a

3 files changed: +133 -19 lines changed
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Configurable User-Agent for HTTP imports

## Context

`dvx import-url --git` sends `User-Agent: dvx/0.1`, which sites with bot protection reject with HTTP 403 (e.g. `njsp.njoag.gov`, which uses Cloudflare). A browser-like User-Agent works fine via `curl -H "User-Agent: Mozilla/5.0 ..."`.

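The header swap described in the context can be sketched with the standard library alone. `build_request` is a hypothetical helper, not part of dvx, and the example URL is a placeholder; the snippet only constructs the request and never contacts the network.

```python
from __future__ import annotations

from urllib.request import Request

DEFAULT_USER_AGENT = "dvx/0.1"  # the default value named in this spec


def build_request(url: str, user_agent: str | None = None, method: str = "GET") -> Request:
    """Hypothetical helper: build a request that carries an explicit User-Agent."""
    ua = user_agent or DEFAULT_USER_AGENT
    return Request(url, method=method, headers={"User-Agent": ua})


# Placeholder URL; nothing is sent until the request is passed to urlopen().
req = build_request("https://example.com/data.xlsx", "Mozilla/5.0", method="HEAD")

# urllib stores header names via str.capitalize(), so the lookup key is "User-agent".
print(req.get_method(), req.get_header("User-agent"))
```

Passing the resulting object to `urlopen` would issue the HEAD or GET request with that header.
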
## Proposed behavior

### 1. CLI flag: `--user-agent` / `-A`

```bash
dvx import-url --git -A "Mozilla/5.0" \
  https://njsp.njoag.gov/.../2024-UCR.xlsx \
  -o crime/2024-UCR.xlsx
```

### 2. Stored in `.dvc` file

The User-Agent is needed for subsequent `dvx update` calls too, so persist it in the `.dvc` deps:

```yaml
deps:
- path: https://njsp.njoag.gov/.../2024-UCR.xlsx
  checksum: '"etag"'
  size: 204114
  mtime: '2026-02-24T00:00:00+00:00'
  user_agent: 'Mozilla/5.0 (compatible; dvx/0.1)'
outs:
- md5: e2154bc8...
  path: 2024-UCR.xlsx
meta:
  git_tracked: true
```

`dvx update` reads `user_agent` from the dep and uses it for HEAD/GET requests.

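The persist-then-read-back round trip can be illustrated with plain dictionaries standing in for the parsed `.dvc` YAML. This is a minimal sketch: `build_dep` and `stored_user_agent` are hypothetical names chosen for illustration, and the URL is a placeholder.

```python
from __future__ import annotations


def build_dep(url: str, user_agent: str | None = None) -> dict:
    """Build a dep entry; `user_agent` is written only when one was supplied."""
    dep: dict = {"path": url}
    if user_agent:
        dep["user_agent"] = user_agent
    return dep


def stored_user_agent(dvc_data: dict) -> str | None:
    """Return the User-Agent persisted on the first dep, or None if absent."""
    return dvc_data["deps"][0].get("user_agent")


# Dictionaries shaped like the parsed contents of a .dvc file.
with_ua = {"deps": [build_dep("https://example.com/a.xlsx", "Mozilla/5.0")]}
without_ua = {"deps": [build_dep("https://example.com/a.xlsx")]}

print(stored_user_agent(with_ua), stored_user_agent(without_ua))
```

Omitting the key when no User-Agent was given keeps existing `.dvc` files unchanged for imports that never needed an override.
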
### 3. Global config fallback

```bash
dvx config http.user_agent "Mozilla/5.0 (compatible; dvx/0.1)"
```

Stored in `.dvc/config` (or `.dvc/config.local`). Per-dep `user_agent` in `.dvc` overrides the global config.

### 4. Default

Keep `dvx/0.1` as the default (honest about what we are). Only override when needed.

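Taken together, items 2 to 4 imply a three-level precedence: the per-dep value, then the global config value, then the built-in default. A minimal sketch of that lookup follows; `resolve_user_agent` is a hypothetical name, and the config value is passed in as a plain argument rather than read from `.dvc/config`.

```python
from __future__ import annotations

DEFAULT_USER_AGENT = "dvx/0.1"


def resolve_user_agent(dep_ua: str | None, config_ua: str | None) -> str:
    """Pick the first non-empty value: per-dep, then global config, then default."""
    return dep_ua or config_ua or DEFAULT_USER_AGENT


print(resolve_user_agent("Mozilla/5.0", "cfg-agent/1.0"))  # per-dep value wins
print(resolve_user_agent(None, "cfg-agent/1.0"))           # falls back to config
print(resolve_user_agent(None, None))                      # falls back to default
```

Using `or` treats an empty string the same as a missing value, which matches the `user_agent or DEFAULT_USER_AGENT` pattern used in `git_import.py`.
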
## Implementation

### `src/dvx/git_import.py`

```python
def _get_headers(user_agent: str | None = None) -> dict:
    ua = user_agent or dvc_config_get("http.user_agent", "dvx/0.1")
    return {"User-Agent": ua}

def git_import_url(url, out, no_download=False, user_agent=None):
    headers = _get_headers(user_agent)
    req = Request(url, headers=headers)
    # ... download ...
    # Store user_agent in dep if non-default
    dep_info = {"path": url, ...}
    if user_agent:
        dep_info["user_agent"] = user_agent

def update_git_import(dvc_path, no_download=False):
    # Read user_agent from existing dep
    dep = load_dvc(dvc_path)["deps"][0]
    user_agent = dep.get("user_agent")
    headers = _get_headers(user_agent)
    # ... HEAD/GET with headers ...
```

### `src/dvx/cli/external.py`

Add `-A`/`--user-agent` option to `import-url` and `update`.

### DVC core (`dvc_data`)

For non-git-tracked HTTP imports (regular `import-url`), DVC uses fsspec's `HTTPFileSystem`. User-Agent can be passed via `client_kwargs`:

```python
fs = HTTPFileSystem(client_kwargs={"headers": {"User-Agent": ua}})
```

This is already configurable via `--fs-config` but that's verbose. The `http.user_agent` config key would be a convenience.

## Scope

Minimal: just `git_import.py` + CLI flag + `.dvc` persistence. The global config and fsspec integration can follow.

## Implementation status

Items 1, 2, 4 done:
- `-A`/`--user-agent` flag on `dvx import-url` (passed through to `git_import_url()`)
- `user_agent` persisted in `.dvc` dep; `update_git_import()` reads it back for subsequent requests
- Default remains `dvx/0.1`

Item 3 (global config fallback) deferred; not needed for the immediate use case.

src/dvx/cli/external.py

Lines changed: 4 additions & 2 deletions
@@ -27,25 +27,27 @@ def import_cmd(url, path, out, rev):
 
 @click.command("import-url")
 @click.argument("url")
+@click.option("-A", "--user-agent", help="Custom User-Agent header (persisted for updates).")
 @click.option("-F", "--fs-config", multiple=True, help="Filesystem config (key=value).")
 @click.option("-G", "--git", is_flag=True, help="Track in Git (not DVC cache). For small files.")
 @click.option("-N", "--no-download", is_flag=True, help="Track metadata only (no download).")
 @click.option("-o", "--out", help="Output path.")
 @click.option("-V", "--version-aware", is_flag=True, help="Track S3 version IDs.")
-def import_url(url, fs_config, git, no_download, out, version_aware):
+def import_url(url, user_agent, fs_config, git, no_download, out, version_aware):
     """Import a file from a URL.
 
     Use --git to commit the file to Git (instead of DVC cache) with URL
     provenance. Good for small files (< 1MB) you want in the repo.
 
     Use --no-download to track metadata (ETag, size) without downloading.
     Use --fs-config allow_anonymous_login=true for public buckets.
+    Use --user-agent to set a custom User-Agent (needed for some sites).
     """
     if git:
         from dvx.git_import import git_import_url
 
         try:
-            dvc_path = git_import_url(url=url, out=out, no_download=no_download)
+            dvc_path = git_import_url(url=url, out=out, no_download=no_download, user_agent=user_agent)
             action = "Tracked" if no_download else "Imported"
             click.echo(f"{action} {url} (git-tracked)")
             click.echo(f" {dvc_path}")

src/dvx/git_import.py

Lines changed: 29 additions & 17 deletions
@@ -13,6 +13,8 @@
 
 import yaml
 
+DEFAULT_USER_AGENT = "dvx/0.1"
+
 
 def _default_out(url: str) -> str:
     """Derive output filename from URL path."""
@@ -23,12 +25,17 @@ def _default_out(url: str) -> str:
     return name
 
 
-def _download(url: str, out: Path) -> tuple[str, int, dict[str, str]]:
+def _download(
+    url: str,
+    out: Path,
+    user_agent: str | None = None,
+) -> tuple[str, int, dict[str, str]]:
     """Download URL to `out`, returning (md5, size, http_headers).
 
     Headers dict includes ETag, Last-Modified, Content-Length if present.
     """
-    req = Request(url, headers={"User-Agent": "dvx/0.1"})
+    ua = user_agent or DEFAULT_USER_AGENT
+    req = Request(url, headers={"User-Agent": ua})
     with urlopen(req) as resp:  # noqa: S310
         data = resp.read()
         headers = {
@@ -44,9 +51,10 @@ def _download(url: str, out: Path) -> tuple[str, int, dict[str, str]]:
     return md5, len(data), headers
 
 
-def _head_metadata(url: str) -> dict[str, str]:
+def _head_metadata(url: str, user_agent: str | None = None) -> dict[str, str]:
     """HEAD request to get ETag/Last-Modified without downloading."""
-    req = Request(url, method="HEAD", headers={"User-Agent": "dvx/0.1"})
+    ua = user_agent or DEFAULT_USER_AGENT
+    req = Request(url, method="HEAD", headers={"User-Agent": ua})
     with urlopen(req) as resp:  # noqa: S310
         return {
             k: resp.headers[k]
@@ -69,6 +77,7 @@ def _build_dvc_data(
     size: int,
     headers: dict[str, str],
     out_name: str,
+    user_agent: str | None = None,
 ) -> dict:
     """Build the .dvc YAML structure for a git-tracked import."""
     dep: dict = {"path": url}
@@ -78,6 +87,8 @@ def _build_dvc_data(
         dep["size"] = size
     if "Last-Modified" in headers:
         dep["mtime"] = _parse_last_modified(headers["Last-Modified"])
+    if user_agent:
+        dep["user_agent"] = user_agent
 
     out_entry = {
         "md5": md5,
@@ -101,6 +112,7 @@ def git_import_url(
     url: str,
     out: str | None = None,
     no_download: bool = False,
+    user_agent: str | None = None,
 ) -> Path:
     """Import a URL as a git-tracked file with DVX provenance.
 
@@ -111,22 +123,21 @@ def git_import_url(
         url: HTTP(S) URL to import.
         out: Output path (default: derived from URL filename).
         no_download: If True, only create .dvc with metadata (HEAD request).
+        user_agent: Custom User-Agent header (persisted in .dvc for updates).
 
     Returns:
         Path to the created .dvc file.
     """
     out_path = Path(out or _default_out(url))
 
     if no_download:
-        headers = _head_metadata(url)
+        headers = _head_metadata(url, user_agent=user_agent)
         size = int(headers.get("Content-Length", 0))
-        # No file to hash; leave md5 empty
-        dvc_data = _build_dvc_data(url, "", size, headers, out_path.name)
-        # Remove empty md5 from outs
+        dvc_data = _build_dvc_data(url, "", size, headers, out_path.name, user_agent=user_agent)
         del dvc_data["outs"][0]["md5"]
     else:
-        md5, size, headers = _download(url, out_path)
-        dvc_data = _build_dvc_data(url, md5, size, headers, out_path.name)
+        md5, size, headers = _download(url, out_path, user_agent=user_agent)
+        dvc_data = _build_dvc_data(url, md5, size, headers, out_path.name, user_agent=user_agent)
 
     dvc_path = Path(str(out_path) + ".dvc")
     dvc_path.parent.mkdir(parents=True, exist_ok=True)
@@ -178,29 +189,30 @@ def update_git_import(dvc_path: Path, no_download: bool = False) -> bool:
     if not meta.get("git_tracked"):
         return False
 
-    url = data["deps"][0]["path"]
-    old_checksum = data["deps"][0].get("checksum")
+    dep = data["deps"][0]
+    url = dep["path"]
+    old_checksum = dep.get("checksum")
+    user_agent = dep.get("user_agent")
     out_name = data["outs"][0]["path"]
     out_path = dvc_path.parent / out_name
 
     if no_download:
-        headers = _head_metadata(url)
+        headers = _head_metadata(url, user_agent=user_agent)
         new_checksum = headers.get("ETag")
         if new_checksum and new_checksum == old_checksum:
             return False
         size = int(headers.get("Content-Length", 0))
-        new_data = _build_dvc_data(url, "", size, headers, out_name)
+        new_data = _build_dvc_data(url, "", size, headers, out_name, user_agent=user_agent)
         if "md5" in data["outs"][0]:
             new_data["outs"][0]["md5"] = data["outs"][0]["md5"]
         else:
             del new_data["outs"][0]["md5"]
     else:
-        md5, size, headers = _download(url, out_path)
+        md5, size, headers = _download(url, out_path, user_agent=user_agent)
         new_checksum = headers.get("ETag")
         if new_checksum and new_checksum == old_checksum:
-            # ETag unchanged but we already downloaded; keep the new file
             pass
-        new_data = _build_dvc_data(url, md5, size, headers, out_name)
+        new_data = _build_dvc_data(url, md5, size, headers, out_name, user_agent=user_agent)
 
     with open(dvc_path, "w") as f:
         yaml.dump(new_data, f, sort_keys=False, default_flow_style=False)
