Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191
Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191RajdeepKushwaha5 wants to merge 9 commits intoopen-metadata:mainfrom
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
1 similar comment
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR prevents ingestion crashes when users mistakenly enter a full URL (e.g. http://localhost:3306) in hostPort by introducing a shared sanitization helper and applying it before hostPort parsing/URL construction, along with added test coverage.
Changes:
- Added
clean_host_port()utility to strip URL schemes (and trailing slashes) before parsinghostPort. - Updated multiple connectors and shared builders to call
clean_host_port()before splitting/usinghostPort. - Added/extended unit tests to cover scheme-prefixed
hostPortinputs and connection URL generation.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/utils/db_utils.py | Adds clean_host_port() and uses it in get_host_from_host_port() to sanitize scheme-prefixed inputs. |
| ingestion/src/metadata/ingestion/connections/builders.py | Applies clean_host_port() when generating DB URLs and when splitting host/port for IAM token generation. |
| ingestion/src/metadata/ingestion/source/database/cassandra/connection.py | Uses clean_host_port() before splitting Cassandra hostPort. |
| ingestion/src/metadata/ingestion/source/database/db2/connection.py | Uses clean_host_port() for DB2 host extraction and port parsing. |
| ingestion/src/metadata/ingestion/source/database/databricks/client.py | Uses clean_host_port() to derive the Databricks workspace base host. |
| ingestion/src/metadata/ingestion/source/database/databricks/auth.py | Uses clean_host_port() when building Databricks SDK host for auth flows. |
| ingestion/src/metadata/ingestion/source/database/redshift/connection.py | Uses clean_host_port() when deriving host for IAM flow type detection. |
| ingestion/tests/unit/test_db_utils.py | Adds tests for clean_host_port() and scheme-prefixed get_host_from_host_port() inputs. |
| ingestion/tests/unit/test_build_connection_url.py | Adds integration-style tests ensuring MySQL/Postgres URL building works with scheme-prefixed hostPort. |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
8cec367 to
fcfd026
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
fcfd026 to
0b955fb
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
0b955fb to
6a779c3
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
ingestion/src/metadata/ingestion/source/database/db2/connection.py:90
- In
_get_ibmi_connection_args(),port_str = host_port.split(":")[1]is fragile: it will fail for bracketed IPv6 ([::1]:50000) and will also misbehave ifclean_host_port()ever returns an unbracketed IPv6 host (e.g. fromhttp://[::1]:50000). Prefer splitting from the right once (e.g.rsplit(':', 1)) and/or explicitly handling[...]:portto reliably extract the port.
args = get_connection_args_common(connection)
host_port = clean_host_port(connection.hostPort)
if ":" in host_port:
port_str = host_port.split(":")[1]
try:
args["port"] = int(port_str)
|
ping @harshach |
| url = f"{connection.scheme.value}://" | ||
| url += f"{quote_plus(username)}:{quote_plus(password)}@" | ||
| url += connection.hostPort | ||
| url += clean_host_port(connection.hostPort) |
There was a problem hiding this comment.
what happens to other connectors? why are we making changes only to few connectors that to in a connection.py?
6a779c3 to
8ed98aa
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
@harshach thanks for the feedback, you're absolutely right. Patching individual connector files was the wrong approach. I've reworked this completely. The fix is now centralised in What changed:
The diff is now 4 files, 145 additions, 6 deletions (down from 9 files). Verified locally: |
| return ( | ||
| <SelectContext.Provider value={{ size }}> | ||
| <SelectContext.Provider value={{ fontSize, size }}> | ||
| <AriaComboBox menuTrigger="focus" {...otherProps}> |
There was a problem hiding this comment.
The PR title/description focus on ingestion hostPort sanitization and claim 4 ingestion files changed, but this PR also includes UI changes in openmetadata-ui-core-components (e.g., adding/passing fontSize through SelectContext). Please either update the PR description to account for these UI changes or split them into a separate PR to keep the scope aligned with the stated issue fix.
…Autocomplete Unblocks CI on this PR. main is currently failing tsc --noEmit with TS2741 because open-metadata#27379 made fontSize a required field on SelectContext but only updated select.tsx. combobox.tsx and autocomplete.tsx also provide that context and were missed, cascading into all py-tests and Playwright failures on every open PR. Mirrors the fix in PR open-metadata#27435.
The broad 'except Exception' in model_post_init() was swallowing the intentional ValueError that clean_host_port raises for invalid ports (e.g. 'http://localhost:abc'), leaving a malformed hostPort on the connection object and turning a clear user error into an obscure downstream failure. The local import of clean_host_port is from a sibling module that always exists, so the try/except was unnecessary defence. Removed it so pydantic surfaces validation errors where they belong.
c654b9e to
31c851f
Compare
…r import The model_post_init hook added in the previous commit imported `clean_host_port` from `metadata.utils.db_utils` at runtime. That module has a heavy dependency graph (OpenMetadata client, LineageParser, generated schemas), so every *Connection class construction would drag it in and cause circular-import failures during integration test bootstrap. The file docstring explicitly requires BaseModel to be 'self-sufficient with only pydantic at import time', so the hostPort sanitisation is now handled by a stdlib-only helper (_strip_hostport_scheme) colocated with BaseModel. `clean_host_port` remains the public API in db_utils and delegates to the same helper, preserving behaviour for all existing callers and unit tests.
|
|
||
| # HTTP prefix is stripped | ||
| self.assertEqual(clean_host_port("http://localhost:3306"), "localhost:3306") | ||
| self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080") |
There was a problem hiding this comment.
New assertions in this test exceed Black's default formatting width and will cause make py_format_check (black --check) to fail. Please run make py_format (or manually wrap these calls) so Black can reformat them consistently.
| self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080") | |
| self.assertEqual( | |
| clean_host_port("http://example.com:8080"), "example.com:8080" | |
| ) |
| get_lineage_by_query, | ||
| get_lineage_via_table_entity, | ||
| ) | ||
| from metadata.ingestion.models.custom_pydantic import _strip_hostport_scheme |
There was a problem hiding this comment.
db_utils.py is importing _strip_hostport_scheme (a private underscored helper) from custom_pydantic.py. Since this is now used cross-module, consider making it a public helper (e.g., strip_hostport_scheme) or relocating the implementation to db_utils.py to avoid a fragile dependency on a private symbol.
| export const ComboBox = ({ | ||
| placeholder = 'Search', | ||
| shortcut = true, | ||
| size = 'sm', | ||
| fontSize = 'md', |
There was a problem hiding this comment.
The PR description says this change is limited to 4 ingestion files, but this PR also modifies UI core components (openmetadata-ui-core-components/...). Please either update the PR description to reflect the UI changes (and why they are needed for this fix) or split the UI changes into a separate PR to keep the scope focused.
| Raises ValueError if the scheme carries a non-numeric port so the user | ||
| gets a clear error instead of a silently broken hostPort. | ||
| """ |
There was a problem hiding this comment.
The _strip_hostport_scheme docstring says it “Raises ValueError if the scheme carries a non-numeric port”, but the JDBC-style not hostname fallback returns the stripped tail without port validation. Please either validate the port in that branch as well, or clarify the docstring to match the behavior.
| for sep in ("/", "?", "#"): | ||
| tail = tail.split(sep, 1)[0] | ||
| if "@" in tail: | ||
| tail = tail.rsplit("@", 1)[-1] |
There was a problem hiding this comment.
In the not hostname fallback branch, the function returns the stripped tail without validating the port component. For inputs like jdbc:postgresql://host:abc/db, this will return host:abc and the original int(port) failure can still happen later, which undermines the goal of producing a clearer error for non-numeric ports (and also makes the docstring claim about raising on non-numeric ports incomplete). Consider parsing tail for a host:port pattern and raising a ValueError when the port is present but non-numeric.
| tail = tail.rsplit("@", 1)[-1] | |
| tail = tail.rsplit("@", 1)[-1] | |
| fallback_port = None | |
| if tail.startswith("["): | |
| closing_bracket = tail.find("]") | |
| if closing_bracket != -1 and len(tail) > closing_bracket + 1: | |
| if tail[closing_bracket + 1] == ":": | |
| fallback_port = tail[closing_bracket + 2 :] | |
| elif ":" in tail: | |
| _, fallback_port = tail.rsplit(":", 1) | |
| if fallback_port and not fallback_port.isdigit(): | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) |
| Post-init hook for Connection classes: | ||
| - Sanitises ``hostPort`` by stripping accidental URL scheme prefixes. | ||
| - Converts raw ``dict`` values into ``FilterPattern`` objects. | ||
| """ |
There was a problem hiding this comment.
The PR description includes an auto-generated “Summary by Gitar” referencing UI changes (e.g., fontSize in AutocompleteBase/ComboBox), but this PR’s diff only modifies ingestion framework Python files. Please update/remove that summary to avoid a mismatch between the PR description and the actual changes being reviewed.
- Renames _strip_hostport_scheme to public strip_hostport_scheme so cross-module imports (db_utils.py) no longer depend on a private symbol; private alias kept for back-compat. - Validates the port in the JDBC-style fallback branch so inputs like 'jdbc:postgresql://host:abc/db' raise ValueError with a clear message, matching the docstring contract instead of silently passing a broken hostPort downstream. - Extends test_clean_host_port with the new fallback raise cases.
Code Review ✅ Approved 5 resolved / 5 findingsRefactors hostPort parsing logic to handle IPv6 brackets and missing ports correctly, resolving crashes in Cassandra, JDBC, and model initialization. The implementation now safely strips URL schemes and preserves credentials without relying on overly broad error handling. ✅ 5 resolved✅ Edge Case: Cassandra split(":") crashes when hostPort has no port
✅ Edge Case: JDBC fallback path preserves userinfo credentials in output
✅ Edge Case: Unguarded clean_host_port in model_post_init can crash model construction
✅ Edge Case: clean_host_port drops IPv6 brackets, breaking downstream splits
✅ Edge Case: Broad except swallows intentional ValueError for invalid ports
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Describe your changes:
Fixes #24348
When users enter a full URL like
http://localhost:3306orhttps://mydb.example.com:5432in thehostPortconnection field instead of justlocalhost:3306, the ingestion framework crashes with an unhelpfulValueError: invalid literal for int() with base 10.Root cause: Code throughout the ingestion framework calls
hostPort.split(":")expecting at most 2 parts (host:port), but a URL likehttp://localhost:3306splits into 3 parts (["http", "//localhost", "3306"]), causing either unpacking errors orint("//localhost")failures.Fix: Centralised
hostPortsanitisation inBaseModel.model_post_init()(custom_pydantic.py), the base class for every generated Pydantic model. When any*Connectionclass is constructed with ahostPortcontaining://, the scheme prefix is automatically stripped before any connector code ever accesses it.This covers all 38+ connectors in a single place — no per-connector patches needed, and future connectors are automatically protected.
Changes (4 files):
custom_pydantic.py— AddedhostPortsanitisation tomodel_post_init()(~15 lines)db_utils.py— Addedclean_host_port()utility and updatedget_host_from_host_port()to use it (defence-in-depth)test_db_utils.py— Unit tests forclean_host_port()(15+ assertions) and extendedget_host_from_host_port()teststest_build_connection_url.py— Integration tests for MySQL and Postgres with scheme-prefixedhostPortHow I tested:
CassandraConnection(hostPort='http://localhost:9042')→hostPortbecomes'localhost:9042'✓hostPortvalues pass through unchanged ✓http://localhost:abc,jdbc:postgresql://host:abc/db,jdbc:postgresql://[::1]:abc) raiseValueError✓py_format_check(Black) clean ✓Type of change:
Checklist:
Fixes <issue-number>: <short explanation>