Skip to content
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 66 additions & 8 deletions ingestion/src/metadata/ingestion/models/custom_pydantic.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
import json
import logging
from typing import Any, Callable, Dict, Literal, Optional, Union
from urllib.parse import urlparse

from pydantic import BaseModel as PydanticBaseModel
from pydantic import WrapSerializer, model_validator
Expand All @@ -34,6 +35,55 @@
JSON_ENCODERS = "json_encoders"


def _strip_hostport_scheme(raw: str) -> str:
"""
Strip an accidental URL scheme from a hostPort string.

Self-contained helper that depends only on the standard library so
``model_post_init`` never drags heavy ``metadata.*`` imports into the
bootstrap path of every generated Connection class.

Raises ValueError if the scheme carries a non-numeric port so the user
gets a clear error instead of a silently broken hostPort.
"""
Comment on lines +46 to +50
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _strip_hostport_scheme docstring says it “Raises ValueError if the scheme carries a non-numeric port”, but the JDBC-style not hostname fallback returns the stripped tail without port validation. Please either validate the port in that branch as well, or clarify the docstring to match the behavior.

Copilot uses AI. Check for mistakes.
value = raw.strip()
if "://" not in value:
return value

parsed = urlparse(value)
hostname = parsed.hostname or ""
safe_label = (
f"{parsed.scheme}://{hostname}"
if parsed.scheme and hostname
else "URL with scheme"
)
logger.warning(
"The hostPort '%s' contains a URL scheme. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306'). Stripping the scheme prefix.",
safe_label,
)
try:
port = parsed.port
except ValueError as exc:
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
) from exc

if not hostname:
# urlparse couldn't extract a hostname (e.g. 'jdbc:postgresql://host:5432/db')
# Fall back to stripping scheme and any trailing path/query/fragment/userinfo.
tail = value.rsplit("://", 1)[-1]
for sep in ("/", "?", "#"):
tail = tail.split(sep, 1)[0]
if "@" in tail:
tail = tail.rsplit("@", 1)[-1]
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the not hostname fallback branch, the function returns the stripped tail without validating the port component. For inputs like jdbc:postgresql://host:abc/db, this will return host:abc and the original int(port) failure can still happen later, which undermines the goal of producing a clearer error for non-numeric ports (and also makes the docstring claim about raising on non-numeric ports incomplete). Consider parsing tail for a host:port pattern and raising a ValueError when the port is present but non-numeric.

Suggested change
tail = tail.rsplit("@", 1)[-1]
tail = tail.rsplit("@", 1)[-1]
fallback_port = None
if tail.startswith("["):
closing_bracket = tail.find("]")
if closing_bracket != -1 and len(tail) > closing_bracket + 1:
if tail[closing_bracket + 1] == ":":
fallback_port = tail[closing_bracket + 2 :]
elif ":" in tail:
_, fallback_port = tail.rsplit(":", 1)
if fallback_port and not fallback_port.isdigit():
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)

Copilot uses AI. Check for mistakes.
return tail

host = f"[{hostname}]" if ":" in hostname else hostname
return f"{host}:{port}" if port is not None else host


class BaseModel(PydanticBaseModel):
"""
Base model for OpenMetadata generated models.
Expand All @@ -42,17 +92,25 @@ class BaseModel(PydanticBaseModel):

def model_post_init(self, context: Any, /):
"""
This function is used to parse the FilterPattern fields for the Connection classes.
This is needed because dict is defined in the JSON schema for the FilterPattern field,
but a FilterPattern object is required in the generated code.
Post-init hook for Connection classes:
- Sanitises ``hostPort`` by stripping accidental URL scheme prefixes.
- Converts raw ``dict`` values into ``FilterPattern`` objects.
"""
Comment on lines +117 to 120
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description includes an auto-generated “Summary by Gitar” referencing UI changes (e.g., fontSize in AutocompleteBase/ComboBox), but this PR’s diff only modifies ingestion framework Python files. Please update/remove that summary to avoid a mismatch between the PR description and the actual changes being reviewed.

Copilot uses AI. Check for mistakes.
# pylint: disable=import-outside-toplevel
if not self.__class__.__name__.endswith("Connection"):
return
if not hasattr(self, "__pydantic_fields__"):
return

if "hostPort" in self.__pydantic_fields__:
raw = getattr(self, "hostPort", None)
if isinstance(raw, str) and "://" in raw:
# Let ValueError propagate: if the hostPort cannot be parsed
# (e.g. non-numeric port), the user must fix their config
# rather than silently getting a broken hostPort.
object.__setattr__(self, "hostPort", _strip_hostport_scheme(raw))

Comment on lines 121 to +134
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad try/except Exception around the whole model_post_init will also swallow failures from clean_host_port() (e.g. http://localhost:abc). In that case the model keeps the original invalid hostPort and downstream code can still crash with the same unhelpful parsing errors. Consider narrowing the exception handling to just the FilterPattern conversion, and let clean_host_port() errors propagate (or re-raise with a connection/hostPort-specific message). Also, the current warning text says it's parsing FilterPattern even when the failure is from hostPort sanitization.

Copilot uses AI. Check for mistakes.
try:
if not self.__class__.__name__.endswith("Connection"):
# Only parse FilterPattern for Connection classes
return
if not hasattr(self, "__pydantic_fields__"):
return
for field in self.__pydantic_fields__:
if field.endswith("FilterPattern"):
from metadata.generated.schema.type.filterPattern import (
Expand Down
24 changes: 23 additions & 1 deletion ingestion/src/metadata/utils/db_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
"""
Helpers module for db sources
"""

import time
import traceback
from typing import Iterable, List, Union
Expand All @@ -32,6 +33,7 @@
get_lineage_by_query,
get_lineage_via_table_entity,
)
from metadata.ingestion.models.custom_pydantic import _strip_hostport_scheme
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db_utils.py is importing _strip_hostport_scheme (a private underscored helper) from custom_pydantic.py. Since this is now used cross-module, consider making it a public helper (e.g., strip_hostport_scheme) or relocating the implementation to db_utils.py to avoid a fragile dependency on a private symbol.

Copilot uses AI. Check for mistakes.
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.ingestion.source.models import TableView
from metadata.utils import fqn
Expand All @@ -43,12 +45,32 @@
PUBLIC_SCHEMA = "public"


def clean_host_port(host_port: str) -> str:
"""
Strip URL scheme prefixes from a hostPort string.

Users sometimes enter a full URL (e.g. 'http://localhost:3306')
instead of just 'localhost:3306'. This strips the scheme to avoid
ValueError when parsing host and port.

Delegates to the stdlib-only helper colocated with ``BaseModel`` so the
behaviour stays in lockstep with Pydantic's ``model_post_init`` hook.
"""
value = host_port.strip()
if "://" not in value:
return value.rstrip("/")
return _strip_hostport_scheme(value)


def get_host_from_host_port(uri: str) -> str:
"""
if uri is like "localhost:9000"
then return the host "localhost"
"""
return uri.split(":")[0]
cleaned = clean_host_port(uri)
if cleaned.startswith("["):
return cleaned.split("]")[0] + "]"
return cleaned.split(":")[0]


# pylint: disable=too-many-locals
Expand Down
28 changes: 28 additions & 0 deletions ingestion/tests/unit/test_build_connection_url.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,34 @@ def test_get_connection_url_mysql(self):
"mysql+pymysql://openmetadata_user:mocked_token@localhost:3306/openmetadata_db",
)

def test_get_connection_url_mysql_with_url_scheme(self):
"""hostPort with http:// prefix should be cleaned automatically"""
connection = MysqlConnectionConfig(
username="openmetadata_user",
authType=BasicAuth(password="openmetadata_password"),
hostPort="http://localhost:3306",
databaseSchema="openmetadata_db",
)
engine_connection = MySQLConnection(connection).client
self.assertEqual(
engine_connection.url.render_as_string(hide_password=False),
"mysql+pymysql://openmetadata_user:openmetadata_password@localhost:3306/openmetadata_db",
)

def test_get_connection_url_postgres_with_url_scheme(self):
"""hostPort with https:// prefix should be cleaned automatically"""
connection = PostgresConnectionConfig(
username="openmetadata_user",
authType=BasicAuth(password="openmetadata_password"),
hostPort="https://localhost:5432",
database="openmetadata_db",
)
engine_connection = PostgresConnection(connection).client
self.assertEqual(
engine_connection.url.render_as_string(hide_password=False),
"postgresql+psycopg2://openmetadata_user:openmetadata_password@localhost:5432/openmetadata_db",
)

def test_get_connection_url_postgres(self):
connection = PostgresConnectionConfig(
username="openmetadata_user",
Expand Down
85 changes: 84 additions & 1 deletion ingestion/tests/unit/test_db_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
"""
Unit tests for db_utils module
"""

import uuid
from copy import deepcopy
from unittest import TestCase
Expand All @@ -36,7 +37,11 @@
from metadata.ingestion.lineage.models import Dialect
from metadata.ingestion.lineage.sql_lineage import search_cache
from metadata.ingestion.source.models import TableView
from metadata.utils.db_utils import get_host_from_host_port, get_view_lineage
from metadata.utils.db_utils import (
clean_host_port,
get_host_from_host_port,
get_view_lineage,
)


# Mock LineageTable class to simulate collate_sqllineage.core.models.Table
Expand Down Expand Up @@ -118,6 +123,84 @@ def test_get_host_from_host_port(self):
self.assertEqual(get_host_from_host_port("localhost"), "localhost")
self.assertEqual(get_host_from_host_port("example.com"), "example.com")

# Test with URL scheme prefixes
self.assertEqual(get_host_from_host_port("http://localhost:3306"), "localhost")
self.assertEqual(
get_host_from_host_port("https://example.com:5432"), "example.com"
)
self.assertEqual(get_host_from_host_port("http://localhost"), "localhost")

# Test with IPv6 addresses
self.assertEqual(get_host_from_host_port("http://[::1]:3306"), "[::1]")
self.assertEqual(get_host_from_host_port("[::1]:3306"), "[::1]")

def test_clean_host_port(self):
"""Test clean_host_port strips URL scheme prefixes"""
# Already-clean values pass through unchanged
self.assertEqual(clean_host_port("localhost:3306"), "localhost:3306")
self.assertEqual(clean_host_port("127.0.0.1:5432"), "127.0.0.1:5432")
self.assertEqual(clean_host_port("example.com"), "example.com")

# HTTP prefix is stripped
self.assertEqual(clean_host_port("http://localhost:3306"), "localhost:3306")
self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080")
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New assertions in this test exceed Black's default formatting width and will cause make py_format_check (black --check) to fail. Please run make py_format (or manually wrap these calls) so Black can reformat them consistently.

Suggested change
self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080")
self.assertEqual(
clean_host_port("http://example.com:8080"), "example.com:8080"
)

Copilot uses AI. Check for mistakes.

# HTTPS prefix is stripped
self.assertEqual(clean_host_port("https://localhost:5432"), "localhost:5432")
self.assertEqual(
clean_host_port("https://mydb.example.com:3306"), "mydb.example.com:3306"
)

# Trailing slash is stripped
self.assertEqual(clean_host_port("http://localhost:3306/"), "localhost:3306")

# Host only with scheme
self.assertEqual(clean_host_port("http://localhost"), "localhost")
self.assertEqual(clean_host_port("https://example.com"), "example.com")

# URL with path is handled — path/query/fragment are discarded
self.assertEqual(clean_host_port("http://localhost:3306/db"), "localhost:3306")
self.assertEqual(
clean_host_port("https://example.com:5432/mydb?ssl=true"),
"example.com:5432",
)

# Whitespace is stripped
self.assertEqual(clean_host_port(" localhost:3306 "), "localhost:3306")
self.assertEqual(clean_host_port(" http://localhost:3306 "), "localhost:3306")

# JDBC-style URLs fall back to raw extraction
self.assertEqual(clean_host_port("jdbc:postgresql://host:5432"), "host:5432")
self.assertEqual(clean_host_port("jdbc:postgresql://host:5432/db"), "host:5432")
self.assertEqual(
clean_host_port("jdbc:postgresql://host:5432?ssl=true"), "host:5432"
)
self.assertEqual(
clean_host_port("jdbc:postgresql://host:5432/db?ssl=true#ref"),
"host:5432",
)

# IPv6 addresses — brackets are preserved
self.assertEqual(clean_host_port("http://[::1]:3306"), "[::1]:3306")
self.assertEqual(clean_host_port("https://[::1]:5432"), "[::1]:5432")
self.assertEqual(clean_host_port("http://[::1]"), "[::1]")
self.assertEqual(
clean_host_port("http://[2001:db8::1]:3306"), "[2001:db8::1]:3306"
)

# Plain IPv6 without scheme passes through unchanged
self.assertEqual(clean_host_port("[::1]:3306"), "[::1]:3306")

# JDBC with userinfo — credentials are stripped
self.assertEqual(
clean_host_port("jdbc:postgresql://user:pass@host:5432/db"),
"host:5432",
)

# Invalid port raises ValueError
with self.assertRaises(ValueError):
clean_host_port("http://localhost:abc")

@patch("metadata.utils.db_utils.ConnectionTypeDialectMapper")
@patch("metadata.utils.db_utils.fqn")
def test_get_view_lineage_success_with_lineage_parser(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -415,7 +415,10 @@ export const AutocompleteBase = ({

useResizeObserver({ ref: triggerRef, onResize, box: 'border-box' });

const selectContextValue = useMemo(() => ({ size: 'sm' as const }), []);
const selectContextValue = useMemo(
() => ({ fontSize: 'md' as const, size: 'sm' as const }),
[]
);

const autocompleteContextValue = useMemo(
() => ({
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,7 @@ export const ComboBox = ({
placeholder = 'Search',
shortcut = true,
size = 'sm',
fontSize = 'md',
Comment on lines 138 to +142
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says this change is limited to 4 ingestion files, but this PR also modifies UI core components (openmetadata-ui-core-components/...). Please either update the PR description to reflect the UI changes (and why they are needed for this fix) or split the UI changes into a separate PR to keep the scope focused.

Copilot uses AI. Check for mistakes.
children,
items,
shortcutClassName,
Expand All @@ -165,7 +166,7 @@ export const ComboBox = ({
});

return (
<SelectContext.Provider value={{ size }}>
<SelectContext.Provider value={{ fontSize, size }}>
<AriaComboBox menuTrigger="focus" {...otherProps}>
Comment on lines 168 to 170
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title/description focus on ingestion hostPort sanitization and claim 4 ingestion files changed, but this PR also includes UI changes in openmetadata-ui-core-components (e.g., adding/passing fontSize through SelectContext). Please either update the PR description to account for these UI changes or split them into a separate PR to keep the scope aligned with the stated issue fix.

Copilot uses AI. Check for mistakes.
{(state) => (
<div className="tw:flex tw:flex-col tw:gap-1.5">
Expand Down
Loading