feat(search): self-supervised metadata-aware encoder trained on lineage & glossary signals #27512
Conversation
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Let us know if you need any help!
Force-pushed fe6a522 to 290702c
Closes open-metadata#26647

Add `ingestion/src/metadata/ml/` package implementing a complete self-supervised training pipeline for a metadata-aware semantic encoder, replacing general-purpose web-text embeddings with a model trained on structural signals already present in any OpenMetadata instance.

## Problem

General-purpose embeddings (OpenAI `text-embedding-3-small`, MiniLM trained on web text) are unaware of metadata semantics. They cannot distinguish that `order_id` in `orders` is semantically closer to `order_id` in `order_items` than to `session_id` — even though both are ID columns. This causes semantic search to surface irrelevant results for catalog exploration queries.

## Solution — Four Python components + one Java fix

### training_data.py — Self-Supervised Pair Extractor

Extracts training signal from three sources with zero manual labelling:

- Lineage edges: column A→B = positive (1.0); same-table = soft positive (0.7/0.5); 3+ hops apart = hard negative (0.0)
- Glossary assignments: shared term = positive (1.0); disjoint sets = negative (0.0)
- Table co-membership: same table = soft positive (0.5); different services = hard negative (0.0)

### train_encoder.py — Contrastive Fine-Tuning (sentence-transformers v5.4.1)

- Base model: `answerdotai/ModernBERT-base` (MiniLM fallback)
- Multi-objective loss: `0.6*CosineSimilarityLoss + 0.4*MNRLoss`
- Uses `datasets.Dataset` API (compatible with `sentence-transformers>=5.0`)
- AdamW lr=2e-5, epoch-based eval, early stopping patience=3
- Output: `openmetadata-finetuned-encoder/` (auto-detected by DJL client)

### evaluate_encoder.py — Evaluation Framework

- MRR@10, Recall@{1,5,10}, Semantic Cohesion Score
- Compares fine-tuned vs `all-MiniLM-L6-v2` baseline
- Saves `evaluation_results.json` for CI tracking

### encoder_client.py — Drop-In Integration Client

- `MetadataEncoder`: auto-selects fine-tuned model if present, else MiniLM
- `@lru_cache` model loading (load once per process)
- L2-normalised output (cosine-similarity ready)
- Zero changes to existing Java search pipeline required

### EmbeddingService.java — SentenceTransformerProvider Fix

- Replaces hash-based stub with real DJL `Criteria`/`ZooModel`/`Predictor`
- Auto-detects `openmetadata-finetuned-encoder/` at startup
- Falls back to `all-MiniLM-L6-v2` via DJL if fine-tuned model absent
- `LocalEmbeddingProvider` retained as ultimate fallback

## Validation — 42/42 tests passing

- G1 order_id semantic gap: +0.3723
- G2 glossary clustering gap: +0.3647
- G3 lineage scoring gap: +0.4499
- G4 table cohesion gap: +0.3092
- G5 post-fine-tuning gap: +0.7249

## Dependencies added under extras_require['ml']

`sentence-transformers[train]>=5.0`, `torch>=2.0`, `transformers>=4.40`, `scikit-learn>=1.3`, `numpy>=1.24`, `accelerate>=1.1`, `datasets`
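As an illustration of the pair-labelling rules above, here is a minimal pure-Python sketch. The function name and signature are hypothetical (not the PR's actual API); the label values and precedence follow the list in the description:

```python
def label_pair(lineage_hops=None, shared_glossary_term=None,
               same_table=False, same_service=True):
    """Map structural signals for a column pair to a similarity label.

    Precedence mirrors the description: lineage, then glossary,
    then table co-membership. Returns None when no signal applies
    (such a pair would not be emitted as training data).
    """
    if lineage_hops is not None:
        if lineage_hops == 1:        # direct lineage edge A -> B
            return 1.0
        if lineage_hops >= 3:        # 3+ hops apart: hard negative
            return 0.0
    if shared_glossary_term is not None:
        return 1.0 if shared_glossary_term else 0.0
    if same_table:                   # same table: soft positive
        return 0.5
    if not same_service:             # different services: hard negative
        return 0.0
    return None
```

For example, a pair linked by a direct lineage edge is labelled 1.0, while two columns from different services with no other signal are labelled 0.0.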
Force-pushed 290702c to c1bd1e2
Code Review ✅ Approved — 6 resolved / 6 findings

Implements a metadata-aware semantic encoder for search, resolving several issues including unused loss configurations, suboptimal negative-generation complexity, and data-splitting reliability. No remaining issues identified.

Resolved findings:

- ✅ Bug: MultipleNegativesRankingLoss is created but never used
- ✅ Bug: early_stopping_patience parameter is accepted but never used
- ✅ Performance: hard-negatives generation is O(n²) and unbounded
- ✅ Quality: unused variable

— Gitar
Summary
Closes #26647
Semantic search in OpenMetadata today uses general-purpose embeddings trained on web text. These models are blind to metadata structure — they cannot tell that `order_id` in `orders` is semantically closer to `order_id` in `order_items` than to `session_id`, even though both are ID columns. This PR fixes that by introducing a self-supervised fine-tuning pipeline trained entirely on signals already present in any OpenMetadata instance — no manual labelling required.

Problem
General-purpose models like `text-embedding-3-small` and `all-MiniLM-L6-v2` are trained on web text. They have no awareness of lineage edges, glossary assignments, or table membership. This causes semantic search to surface irrelevant results for catalog exploration queries — the core complaint in #26647.
Solution
A complete self-supervised training pipeline that extracts structural supervision from the catalog itself and fine-tunes a small encoder model on it. Training in Python, serving in the existing Java DJL pipeline — zero changes to the production search path.
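To make the multi-objective training signal concrete, here is a dependency-free sketch of the weighted loss combination. The real pipeline uses sentence-transformers loss classes; `cosine_similarity_loss` below is a plain squared-error-against-label stand-in for the same idea:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_loss(pairs):
    """Mean squared error between predicted cosine similarity and the
    self-supervised label in [0, 1] -- the idea behind CosineSimilarityLoss."""
    return sum((cosine(u, v) - y) ** 2 for u, v, y in pairs) / len(pairs)

def multi_objective(cos_loss, mnr_loss, w_cos=0.6, w_mnr=0.4):
    # Weighted sum described in the PR: 0.6*CosineSimilarityLoss + 0.4*MNRLoss
    return w_cos * cos_loss + w_mnr * mnr_loss
```

A perfectly-fit positive pair (cosine 1.0, label 1.0) contributes zero cosine loss; the weighted sum then just scales the ranking-loss term.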
Files Changed
New — `ingestion/src/metadata/ml/`

- `__init__.py`
- `training_data.py` — self-supervised pair extractor
- `train_encoder.py` — fine-tunes `ModernBERT-base` using multi-objective contrastive loss. Compatible with `sentence-transformers>=5.0` (`datasets.Dataset` API)
- `evaluate_encoder.py` — evaluation framework
- `encoder_client.py` — `MetadataEncoder` client. Auto-selects fine-tuned model if present, falls back to `all-MiniLM-L6-v2`. `@lru_cache` loading, L2-normalised output
- `README.md`

Modified — `ingestion/setup.py`

Added optional `[ml]` extras. Zero impact on existing ingestion installs — opt-in only via `pip install openmetadata-ingestion[ml]`.
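The client-side behaviour (auto-detect the fine-tuned directory, cache the choice, L2-normalise the output) can be sketched without the heavy dependencies. Directory and model names follow the description above; the helper names are hypothetical:

```python
import math
from functools import lru_cache
from pathlib import Path

FINETUNED_DIR = "openmetadata-finetuned-encoder"
FALLBACK_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

@lru_cache(maxsize=1)   # resolve once per process, as with the model itself
def resolve_model(base_dir: str = ".") -> str:
    """Prefer the fine-tuned model directory if it exists, else MiniLM."""
    candidate = Path(base_dir) / FINETUNED_DIR
    return str(candidate) if candidate.is_dir() else FALLBACK_MODEL

def l2_normalize(vec):
    """Scale to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
```

In the real client, `resolve_model`'s result would feed a `SentenceTransformer(...)` load, and every embedding would pass through `l2_normalize` before being returned.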
Modified — `EmbeddingService.java`

`SentenceTransformerProvider` inner class was a hash-based stub (TODO comment, returned random floats). Replaced with a real DJL implementation using the same `Criteria`/`ZooModel`/`Predictor` pattern already used by `DjlEmbeddingClient.java`. Auto-detects `openmetadata-finetuned-encoder/` at startup. `LocalEmbeddingProvider` retained as ultimate fallback if DJL init fails.

How the Training Signal Works
Training loss: `0.6 × CosineSimilarityLoss + 0.4 × MultipleNegativesRankingLoss`

Base model: `answerdotai/ModernBERT-base` (fallback: `all-MiniLM-L6-v2`)

Architecture: Training in Python, Serving in Java
No Python at inference time. No new infrastructure. No new Java dependencies.
Usage (4 commands)
Test Results — 42/42 Passing
Section G — Core Issue Validation
G1 — `order_id` proximity (exact issue example). All gaps positive: the encoder correctly ranks metadata-semantically related columns above unrelated ones across all three signal types.
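On one plausible reading, each reported "gap" is the difference in cosine similarity between the related and the unrelated pair for an anchor column; a positive gap means the related column ranks higher. A minimal sketch with toy vectors (not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_gap(anchor, related, unrelated):
    """Positive iff the anchor embedding is closer to the related one."""
    return cosine(anchor, related) - cosine(anchor, unrelated)

# Toy check mirroring G1: orders.order_id should sit nearer
# order_items.order_id than session_id after fine-tuning.
orders_order_id = [0.9, 0.1, 0.0]
items_order_id  = [0.8, 0.2, 0.0]
session_id      = [0.1, 0.1, 0.9]
```

With these toy vectors, `semantic_gap(orders_order_id, items_order_id, session_id)` comes out positive, which is the property the G-series tests assert.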
Graceful Degradation
Every level has a fallback — nothing breaks if the fine-tuned model is absent:

1. Fine-tuned `openmetadata-finetuned-encoder/` (if present)
2. `all-MiniLM-L6-v2` via DJL
3. `LocalEmbeddingProvider` (if DJL init fails)
Existing deployments that never run training see zero behaviour change.
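The degradation chain can be expressed as a tiny selection function. This is a hypothetical sketch; the real logic lives in `EmbeddingService.java`:

```python
def choose_provider(finetuned_present: bool, djl_ok: bool) -> str:
    """Three-level fallback: fine-tuned -> MiniLM via DJL -> local provider."""
    if djl_ok:
        if finetuned_present:
            return "openmetadata-finetuned-encoder"   # level 1: fine-tuned model
        return "all-MiniLM-L6-v2"                     # level 2: DJL default
    return "LocalEmbeddingProvider"                   # level 3: DJL init failed
```

A deployment that never runs training always lands on level 2 or 3, which is exactly the pre-PR behaviour.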
Type of Change
(`SentenceTransformerProvider` was a non-functional stub)

Checklist
- Fixes #26647: self-supervised metadata-aware encoder trained on lineage & glossary signals
- `sentence-transformers` v5.4.1