feat(search): self-supervised metadata-aware encoder trained on lineage & glossary signals metadata-aware semantic encoder by Yashsainani123 · Pull Request #27512 · open-metadata/OpenMetadata

Yashsainani123 · 2026-04-18T12:09:55Z

Summary

Semantic search in OpenMetadata today uses general-purpose embeddings trained on web text. These models are blind to metadata structure — they cannot tell that order_id in orders is semantically closer to order_id in order_items than to session_id, even though both are ID columns. This PR fixes that by introducing a self-supervised fine-tuning pipeline trained entirely on signals already present in any OpenMetadata instance — no manual labelling required.

Problem

General-purpose models like text-embedding-3-small and all-MiniLM-L6-v2 are trained on web text. They have no awareness of:

Column lineage relationships (A flows into B → they share semantic meaning)
Glossary assignments (two columns tagged with the same term are a positive pair)
Table co-membership (columns in the same table are structurally related)

This causes semantic search to surface irrelevant results for catalog exploration queries — the core complaint in #26647.

Solution

A complete self-supervised training pipeline that extracts structural supervision from the catalog itself and fine-tunes a small encoder model on it. Training in Python, serving in the existing Java DJL pipeline — zero changes to the production search path.

Files Changed

New — `ingestion/src/metadata/ml/`

File	Purpose
`__init__.py`	Package marker
`training_data.py`	Extracts self-supervised training pairs from lineage, glossary, and table co-membership signals via the existing OMeta Python SDK
`train_encoder.py`	Fine-tunes `ModernBERT-base` using multi-objective contrastive loss. Compatible with `sentence-transformers>=5.0` (`datasets.Dataset` API)
`evaluate_encoder.py`	Evaluation framework — MRR@10, Recall@{1,5,10}, Semantic Cohesion Score. Compares fine-tuned vs baseline side by side
`encoder_client.py`	Drop-in `MetadataEncoder` client. Auto-selects fine-tuned model if present, falls back to `all-MiniLM-L6-v2`. `@lru_cache` loading, L2-normalised output
`README.md`	Full 4-step usage documentation

Modified — `ingestion/setup.py`

Added optional [ml] extras:

sentence-transformers[train]>=5.0
torch>=2.0.0
transformers>=4.40.0
scikit-learn>=1.3.0
numpy>=1.24.0
accelerate>=1.1.0
datasets

Zero impact on existing ingestion installs — opt-in only via pip install openmetadata-ingestion[ml].

Modified — `EmbeddingService.java`

SentenceTransformerProvider inner class was a hash-based stub (TODO comment, returned random floats). Replaced with a real DJL implementation using the same Criteria/ZooModel/Predictor pattern already used by DjlEmbeddingClient.java. Auto-detects openmetadata-finetuned-encoder/ at startup. LocalEmbeddingProvider retained as ultimate fallback if DJL init fails.

How the Training Signal Works

Signal Source	Pair Type	Label
Lineage edge A → B	Positive	1.0
Same table co-occurrence	Soft positive	0.7
Table co-membership	Soft positive	0.5
3+ lineage hops apart	Hard negative	0.0
Shared glossary term	Positive	1.0
Disjoint glossary term sets	Negative	0.0
Different services	Hard negative	0.0

Training loss: 0.6 × CosineSimilarityLoss + 0.4 × MultipleNegativesRankingLoss
Base model: answerdotai/ModernBERT-base (fallback: all-MiniLM-L6-v2)

Architecture: Training in Python, Serving in Java

ONE-TIME OFFLINE TRAINING (Python):
  Live OMeta instance
  → training_data.py  (lineage + glossary + table signals)
  → train_encoder.py  (contrastive fine-tuning)
  → openmetadata-finetuned-encoder/

PRODUCTION SERVING (Java — zero pipeline changes):
  Entity create/update
  → VectorEmbeddingHandler        (unchanged)
  → DjlEmbeddingClient            (auto-loads fine-tuned model if present)
  → OpenSearch knn_vector field   (unchanged)

No Python at inference time. No new infrastructure. No new Java dependencies.

Usage (4 commands)

# 1. Extract training data from your live instance
python -m metadata.ml.training_data \
  --host http://localhost:8585 --token <jwt> \
  --output training_pairs.json

# 2. Fine-tune the encoder
python -m metadata.ml.train_encoder \
  --data training_pairs.json \
  --output openmetadata-finetuned-encoder/

# 3. Evaluate vs baseline
python -m metadata.ml.evaluate_encoder \
  --fine-tuned openmetadata-finetuned-encoder/ \
  --test-data training_pairs.json

# 4. Place model at repo root and restart server
# DjlEmbeddingClient picks up openmetadata-finetuned-encoder/ automatically

Test Results — 42/42 Passing

Section                   Tests   PASS   FAIL
A  Imports & package         6      6      0
B  Training data extractor   8      8      0
C  Model training pipeline   6      6      0
D  Evaluation framework      6      6      0
E  Encoder client            7      7      0
F  Java EmbeddingService     4      4      0
G  Semantic quality          5      5      0
─────────────────────────────────────────────
TOTAL                       42     42      0

Section G — Core Issue Validation

Test	What it validates	Positive Sim	Negative Sim	Gap
G1	`order_id` proximity (exact issue example)	0.6968	0.3245	+0.3723
G2	Glossary-tagged columns cluster together	0.4815	0.1168	+0.3647
G3	Lineage-connected columns score above unrelated	0.7044	0.2545	+0.4499
G4	Same-table cohesion above cross-table random	0.4422	0.1330	+0.3092
G5	Post-fine-tuning improvement (1-epoch smoke)	0.8828	0.1579	+0.7249

All gaps positive. The encoder correctly ranks metadata-semantically related columns above unrelated ones across all three signal types.

Graceful Degradation

Every level has a fallback — nothing breaks if the fine-tuned model is absent:

openmetadata-finetuned-encoder/ present?
  YES → load fine-tuned model (DJL)
  NO  → load all-MiniLM-L6-v2 (DJL, same as today)
        DJL init fails?
          → LocalEmbeddingProvider (existing fallback, unchanged)

Existing deployments that never run training see zero behaviour change.

Type of Change

New feature (self-supervised training pipeline)
Bug fix (SentenceTransformerProvider was a non-functional stub)

Checklist

I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document
My PR title is Fixes #26647: self-supervised metadata-aware encoder trained on lineage & glossary signals
I have commented my code, particularly in hard-to-understand areas
Self-supervised — zero manual labelling required
No changes to existing Java search pipeline
Graceful fallback at every level
Compatible with sentence-transformers v5.4.1
42/42 tests passing including 5 semantic quality validation tests
New feature — issue Custom embeddings to improve encoded semantics #26647 properly describes the goal and approach

github-actions · 2026-04-18T12:10:21Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-04-18T12:43:16Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Closes open-metadata#26647 Add ingestion/src/metadata/ml/ package implementing a complete self-supervised training pipeline for a metadata-aware semantic encoder, replacing general-purpose web-text embeddings with a model trained on structural signals already present in any OpenMetadata instance. ## Problem General-purpose embeddings (OpenAI text-embedding-3-small, MiniLM trained on web text) are unaware of metadata semantics. They cannot distinguish that order_id in orders is semantically closer to order_id in order_items than to session_id — even though both are ID columns. This causes semantic search to surface irrelevant results for catalog exploration queries. ## Solution — Four Python components + one Java fix ### training_data.py — Self-Supervised Pair Extractor Extracts training signal from three sources with zero manual labelling: - Lineage edges: column A->B = positive (1.0); same-table = soft positive (0.7/0.5); 3+ hops apart = hard negative (0.0) - Glossary assignments: shared term = positive (1.0); disjoint sets = negative (0.0) - Table co-membership: same table = soft positive (0.5); different services = hard negative (0.0) ### train_encoder.py — Contrastive Fine-Tuning (sentence-transformers v5.4.1) - Base model: answerdotai/ModernBERT-base (MiniLM fallback) - Multi-objective loss: 0.6*CosineSimilarityLoss + 0.4*MNRLoss - Uses datasets.Dataset API (compatible with sentence-transformers>=5.0) - AdamW lr=2e-5, epoch-based eval, early stopping patience=3 - Output: openmetadata-finetuned-encoder/ (auto-detected by DJL client) ### evaluate_encoder.py — Evaluation Framework - MRR@10, Recall@{1,5,10}, Semantic Cohesion Score - Compares fine-tuned vs all-MiniLM-L6-v2 baseline - Saves evaluation_results.json for CI tracking ### encoder_client.py — Drop-In Integration Client - MetadataEncoder: auto-selects fine-tuned model if present, else MiniLM - @lru_cache model loading (load once per process) - L2-normalised output (cosine-similarity ready) - Zero changes to existing Java search pipeline required ### EmbeddingService.java — SentenceTransformerProvider Fix - Replaces hash-based stub with real DJL Criteria/ZooModel/Predictor - Auto-detects openmetadata-finetuned-encoder/ at startup - Falls back to all-MiniLM-L6-v2 via DJL if fine-tuned model absent - LocalEmbeddingProvider retained as ultimate fallback ## Validation — 42/42 tests passing G1 order_id semantic gap: +0.3723 G2 glossary clustering gap: +0.3647 G3 lineage scoring gap: +0.4499 G4 table cohesion gap: +0.3092 G5 post-fine-tuning gap: +0.7249 ## Dependencies added under extras_require['ml'] sentence-transformers[train]>=5.0, torch>=2.0, transformers>=4.40, scikit-learn>=1.3, numpy>=1.24, accelerate>=1.1, datasets

github-actions · 2026-04-18T12:55:49Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

github-actions · 2026-04-20T21:45:36Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

gitar-bot · 2026-04-20T21:46:20Z

Code Review ✅ Approved 6 resolved / 6 findings

Implements a metadata-aware semantic encoder for search, resolving several issues including unused loss configurations, suboptimal negative generation complexity, and data splitting reliability. No remaining issues identified.

✅ 6 resolved

✅ Bug: MultipleNegativesRankingLoss is created but never used

📄 ingestion/src/metadata/ml/train_encoder.py:148-149 📄 ingestion/src/metadata/ml/train_encoder.py:198
The PR description and code comments state the training uses a multi-objective loss: 0.6 × CosineSimilarityLoss + 0.4 × MultipleNegativesRankingLoss. However, mnrl_loss is instantiated at line 149 but never passed to the trainer — only cosine_loss is used (line 198). This means the model is trained with only CosineSimilarityLoss, contradicting the documented approach and likely reducing training effectiveness for the ranking task.

✅ Bug: early_stopping_patience parameter is accepted but never used

📄 ingestion/src/metadata/ml/train_encoder.py:117 📄 ingestion/src/metadata/ml/train_encoder.py:176-190
The train() function accepts early_stopping_patience (line 117) and the CLI exposes --patience (line 230), but neither the SentenceTransformerTrainingArguments nor any EarlyStoppingCallback uses this value. Training will always run for the full number of epochs regardless of validation performance.

✅ Performance: Hard negatives generation is O(n²) and unbounded

📄 ingestion/src/metadata/ml/training_data.py:180-189
In training_data.py lines 180-189, _build_hard_negatives creates a Cartesian product of columns across all service pairs. For catalogs with many services each having many columns, this could produce an enormous number of negative pairs, causing memory issues and heavily imbalanced training data. The extract_all method limits this to [:5] and [:1] (lines 289-290), but _build_hard_negatives as a public method has no such safeguard.

✅ Quality: Unused variable all_cols_flat in extract_all

📄 ingestion/src/metadata/ml/training_data.py:283
Line 283 creates all_cols_flat by flattening all service columns, but this variable is never used anywhere in the method.

✅ Bug: MNRL receives negative pairs, treating them as positives

📄 ingestion/src/metadata/ml/train_encoder.py:149-163
The _WeightedLoss wrapper passes ALL training data (including label=0.0 hard-negative pairs) to MultipleNegativesRankingLoss. MNRL is designed exclusively for positive pairs — it ignores the labels tensor entirely and treats every (sentence1, sentence2) in the batch as a true positive match, using other in-batch sentences as negatives.

This means hard-negative pairs (columns from different services that should be pushed apart) are instead being pulled together by the MNRL component (40% of total loss), directly contradicting the CosineSimilarityLoss signal and corrupting training.

The fix is to either:

Filter the dataset so only positive pairs (label >= 0.5) are passed to MNRL, or

Use separate DataLoaders for each loss (sentence-transformers supports multi-dataset training), or

Replace MNRL with a loss that respects explicit labels (e.g., OnlineContrastiveLoss or ContrastiveLoss).

...and 1 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Yashsainani123 requested a review from a team as a code owner April 18, 2026 12:09