121 changes: 121 additions & 0 deletions ingestion/examples/README.md
@@ -0,0 +1,121 @@
# 🤖 OpenMetadata AI SDK Recipes
### By Baibhav Prateek | OpenMetadata Hackathon 2026

## 🎯 Problem Statement
Most data teams struggle with poor metadata quality: tables without
descriptions, no owners assigned, and no easy way to explore the
data catalog in natural language. This project tackles all three
problems using AI.

## 💡 Solution
Three ready-to-use Jupyter notebooks that demonstrate how to combine
OpenMetadata's REST API with AI to build powerful metadata workflows.

---

## 📓 Notebooks

### 1. 🏥 Metadata Health Report (`metadata_health_report.ipynb`)
**Problem it solves:** Data teams have no visibility into how well
their metadata is documented.

**What it does:**
- Connects to OpenMetadata and fetches all tables
- Checks which tables are missing descriptions and owners
- Calculates an overall health score (0-100)
- Generates visual charts showing coverage
- Exports results to CSV for further analysis

**Sample Output:**
```
==================================================
📊 MY OPENMETADATA HEALTH REPORT
Total Tables Analyzed : 50
✅ Have Description : 24 (48%)
❌ Missing Description : 26 (52%)
✅ Have Owner : 0 (0%)
❌ Missing Owner : 50 (100%)
Overall Health Score : 24/100
Status : 🔴 CRITICAL
```
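
The 24/100 score above is consistent with averaging description coverage (48%) and owner coverage (0%). A minimal sketch of that scoring rule, assuming each table object from `/api/v1/tables` may carry optional `description` and `owners` fields (`health_score` is an illustrative helper, not the notebook's exact code):

```python
# Illustrative health score: average of description and owner coverage,
# scaled to 0-100. Assumes the table dicts mirror the OpenMetadata
# /api/v1/tables response shape (optional "description" and "owners").
def health_score(tables):
    total = len(tables)
    if total == 0:
        return 0
    with_desc = sum(1 for t in tables if t.get("description"))
    with_owner = sum(1 for t in tables if t.get("owners"))
    return round(100 * (with_desc + with_owner) / (2 * total))

tables = [
    {"description": "orders fact table", "owners": []},
    {"description": "", "owners": [{"name": "alice"}]},
    {"description": "", "owners": []},
]
print(health_score(tables))  # → 33 (1/3 documented, 1/3 owned)
```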

---

### 2. 🔗 AI Template (`langchain_openmetadata_template.ipynb`)
**Problem it solves:** Developers need a reusable starting point
for building AI-powered data catalog applications.

**What it does:**
- Provides a clean, reusable template connecting Groq AI to OpenMetadata
- Fetches real metadata context from OpenMetadata
- Uses LLaMA 3.3 70B to answer questions about your data
- Anyone can customize this template for their own use case

**Sample Questions it answers:**
- "Which tables look incomplete or poorly documented?"
- "What kind of organization does this data belong to?"
- "Which tables should a new data analyst explore first?"
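
The context-injection pattern behind these answers can be sketched without the Groq call: fetch live table names first, then embed them in the prompt so the model answers from current metadata rather than memory. The prompt wording mirrors the template notebook; `build_prompt` is an illustrative helper:

```python
# Sketch of the template's context-injection step: real table names
# fetched from OpenMetadata are interpolated into the prompt before
# it is sent to the LLM.
def build_prompt(question, table_names):
    return (
        "You are a helpful data catalog assistant.\n"
        f"You have access to OpenMetadata with these tables: {table_names}\n\n"
        f"User question: {question}\n\n"
        "Answer helpfully and concisely."
    )

prompt = build_prompt("Which tables cover orders?", ["raw_orders", "dim_date"])
print("raw_orders" in prompt)  # → True
```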

---

### 3. 🤖 AI Agent (`openmetadata_ai_agent.ipynb`)
**Problem it solves:** Users have to know exactly what to search
for in their data catalog. This agent makes it conversational.

**What it does:**
- Intelligent agent that automatically decides how to search
- Has 3 tools: get_tables, search_tables, get_databases
- AI decides which tool to use based on your question
- Returns human-friendly answers with full reasoning shown

**Sample Interaction:**
```
❓ User: Find tables related to orders
🧠 Agent thinking...
🔧 Agent decided to use: search_tables: orders
📦 Data fetched: ['raw_orders', 'fact_orders', 'orders'...]
🤖 Answer: Found several order-related tables...
```
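
The tool-routing loop described above can be sketched with the LLM's choice replaced by a simple keyword stub. The tool names mirror the README (`get_tables`, `search_tables`, `get_databases`); the dispatch logic and stub tools are illustrative, not the notebook's actual implementation:

```python
# Stubbed agent loop: pick a tool from the question, run it, report.
# A real agent would ask the LLM which tool to call; here a keyword
# check stands in for that decision.
def route(question):
    q = question.lower()
    if "database" in q:
        return "get_databases", None
    if "find" in q or "search" in q:
        # crude keyword extraction: use the last word as the search term
        return "search_tables", q.split()[-1]
    return "get_tables", None

def run_agent(question, tools):
    name, arg = route(question)
    result = tools[name](arg) if arg is not None else tools[name]()
    return f"Used {name}; found {len(result)} results"

# Stub tools standing in for the OpenMetadata API calls
tools = {
    "get_tables": lambda: ["dim_date", "fact_orders"],
    "search_tables": lambda q: [t for t in ("raw_orders", "fact_orders", "dim_date") if q in t],
    "get_databases": lambda: ["ecommerce"],
}
print(run_agent("Find tables related to orders", tools))  # → Used search_tables; found 2 results
```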

---

## 🚀 Quick Start

### Prerequisites
```bash
pip install openmetadata-ingestion groq requests pandas matplotlib jupyter
```

Comment on lines +80 to +84 (Copilot AI, Apr 18, 2026):

The README lists dependencies that are not used by these notebooks (e.g., google-genai, and openmetadata-ingestion even though the notebooks call the REST API via requests). Please align the installation instructions with the actual imports/usage, or refactor the notebooks to use the OpenMetadata Python SDK (openmetadata-ingestion / metadata.sdk) as advertised in the PR description.

### Setup
1. Get your OpenMetadata token from your profile page
2. Get a free Groq API key from console.groq.com
3. Open any notebook and replace the placeholders in Cell 1:
```python
GROQ_API_KEY = "your_groq_api_key_here"
TOKEN = "your_openmetadata_token_here"
```
4. Run all cells in order!
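
Before running a full notebook, the credentials can be sanity-checked with a small helper. `check_token` is hypothetical (not part of the notebooks); it uses only the standard library so it runs without extra installs, and the endpoint and Bearer-header shape match what the notebooks use:

```python
# Hypothetical credential check against the OpenMetadata REST API.
import urllib.error
import urllib.request

def auth_headers(token):
    # Header shape used throughout these notebooks
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def check_token(base_url, token):
    """Return True if the token can list one table (makes a network call)."""
    req = urllib.request.Request(f"{base_url}/api/v1/tables?limit=1",
                                 headers=auth_headers(token))
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

print(auth_headers("demo-token")["Authorization"])  # → Bearer demo-token
```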

---

## 🛠️ Technologies Used
- **OpenMetadata REST API** — metadata fetching and search
- **Groq AI (LLaMA 3.3 70B)** — natural language processing
- **Python** — core language
- **Pandas** — data analysis
- **Matplotlib** — visualization
- **Jupyter Notebooks** — interactive environment

## 🎯 Impact
These notebooks help data teams:
- **Identify** poorly documented tables instantly
- **Explore** their data catalog using natural language
- **Build** AI-powered metadata applications faster

## 📁 File Structure
```
ingestion/examples/
├── metadata_health_report.ipynb            # Health scoring notebook
├── langchain_openmetadata_template.ipynb   # AI template notebook
├── openmetadata_ai_agent.ipynb             # AI agent notebook
├── requirements.txt                        # Dependencies
└── README.md                               # This file
```

## 🔗 Related Issue
This submission is for issue #26646 — Metadata AI SDK Starter
Templates / Recipes
210 changes: 210 additions & 0 deletions ingestion/examples/langchain_openmetadata_template.ipynb
@@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2d542bc1-1752-4bbe-9cac-c88548ce6393",
"metadata": {},
"source": [
"# LangChain + OpenMetadata Template\n",
"### Built by Baibhav Prateek | OpenMetadata Hackathon 2026\n",
"\n",
"## What is this?\n",
"A reusable template that connects AI to OpenMetadata.\n",
Comment on lines +8 to +12 (Copilot AI, Apr 18, 2026):

This notebook is titled "LangChain + OpenMetadata Template", but the code shown here uses requests + groq directly and does not use LangChain. Please either implement a minimal LangChain chain/agent example (and add the needed dependency) or rename the notebook so the title matches the contents.

"Anyone can use this as a starting point for their own\n",
"AI-powered data catalog applications.\n",
"\n",
"## How to use this template:\n",
"1) Add your API keys\n",
"2) Run all cells in order\n",
"3) Ask your own questions\n",
"4) Customize the questions for your use case\n",
"\n",
"## Technologies used:\n",
"1) OpenMetadata API for metadata\n",
"2) Groq AI (LLaMA 3) for natural language processing\n",
"3) Python requests for API calls"
Comment on lines +8 to +25 (Copilot AI, Apr 18, 2026):

The title/README call this a "LangChain + OpenMetadata Template", but the notebook does not import or use LangChain at all (it directly calls Groq and requests). Either update the implementation to actually use LangChain primitives (e.g., LLM/Prompt/Tool abstractions), or rename the notebook and its description to avoid misleading users.

]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac2f9ec7-80b3-4b2d-89ae-3bf237059733",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import json\n",
"from groq import Groq\n",
"\n",
"# Your credentials\n",
"GROQ_API_KEY = \"your_groq_api_key_here\"\n",
"BASE_URL = \"https://sandbox.open-metadata.org\"\n",
"TOKEN = \"your_openmetadata_token_here\"\n",
"\n",
"HEADERS = {\n",
" \"Authorization\": f\"Bearer {TOKEN}\",\n",
" \"Content-Type\": \"application/json\"\n",
"}\n",
"\n",
"# Initialize Groq client\n",
"client = Groq(api_key=GROQ_API_KEY)\n",
"\n",
"print(\"✅ Setup complete!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfa44929-6991-432b-8455-071cf8a12fe0",
"metadata": {},
"outputs": [],
"source": [
"# Functions to fetch data from OpenMetadata with error handling\n",
"def get_tables(limit=10):\n",
" try:\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/tables\",\n",
" headers=HEADERS,\n",
" params={\"limit\": limit}\n",
" )\n",
" if response.status_code != 200:\n",
" print(f\"❌ Error: {response.status_code}\")\n",
" return []\n",
" return response.json().get(\"data\", [])\n",
" except Exception as e:\n",
" print(f\"❌ Error fetching tables: {e}\")\n",
" return []\n",
"\n",
"def get_databases():\n",
" try:\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/databases\",\n",
" headers=HEADERS,\n",
" params={\"limit\": 20}\n",
" )\n",
" if response.status_code != 200:\n",
" print(f\"❌ Error: {response.status_code}\")\n",
" return []\n",
" return response.json().get(\"data\", [])\n",
" except Exception as e:\n",
" print(f\"❌ Error fetching databases: {e}\")\n",
" return []\n",
"\n",
"def search_assets(query):\n",
" try:\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/search/query\",\n",
" headers=HEADERS,\n",
" params={\"q\": query, \"index\": \"table_search_index\", \"limit\": 5}\n",
" )\n",
" if response.status_code != 200:\n",
" print(f\"❌ Error: {response.status_code}\")\n",
" return []\n",
" return response.json().get(\"hits\", {}).get(\"hits\", [])\n",
" except Exception as e:\n",
" print(f\"❌ Error searching: {e}\")\n",
" return []\n",
"\n",
"print(\"✅ Helper functions ready!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddbb5ecf-d621-43a7-a5b7-03ac2cdec978",
"metadata": {},
"outputs": [],
"source": [
"# This function connects AI with OpenMetadata:\n",
"# Step 1: Fetch real tables from OpenMetadata\n",
"# Step 2: Pass that information to the AI as context\n",
"# Step 3: The AI uses that context to answer the question\n",
"# This way the AI always has up-to-date information\n",
"\n",
"def ask_ai(question):\n",
" # Fetch context from OpenMetadata\n",
" tables = get_tables(limit=10)\n",
" table_names = [t.get(\"name\", \"\") for t in tables]\n",
" \n",
" # Build prompt\n",
" prompt = f\"\"\"You are a helpful data catalog assistant.\n",
"You have access to OpenMetadata with these tables: {table_names}\n",
"\n",
"User question: {question}\n",
"\n",
"Answer helpfully and concisely.\"\"\"\n",
"\n",
" response = client.chat.completions.create(\n",
" model=\"llama-3.3-70b-versatile\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}]\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"# Test it!\n",
"answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
Copilot AI, Apr 18, 2026:

ask_ai() only fetches limit=10 tables and then prompts the model with that subset, but the demo question asks "How many tables do we have…". With the current logic the answer can never reflect the full catalog and may be misleading. Please either (a) change the question/output wording to reflect the limited sample, or (b) fetch/paginate all tables (or at least a larger configurable sample) when the question is about totals.

Suggested change:
"answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
"answer = ask_ai(\"From the fetched sample of up to 10 tables, how many tables are listed and what are some of their names?\")\n",

"print(\"🤖 AI says:\")\n",
"print(answer)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72b75316-6857-4018-99b9-c75f29071e4a",
"metadata": {},
"outputs": [],
"source": [
"# Interactive Q&A session\n",
"questions = [\n",
" \"Which tables look incomplete or poorly documented?\",\n",
" \"What kind of organization does this data belong to?\",\n",
" \"If you were a new data analyst, which tables would you explore first?\",\n",
"]\n",
"\n",
"print(\"=\" * 60)\n",
"print(\" 🤖 OpenMetadata AI Assistant Demo\")\n",
"print(\"=\" * 60)\n",
"\n",
"for question in questions:\n",
" print(f\"\\n❓ Question: {question}\")\n",
" print(\"-\" * 40)\n",
" answer = ask_ai(question)\n",
" print(f\"🤖 Answer: {answer}\")\n",
" print()\n",
"\n",
"print(\"=\" * 60)\n",
"print(\" 🤖 OpenMetadata AI Template Demo\")\n",
"print(\" Built for OpenMetadata Hackathon 2026\")\n",
"print(\"=\" * 60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbcaac82-2132-47e9-8721-f384270685ad",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
Copilot AI, Apr 18, 2026:

Notebook metadata declares Python 3.13.9, but this repository explicitly supports Python 3.9–3.11 for ingestion/SDK code. Please update the notebook kernel/language metadata to a supported version (e.g., 3.11) to avoid misleading users and compatibility issues.

Suggested change:
"version": "3.13.9"
"version": "3.11"

Copilot AI, Apr 18, 2026:

Notebook metadata lists Python 3.13.9, which is outside the ingestion module's supported versions (e.g., ingestion/noxfile.py lists 3.10–3.12). Please update the kernel/language metadata to a supported version to reduce confusion when users run these notebooks.

Suggested change:
"version": "3.13.9"
"version": "3.11.0"

}
Comment on lines +200 to +206 (Copilot AI, Apr 18, 2026):

Notebook metadata indicates it was created with Python 3.13.9. OpenMetadata ingestion/examples are expected to run on supported Python versions (e.g., repo notebooks under examples/python-sdk/... use 3.11.x), so this kernel/version metadata is likely to mislead users and can break dependencies. Please re-save the notebook using a supported Python kernel (3.9–3.11) so language_info.version matches the supported runtime.
},
"nbformat": 4,
"nbformat_minor": 5
}