Add AI SDK recipes: health report, LangChain template and AI agent no… #27506
# 🤖 OpenMetadata AI SDK Recipes
### By Baibhav Prateek | OpenMetadata Hackathon 2026

## 🎯 Problem Statement
Most data teams struggle with poor metadata quality — tables without descriptions, no owners assigned, and no easy way to explore their data catalog using natural language. This project solves all three problems using AI.

## 💡 Solution
Three ready-to-use Jupyter notebooks that demonstrate how to combine OpenMetadata's REST API with AI to build powerful metadata workflows.
---

## 📓 Notebooks
### 1. 🏥 Metadata Health Report (`metadata_health_report.ipynb`)
**Problem it solves:** Data teams have no visibility into how well their metadata is documented.

**What it does:**
- Connects to OpenMetadata and fetches all tables
- Checks which tables are missing descriptions and owners
- Calculates an overall health score (0-100)
- Generates visual charts showing coverage
- Exports results to CSV for further analysis

**Sample Output:**
```
==================================================
📊 MY OPENMETADATA HEALTH REPORT
Total Tables Analyzed  : 50
✅ Have Description    : 24 (48%)
❌ Missing Description : 26 (52%)
✅ Have Owner          : 0 (0%)
❌ Missing Owner       : 50 (100%)
Overall Health Score   : 24/100
Status                 : 🔴 CRITICAL
```
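The score in the sample output can be reproduced with a short sketch. Note the assumptions: the score is taken as the average of description coverage and owner coverage (which matches the 48%/0% and 24/100 figures above), and the `description`/`owners` field names follow the `/api/v1/tables` response shape; the notebook's actual formula may differ.

```python
# Sketch: compute a metadata health score from a list of table records.
# ASSUMPTION: score = average of description coverage and owner coverage,
# which matches the sample output (48% descriptions, 0% owners -> 24/100).

def health_score(tables):
    total = len(tables)
    if total == 0:
        return 0
    with_desc = sum(1 for t in tables if t.get("description"))
    with_owner = sum(1 for t in tables if t.get("owners") or t.get("owner"))
    desc_pct = 100 * with_desc / total
    owner_pct = 100 * with_owner / total
    return round((desc_pct + owner_pct) / 2)

# Tiny worked example with fake table records:
sample = (
    [{"description": "orders fact table"}] * 24  # documented, no owner
    + [{}] * 26                                  # undocumented, no owner
)
print(health_score(sample))  # -> 24, i.e. 48% descriptions and 0% owners
```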
---
### 2. 🔗 AI Template (`langchain_openmetadata_template.ipynb`)
**Problem it solves:** Developers need a reusable starting point for building AI-powered data catalog applications.

**What it does:**
- Provides a clean, reusable template connecting Groq AI to OpenMetadata
- Fetches real metadata context from OpenMetadata
- Uses LLaMA 3.3 70B to answer questions about your data
- Anyone can customize this template for their own use case

**Sample Questions it answers:**
- "Which tables look incomplete or poorly documented?"
- "What kind of organization does this data belong to?"
- "Which tables should a new data analyst explore first?"
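The template's core pattern is context injection: fetch current table names first, then embed them in the prompt so the model answers from live catalog state rather than its training data. A minimal sketch of that prompt-building step (`build_prompt` is a hypothetical helper; the notebook inlines the same f-string directly before calling Groq):

```python
# Sketch of the context-injection pattern used by the template.
# build_prompt is a hypothetical helper name; the notebook inlines
# this f-string just before the Groq chat-completion call.

def build_prompt(question, table_names):
    return (
        "You are a helpful data catalog assistant.\n"
        f"You have access to OpenMetadata with these tables: {table_names}\n"
        "\n"
        f"User question: {question}\n"
        "\n"
        "Answer helpfully and concisely."
    )

prompt = build_prompt(
    "Which tables look incomplete?",
    ["raw_orders", "fact_orders", "dim_customers"],
)
print(prompt)
```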
---
### 3. 🤖 AI Agent (`openmetadata_ai_agent.ipynb`)
**Problem it solves:** Users have to know exactly what to search for in their data catalog. This agent makes it conversational.

**What it does:**
- Intelligent agent that automatically decides how to search
- Has 3 tools: get_tables, search_tables, get_databases
- AI decides which tool to use based on your question
- Returns human-friendly answers with full reasoning shown

**Sample Interaction:**
```
❓ User: Find tables related to orders
🧠 Agent thinking...
🔧 Agent decided to use: search_tables: orders
📦 Data fetched: ['raw_orders', 'fact_orders', 'orders'...]
🤖 Answer: Found several order-related tables...
```
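One lightweight way to implement the "AI decides which tool to use" step is to have the model reply with a small JSON object and dispatch on it. This is a sketch, not necessarily the notebook's exact approach; the stub `TOOLS` dict stands in for the real `get_tables`/`search_tables`/`get_databases` helpers, and the model reply is simulated.

```python
import json

# Sketch: dispatch on a tool-selection JSON object returned by the LLM.
# The model is prompted to answer like {"tool": "search_tables", "arg": "orders"};
# we parse that reply and call the matching tool. These lambdas are stubs
# standing in for the notebook's real OpenMetadata helpers.

TOOLS = {
    "get_tables": lambda arg: ["raw_orders", "dim_customers"],
    "search_tables": lambda arg: [
        t for t in ["raw_orders", "fact_orders", "orders"] if arg in t
    ],
    "get_databases": lambda arg: ["ecommerce_db"],
}

def dispatch(llm_reply):
    choice = json.loads(llm_reply)  # e.g. the text produced by the Groq call
    tool = TOOLS.get(choice.get("tool"))
    if tool is None:
        return {"error": f"unknown tool {choice.get('tool')!r}"}
    return {"tool": choice["tool"], "data": tool(choice.get("arg", ""))}

# Simulated model reply for "Find tables related to orders":
result = dispatch('{"tool": "search_tables", "arg": "orders"}')
print(result)  # {'tool': 'search_tables', 'data': ['raw_orders', 'fact_orders', 'orders']}
```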
---
## 🚀 Quick Start

### Prerequisites
```bash
pip install openmetadata-ingestion groq requests pandas matplotlib jupyter
```

### Setup
1. Get your OpenMetadata token from your profile page
2. Get a free Groq API key from console.groq.com
3. Open any notebook and replace the placeholders in Cell 1:
   ```python
   GROQ_API_KEY = "your_groq_api_key_here"
   TOKEN = "your_openmetadata_token_here"
   ```
4. Run all cells in order!
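Before running a full notebook, you can sanity-check the token with a one-off request. A minimal sketch, assuming the sandbox `BASE_URL` from the notebooks and the `/api/v1/tables` endpoint; adjust both for your own instance:

```python
import requests

# Sketch: verify an OpenMetadata token before running the notebooks.
# BASE_URL and TOKEN mirror the placeholders in Cell 1 of each notebook.
BASE_URL = "https://sandbox.open-metadata.org"
TOKEN = "your_openmetadata_token_here"

def make_headers(token):
    # Same header shape the notebooks use for every API call
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def check_connection(base_url, token):
    # A 200 from the tables endpoint means the URL and token are both good
    resp = requests.get(
        f"{base_url}/api/v1/tables",
        headers=make_headers(token),
        params={"limit": 1},
    )
    return resp.status_code == 200

# Usage (commented out to avoid a live call here):
# print("✅ token OK" if check_connection(BASE_URL, TOKEN) else "❌ check token / URL")
```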
---
## 🛠️ Technologies Used
- **OpenMetadata REST API** — metadata fetching and search
- **Groq AI (LLaMA 3.3 70B)** — natural language processing
- **Python** — core language
- **Pandas** — data analysis
- **Matplotlib** — visualization
- **Jupyter Notebooks** — interactive environment

## 🎯 Impact
These notebooks help data teams:
- **Identify** poorly documented tables instantly
- **Explore** their data catalog using natural language
- **Build** AI-powered metadata applications faster

## 📁 File Structure
```
ingestion/examples/
├── metadata_health_report.ipynb           # Health scoring notebook
├── langchain_openmetadata_template.ipynb  # AI template notebook
├── openmetadata_ai_agent.ipynb            # AI agent notebook
├── requirements.txt                       # Dependencies
└── README.md                              # This file
```

## 🔗 Related Issue
This submission is for issue #26646 — Metadata AI SDK Starter Templates / Recipes
---

**File:** `langchain_openmetadata_template.ipynb`
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2d542bc1-1752-4bbe-9cac-c88548ce6393",
   "metadata": {},
   "source": [
    "# LangChain + OpenMetadata Template\n",
    "### Built by Baibhav Prateek | OpenMetadata Hackathon 2026\n",
    "\n",
    "## What is this?\n",
    "A reusable template that connects AI to OpenMetadata.\n",
    "Anyone can use this as a starting point for their own\n",
    "AI-powered data catalog applications.\n",
    "\n",
    "## How to use this template:\n",
    "1) Add your API keys\n",
    "2) Run all cells in order\n",
    "3) Ask your own questions\n",
    "4) Customize the questions for your use case\n",
    "\n",
    "## Technologies used:\n",
    "1) OpenMetadata API for metadata\n",
    "2) Groq AI (LLaMA 3) for natural language processing\n",
    "3) Python requests for API calls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac2f9ec7-80b3-4b2d-89ae-3bf237059733",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "from groq import Groq\n",
    "\n",
    "# Your credentials\n",
    "GROQ_API_KEY = \"your_groq_api_key_here\"\n",
    "BASE_URL = \"https://sandbox.open-metadata.org\"\n",
    "TOKEN = \"paste_your_token_here\"\n",
    "\n",
    "HEADERS = {\n",
    "    \"Authorization\": f\"Bearer {TOKEN}\",\n",
    "    \"Content-Type\": \"application/json\"\n",
    "}\n",
    "\n",
    "# Initialize Groq client\n",
    "client = Groq(api_key=GROQ_API_KEY)\n",
    "\n",
    "print(\"✅ Setup complete!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfa44929-6991-432b-8455-071cf8a12fe0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Functions to fetch data from OpenMetadata with error handling\n",
    "def get_tables(limit=10):\n",
    "    try:\n",
    "        response = requests.get(\n",
    "            f\"{BASE_URL}/api/v1/tables\",\n",
    "            headers=HEADERS,\n",
    "            params={\"limit\": limit}\n",
    "        )\n",
    "        if response.status_code != 200:\n",
    "            print(f\"❌ Error: {response.status_code}\")\n",
    "            return []\n",
    "        return response.json().get(\"data\", [])\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error fetching tables: {e}\")\n",
    "        return []\n",
    "\n",
    "def get_databases():\n",
    "    try:\n",
    "        response = requests.get(\n",
    "            f\"{BASE_URL}/api/v1/databases\",\n",
    "            headers=HEADERS,\n",
    "            params={\"limit\": 20}\n",
    "        )\n",
    "        if response.status_code != 200:\n",
    "            print(f\"❌ Error: {response.status_code}\")\n",
    "            return []\n",
    "        return response.json().get(\"data\", [])\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error fetching databases: {e}\")\n",
    "        return []\n",
    "\n",
    "def search_assets(query):\n",
    "    try:\n",
    "        response = requests.get(\n",
    "            f\"{BASE_URL}/api/v1/search/query\",\n",
    "            headers=HEADERS,\n",
    "            params={\"q\": query, \"index\": \"table_search_index\", \"limit\": 5}\n",
    "        )\n",
    "        if response.status_code != 200:\n",
    "            print(f\"❌ Error: {response.status_code}\")\n",
    "            return []\n",
    "        return response.json().get(\"hits\", {}).get(\"hits\", [])\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error searching: {e}\")\n",
    "        return []\n",
    "\n",
    "print(\"✅ Helper functions ready!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddbb5ecf-d621-43a7-a5b7-03ac2cdec978",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This function connects AI with OpenMetadata.\n",
    "# Step 1: fetch real tables from OpenMetadata.\n",
    "# Step 2: pass that information to the AI as context.\n",
    "# Step 3: the AI uses that context to answer the question.\n",
    "# This way the AI always has up-to-date information.\n",
    "\n",
    "def ask_ai(question):\n",
    "    # Fetch context from OpenMetadata\n",
    "    tables = get_tables(limit=10)\n",
    "    table_names = [t.get(\"name\", \"\") for t in tables]\n",
    "\n",
    "    # Build prompt\n",
    "    prompt = f\"\"\"You are a helpful data catalog assistant.\n",
    "You have access to OpenMetadata with these tables: {table_names}\n",
    "\n",
    "User question: {question}\n",
    "\n",
    "Answer helpfully and concisely.\"\"\"\n",
    "\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"llama-3.3-70b-versatile\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}]\n",
    "    )\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "# Test it!\n",
    "answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
Suggested change:
- "answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
+ "answer = ask_ai(\"From the fetched sample of up to 10 tables, how many tables are listed and what are some of their names?\")\n",
Copilot AI, Apr 18, 2026:
Notebook metadata declares Python 3.13.9, but this repository explicitly supports Python 3.9–3.11 for ingestion/SDK code. Please update the notebook kernel/language metadata to a supported version (e.g., 3.11) to avoid misleading users and compatibility issues.
Suggested change:
- "version": "3.13.9"
+ "version": "3.11"
Copilot AI, Apr 18, 2026:
Notebook metadata lists Python 3.13.9, which is outside the ingestion module's supported versions (e.g., ingestion/noxfile.py lists 3.10–3.12). Please update the kernel/language metadata to a supported version to reduce confusion when users run these notebooks.
Suggested change:
- "version": "3.13.9"
+ "version": "3.11.0"
Copilot AI, Apr 18, 2026:
Notebook metadata indicates it was created with Python 3.13.9. OpenMetadata ingestion/examples are expected to run on supported Python versions (e.g., repo notebooks under examples/python-sdk/... use 3.11.x), so this kernel/version metadata is likely to mislead users and can break dependencies. Please re-save the notebook using a supported Python kernel (3.9–3.11) so language_info.version matches the supported runtime.
The README lists dependencies that are not used by these notebooks (e.g., `google-genai`, and `openmetadata-ingestion` even though the notebooks call the REST API via `requests`). Please align the installation instructions with the actual imports/usage, or refactor the notebooks to use the OpenMetadata Python SDK (`openmetadata-ingestion` / `metadata.sdk`) as advertised in the PR description.