40 changes: 40 additions & 0 deletions ingestion/examples/README.md
# OpenMetadata AI SDK Recipes
### By Baibhav Prateek | Hackathon 2026

## Overview
This is my submission for the OpenMetadata Hackathon 2026.
I built three notebooks that showcase how to use AI with OpenMetadata.

## Notebooks
### 1. Metadata Health Report
- Connects to OpenMetadata
- Analyzes table and column documentation quality
- Generates health score and visual charts
- Saves results to CSV files

### 2. LangChain OpenMetadata Template
- Reusable template connecting AI to OpenMetadata
- Ask questions about your data in plain English
- Uses Groq AI (LLaMA 3) for natural language processing

### 3. OpenMetadata AI Agent
- Intelligent agent that decides how to search automatically
- Uses multiple tools to fetch the right data
- Most advanced of the three notebooks

## How to Run

### Prerequisites
pip install openmetadata-ingestion groq google-genai requests pandas matplotlib

Comment on lines +80 to +84
Copilot AI Apr 18, 2026

The README lists dependencies that are not used by these notebooks (e.g., google-genai, and openmetadata-ingestion even though the notebooks call the REST API via requests). Please align the installation instructions with the actual imports/usage, or refactor the notebooks to use the OpenMetadata Python SDK (openmetadata-ingestion / metadata.sdk) as advertised in the PR description.

### Setup
1. Get your OpenMetadata token from sandbox.open-metadata.org
2. Get your free Groq API key from console.groq.com
3. Replace the placeholder keys in Cell 1 of each notebook
4. Run all cells in order

## Technologies Used
- OpenMetadata API
- Groq AI (LLaMA 3.3 70b)
- Python, Pandas, Matplotlib
Copilot AI Apr 18, 2026

The PR description/issue mention using the OpenMetadata Python SDK, but the notebooks/README currently use raw REST calls via requests (and don’t demonstrate metadata.sdk). Please either update the examples to use the SDK client APIs, or adjust the README to clearly state these are REST-based examples.
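For reference, the SDK-based approach the reviewer mentions might look roughly like the following. This is a sketch based on the `openmetadata-ingestion` package, not code from this PR; the sandbox URL and token placeholder mirror the notebook's setup cell, and it needs a live server to actually run.

```python
# Hypothetical sketch: listing tables via the OpenMetadata Python SDK
# instead of raw REST calls. Requires `pip install openmetadata-ingestion`.
from metadata.generated.schema.entity.data.table import Table
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection,
)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import (
    OpenMetadataJWTClientConfig,
)
from metadata.ingestion.ometa.ometa_api import OpenMetadata

server_config = OpenMetadataConnection(
    hostPort="https://sandbox.open-metadata.org/api",
    authProvider="openmetadata",
    securityConfig=OpenMetadataJWTClientConfig(jwtToken="your_openmetadata_token_here"),
)
metadata = OpenMetadata(server_config)

# list_entities returns a paged response; .entities holds the Table objects
tables = metadata.list_entities(entity=Table, limit=10).entities
print([t.name for t in tables])
```

Whether the notebooks switch to this or stay REST-based, the README's dependency list should match the choice.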
- Jupyter Notebooks
gitar-bot[bot] marked this conversation as resolved.
261 changes: 261 additions & 0 deletions ingestion/examples/langchain_openmetadata_template.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "2d542bc1-1752-4bbe-9cac-c88548ce6393",
"metadata": {},
"source": [
"# LangChain + OpenMetadata Template\n",
"### Built by Baibhav Prateek | OpenMetadata Hackathon 2026\n",
"\n",
"## What is this?\n",
"A reusable template that connects AI to OpenMetadata.\n",
Comment on lines +8 to +12
Copilot AI Apr 18, 2026

This notebook is titled “LangChain + OpenMetadata Template”, but the code shown here uses requests + groq directly and does not use LangChain. Please either implement a minimal LangChain chain/agent example (and add the needed dependency) or rename the notebook so the title matches the contents.

"Anyone can use this as a starting point for their own\n",
"AI-powered data catalog applications.\n",
"\n",
"## How to use this template:\n",
"1) Add your API keys\n",
"2) Run all cells in order\n",
"3) Ask your own questions\n",
"4) Customize the questions for your use case\n",
"\n",
"## Technologies used:\n",
"1) OpenMetadata API for metadata\n",
"2) Groq AI (LLaMA 3) for natural language processing\n",
"3) Python requests for API calls"
Comment on lines +8 to +25
Copilot AI Apr 18, 2026

The title/README call this a “LangChain + OpenMetadata Template”, but the notebook does not import or use LangChain at all (it directly calls Groq and requests). Either update the implementation to actually use LangChain primitives (e.g., LLM/Prompt/Tool abstractions), or rename the notebook and its description to avoid misleading users.

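For illustration, a minimal LangChain wiring along the lines the reviewer suggests might look like this. It is a hypothetical sketch assuming the `langchain-core` and `langchain-groq` packages (neither is part of this PR), and it needs a real Groq API key to run:

```python
# Hypothetical sketch: the notebook's ask-a-question flow expressed with
# LangChain primitives. Requires `pip install langchain-core langchain-groq`.
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile", api_key="your_groq_api_key_here")

prompt = ChatPromptTemplate.from_template(
    "You are a helpful data catalog assistant.\n"
    "You have access to OpenMetadata with these tables: {tables}\n\n"
    "User question: {question}\n\nAnswer helpfully and concisely."
)

# Compose prompt -> model into a runnable chain (LCEL pipe syntax)
chain = prompt | llm

# table_names would come from the notebook's get_tables() helper
answer = chain.invoke({
    "tables": ["ACCOUNTS", "_airbyte_raw_customers"],
    "question": "Which tables relate to customers?",
})
print(answer.content)
```

If the notebook keeps its direct `groq` client instead, renaming the notebook would resolve the mismatch with less work.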
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac2f9ec7-80b3-4b2d-89ae-3bf237059733",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Setup complete!\n"
]
}
],
"source": [
"import requests\n",
"import json\n",
"from groq import Groq\n",
"\n",
"# Your credentials\n",
"GROQ_API_KEY = \"your_groq_api_key_here\"\n",
"BASE_URL = \"https://sandbox.open-metadata.org\"\n",
"TOKEN = \"your_openmetadata_token_here\"\n",
"\n",
"HEADERS = {\n",
" \"Authorization\": f\"Bearer {TOKEN}\",\n",
" \"Content-Type\": \"application/json\"\n",
"}\n",
"\n",
"# Initialize Groq client\n",
"client = Groq(api_key=GROQ_API_KEY)\n",
"\n",
"print(\"✅ Setup complete!\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cfa44929-6991-432b-8455-071cf8a12fe0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Helper functions ready!\n"
]
}
Copilot AI Apr 18, 2026

The committed cells include execution outputs and non-null execution_count values. This makes diffs noisy and can go stale quickly (especially for API-driven responses). Please clear outputs/reset execution counts before committing, or ensure the outputs are intentionally kept and match the current code.

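One way to do this before committing, using only the standard library (`jupyter nbconvert --clear-output --inplace notebook.ipynb` would also work), is a small script along these lines; the file name in the usage comment is the notebook from this PR:

```python
import json

def clear_notebook_outputs(nb: dict) -> dict:
    """Strip outputs and reset execution counts on every code cell."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Usage on a notebook file:
#   with open("langchain_openmetadata_template.ipynb") as f:
#       nb = json.load(f)
#   with open("langchain_openmetadata_template.ipynb", "w") as f:
#       json.dump(clear_notebook_outputs(nb), f, indent=1)
```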
],
"source": [
"# Functions to fetch data from OpenMetadata\n",
"def get_tables(limit=10):\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/tables\",\n",
" headers=HEADERS,\n",
" params={\"limit\": limit}\n",
" )\n",
" return response.json().get(\"data\", [])\n",
"\n",
"def get_databases():\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/databases\",\n",
" headers=HEADERS,\n",
" params={\"limit\": 20}\n",
" )\n",
" return response.json().get(\"data\", [])\n",
"\n",
"def search_assets(query):\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/search/query\",\n",
" headers=HEADERS,\n",
" params={\"q\": query, \"index\": \"table_search_index\", \"limit\": 5}\n",
" )\n",
" return response.json().get(\"hits\", {}).get(\"hits\", [])\n",
Copilot AI Apr 18, 2026

This recipe makes HTTP calls to OpenMetadata but does not check for non-2xx responses before calling response.json(). If authentication fails or the server returns an error, this will raise confusing exceptions. Please add response.raise_for_status() (or explicit status checks) and surface a clear error message when the API call fails.

Suggested change
"def get_tables(limit=10):\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/tables\",\n",
" headers=HEADERS,\n",
" params={\"limit\": limit}\n",
" )\n",
" return response.json().get(\"data\", [])\n",
"\n",
"def get_databases():\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/databases\",\n",
" headers=HEADERS,\n",
" params={\"limit\": 20}\n",
" )\n",
" return response.json().get(\"data\", [])\n",
"\n",
"def search_assets(query):\n",
" response = requests.get(\n",
" f\"{BASE_URL}/api/v1/search/query\",\n",
" headers=HEADERS,\n",
" params={\"q\": query, \"index\": \"table_search_index\", \"limit\": 5}\n",
" )\n",
" return response.json().get(\"hits\", {}).get(\"hits\", [])\n",
"def openmetadata_get(path, params=None):\n",
" url = f\"{BASE_URL}{path}\"\n",
" response = requests.get(url, headers=HEADERS, params=params)\n",
" try:\n",
" response.raise_for_status()\n",
" except requests.HTTPError as exc:\n",
" error_body = response.text.strip()\n",
" raise RuntimeError(\n",
" f\"OpenMetadata API request failed for {url} with status \"\n",
" f\"{response.status_code}: {error_body or 'No response body returned.'}\"\n",
" ) from exc\n",
"\n",
" try:\n",
" return response.json()\n",
" except ValueError as exc:\n",
" raise RuntimeError(\n",
" f\"OpenMetadata API request to {url} returned a non-JSON response.\"\n",
" ) from exc\n",
"\n",
"def get_tables(limit=10):\n",
" response_json = openmetadata_get(\n",
" \"/api/v1/tables\",\n",
" params={\"limit\": limit}\n",
" )\n",
" return response_json.get(\"data\", [])\n",
"\n",
"def get_databases():\n",
" response_json = openmetadata_get(\n",
" \"/api/v1/databases\",\n",
" params={\"limit\": 20}\n",
" )\n",
" return response_json.get(\"data\", [])\n",
"\n",
"def search_assets(query):\n",
" response_json = openmetadata_get(\n",
" \"/api/v1/search/query\",\n",
" params={\"q\": query, \"index\": \"table_search_index\", \"limit\": 5}\n",
" )\n",
" return response_json.get(\"hits\", {}).get(\"hits\", [])\n",

"\n",
"def get_table_details(table_name):\n",
" results = search_assets(table_name)\n",
" if results:\n",
" return results[0].get(\"_source\", {})\n",
" return {}\n",
"\n",
"print(\"✅ Helper functions ready!\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ddbb5ecf-d621-43a7-a5b7-03ac2cdec978",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🤖 AI says:\n",
"We have 9 tables in total. Some of the table names include 'ACCOUNTS', 'acct_issue_table', '_airbyte_raw_customers', '_airbyte_raw_orders', and others related to Airbyte raw data and a cdc table.\n"
]
}
],
"source": [
"# This function connects AI with OpenMetadata\n",
"# Step 1: Fetch real tables from OpenMetadata\n",
"# Step 2: Give that information to the AI as context\n",
"# Step 3: The AI uses that context to answer the question\n",
"# This way the AI always has up-to-date information\n",
"\n",
"def ask_ai(question):\n",
" # Fetch context from OpenMetadata\n",
" tables = get_tables(limit=10)\n",
" table_names = [t.get(\"name\", \"\") for t in tables]\n",
" \n",
" # Build prompt\n",
" prompt = f\"\"\"You are a helpful data catalog assistant.\n",
"You have access to OpenMetadata with these tables: {table_names}\n",
"\n",
"User question: {question}\n",
"\n",
"Answer helpfully and concisely.\"\"\"\n",
"\n",
" response = client.chat.completions.create(\n",
" model=\"llama-3.3-70b-versatile\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}]\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"# Test it!\n",
"answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
Copilot AI Apr 18, 2026

ask_ai() only fetches limit=10 tables and then prompts the model with that subset, but the demo question asks “How many tables do we have…”. With the current logic the answer can never reflect the full catalog and may be misleading. Please either (a) change the question/output wording to reflect the limited sample, or (b) fetch/paginate all tables (or at least a larger configurable sample) when the question is about totals.

Suggested change
"answer = ask_ai(\"How many tables do we have and what are some of their names?\")\n",
"answer = ask_ai(\"From the fetched sample of up to 10 tables, how many tables are listed and what are some of their names?\")\n",

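Option (b) could be sketched as a cursor-pagination helper like the one below. This is illustrative, not code from the PR: the page-fetching function is injected so the paging logic is shown independently of `requests`, and the `paging.after` cursor shape follows the OpenMetadata list endpoints:

```python
def fetch_all_tables(fetch_page, page_size=100):
    """Collect every table by following OpenMetadata's `paging.after` cursor.

    `fetch_page(params)` should GET /api/v1/tables with the given params and
    return the decoded JSON body ({"data": [...], "paging": {...}}).
    """
    tables, after = [], None
    while True:
        params = {"limit": page_size}
        if after:
            params["after"] = after
        body = fetch_page(params)
        tables.extend(body.get("data", []))
        after = body.get("paging", {}).get("after")
        if not after:  # no cursor means this was the last page
            return tables

# With the notebook's requests setup it could be wired up roughly as:
#   fetch_all_tables(lambda p: requests.get(f"{BASE_URL}/api/v1/tables",
#                                           headers=HEADERS, params=p).json())
```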
"print(\"🤖 AI says:\")\n",
"print(answer)"
]
},
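The context-injection step inside `ask_ai` can also be isolated as a pure function, which makes the pattern easy to unit-test without calling Groq. This is a sketch derived from the cell above, not code from the notebook:

```python
def build_prompt(table_names, question):
    """Assemble the context-injection prompt used by the ask_ai pattern."""
    return (
        "You are a helpful data catalog assistant.\n"
        f"You have access to OpenMetadata with these tables: {table_names}\n\n"
        f"User question: {question}\n\n"
        "Answer helpfully and concisely."
    )
```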
{
"cell_type": "code",
"execution_count": 16,
"id": "72b75316-6857-4018-99b9-c75f29071e4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"============================================================\n",
" 🤖 OpenMetadata AI Assistant Demo\n",
"============================================================\n",
"\n",
"❓ Question: Which tables seem to be related to customers?\n",
"----------------------------------------\n",
"🤖 Answer: The table that seems to be related to customers is '_airbyte_raw_customers'.\n",
"\n",
"\n",
"❓ Question: Which tables look like they contain financial data?\n",
"----------------------------------------\n",
"🤖 Answer: The tables that appear to contain financial data are: \n",
"\n",
"1. 'ACCOUNTS' (multiple instances)\n",
"2. 'acct_issue_table'\n",
"3. '_airbyte_raw_order_items' \n",
"4. '_airbyte_raw_orders' \n",
"\n",
"These tables have names that suggest they may contain information related to financial transactions, accounts, or orders.\n",
"\n",
"\n",
"❓ Question: What would you recommend to improve the data catalog?\n",
"----------------------------------------\n",
"🤖 Answer: To improve the data catalog, I recommend:\n",
"\n",
"1. **Data deduplication**: Remove duplicate 'ACCOUNTS' tables to avoid confusion.\n",
"2. **Table naming conventions**: Rename tables with underscores and prefixes (e.g., '_airbyte_raw_') to more descriptive names for better understanding.\n",
"3. **Data standardization**: Standardize column names and data types across similar tables (e.g., 'orders' and 'order_items') for easier data integration.\n",
"4. **Data documentation**: Add descriptions and metadata to each table to provide context and facilitate discovery.\n",
"5. **Categorization and tagging**: Organize tables into categories (e.g., 'customers', 'orders', 'staff') and apply relevant tags for efficient searching and filtering.\n",
"\n",
"============================================================\n",
"✅ Template demo complete!\n",
"============================================================\n"
]
}
],
"source": [
"# Interactive Q&A session\n",
"questions = [\n",
" \"Which tables look incomplete or poorly documented?\",\n",
" \"What kind of organization does this data belong to?\",\n",
" \"If you were a new data analyst, which tables would you explore first?\",\n",
"]\n",
"\n",
"print(\"=\" * 60)\n",
"print(\" 🤖 OpenMetadata AI Assistant Demo\")\n",
"print(\"=\" * 60)\n",
"\n",
"for question in questions:\n",
" print(f\"\\n❓ Question: {question}\")\n",
" print(\"-\" * 40)\n",
" answer = ask_ai(question)\n",
" print(f\"🤖 Answer: {answer}\")\n",
" print()\n",
"\n",
"print(\"=\" * 60)\n",
"print(\" 🤖 OpenMetadata AI Template Demo\")\n",
"print(\" Built for OpenMetadata Hackathon 2026\")\n",
"print(\"=\" * 60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbcaac82-2132-47e9-8721-f384270685ad",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
Copilot AI Apr 18, 2026

Notebook metadata declares Python 3.13.9, but this repository explicitly supports Python 3.9–3.11 for ingestion/SDK code. Please update the notebook kernel/language metadata to a supported version (e.g., 3.11) to avoid misleading users and compatibility issues.

Suggested change
"version": "3.13.9"
"version": "3.11"

Copilot AI Apr 18, 2026

Notebook metadata lists Python 3.13.9, which is outside the ingestion module’s supported versions (e.g., ingestion/noxfile.py lists 3.10–3.12). Please update the kernel/language metadata to a supported version to reduce confusion when users run these notebooks.

Suggested change
"version": "3.13.9"
"version": "3.11.0"

}
Comment on lines +200 to +206
Copilot AI Apr 18, 2026

Notebook metadata indicates it was created with Python 3.13.9. OpenMetadata ingestion/examples are expected to run on supported Python versions (e.g., repo notebooks under examples/python-sdk/... use 3.11.x), so this kernel/version metadata is likely to mislead users and can break dependencies. Please re-save the notebook using a supported Python kernel (3.9–3.11) so language_info.version matches the supported runtime.

},
"nbformat": 4,
"nbformat_minor": 5
}