Athena is an end-to-end framework for benchmarking Retrieval-Augmented Generation (RAG) pipelines. It provides modular interfaces for embedding generation, vector database search, and LLM evaluation, wired together by a single configuration file and an orchestration script that collects timing and accuracy metrics.
Published at IISWC 2025: Athena: A Plug-and-Play Advisor for Retrieval-Augmented Generation using VectorDB.
```
.
├── config.yaml                      # Single config file for the entire pipeline
├── run_pipeline.py                  # End-to-end orchestration script
├── metrics.py                       # MetricsCollector (timing + accuracy)
│
├── emb/
│   ├── emb_server_vllm.py           # Launch the vLLM embedding server
│   └── emb_query.py                 # Client: fetch_embeddings()
│
├── llm/
│   ├── llm_server_vllm.py           # Launch a vLLM completion server
│   ├── llm_server_ollama.py         # Launch an Ollama server
│   └── llm_eval.py                  # LLM class + bulk_eval()
│
├── milvus/
│   └── milvus_tasks_backup.py       # MilvusInterface
│
├── postgres/
│   └── postgres_tasks_backup_v2.py  # PostgresInterface
│
└── profile/
    ├── gpu_power.py                 # nvidia-smi GPU power logger
    └── uprof_script.sh              # AMD uProf CPU profiler
```
Install the Python dependencies:

```
pip install pymilvus vllm requests numpy pyyaml pandas evaluate rouge_score
```

Milvus itself should be run via Docker:

```
# follow https://milvus.io/docs/install_standalone-docker.md
docker compose up -d
```

Embedding server (vLLM, port 8000):

```
python emb/emb_server_vllm.py
```

LLM server (pick one):

```
python llm/llm_server_vllm.py      # vLLM, port 8001
python llm/llm_server_ollama.py    # Ollama, port 11434
```

Both server scripts read `embedding_model` / `llm_model` from `config.yaml`.
Edit config.yaml to point at your collection, models, and input files. See Configuration Reference below or the full docs.
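For a quick smoke test, the two input files can be generated in a few lines of Python. The questions and answers below are toy examples; the only structural requirement is one ground-truth answer per question, in the same order:

```python
import json

questions = [
    {"question": "What is the capital of France?"},
    {"question": "Who wrote Hamlet?"},
]
answers = ["Paris", "Shakespeare"]

# The pipeline pairs questions and answers by position
assert len(questions) == len(answers)

with open("prompts.json", "w") as f:
    json.dump(questions, f, indent=2)
with open("answers.json", "w") as f:
    json.dump(answers, f, indent=2)
```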
```
python run_pipeline.py                   # uses ./config.yaml
python run_pipeline.py --config my.yaml  # custom config
```

The pipeline will:

- Embed all questions via the embedding server
- Search Milvus for relevant documents
- Generate answers with the LLM and compute ROUGE scores
- Write results to `output_file` (CSV) and, optionally, `metrics_output` (JSON)
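For intuition on the accuracy metric: ROUGE-1 is essentially a unigram-overlap F1 between prediction and reference. The pipeline itself uses the `evaluate` / `rouge_score` packages (which also handle stemming and other details), but a back-of-envelope sketch of the idea is:

```python
def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1, the core idea behind ROUGE-1 (illustrative only)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Count each unigram at most as often as it appears in both texts
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the capital is paris", "paris"))  # 0.4
```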
Configuration Reference (`config.yaml`):

```yaml
# Embedding
embedding_model: "infly/INF-retriever-v1-1.5b"
emb_api_url: "http://localhost:8000/v1/embeddings"

# Milvus
milvus_uri: "http://localhost:19530"
milvus_token: "root:Milvus"
collection_name: "my_collection"
vector_field: "embedding"
search_limit: 5
search_params: {"metric_type": "COSINE", "params": {"ef": 64}}
output_fields: ["id", "text"]

# LLM
llm_model: "llama3"
llm_provider: "ollama"   # "ollama" or "vllm"
batch_size: 4

# Input/output
input_file: "prompts.json"        # [{"question": "..."}, ...]
groundtruth_file: "answers.json"  # ["answer1", "answer2", ...]
output_file: "results.csv"

# Metrics
collect_metrics: true
metrics_output: "metrics.json"
```

Full field descriptions are in the docs.
`prompts.json` — array of objects, each with a `"question"` key:

```json
[
  {"question": "What is the capital of France?"},
  {"question": "Who wrote Hamlet?"}
]
```

`answers.json` — array of ground-truth answer strings, one per question:

```json
["Paris", "Shakespeare"]
```

`results.csv` — one row per question:
| question | prediction | reference | rouge-1 | rouge-L | latency_first_token |
|---|---|---|---|---|---|
`metrics.json` — pipeline timing and accuracy summary:

```json
{
  "embed_latency": 1.23,
  "search_latency": 0.45,
  "llm_batch_latency": 8.91,
  "end_to_end_latency": 10.59,
  "avg_rouge1": 0.61,
  "avg_rougeL": 0.58
}
```

Using the components from Python — embeddings:

```python
from emb.emb_query import fetch_embeddings

embeddings = fetch_embeddings(
    ["Text 1", "Text 2"],
    model="infly/INF-retriever-v1-1.5b",
    api_url="http://localhost:8000/v1/embeddings",
)  # returns np.ndarray of shape (2, dim)
```

Milvus search:

```python
from milvus.milvus_tasks_backup import MilvusInterface

db = MilvusInterface(uri="http://localhost:19530", token="root:Milvus")
results = db.search(
    collection_name="my_collection",
    anns_field_name="embedding",
    emb_in=embeddings.tolist(),
    limit=5,
    search_params={"metric_type": "COSINE", "params": {"ef": 64}},
    output_fields=["id", "text"],
)
```

LLM evaluation:

```python
from llm.llm_eval import LLM, bulk_eval

llm = LLM(model="llama3", provider="ollama")
df = bulk_eval(llm, questions, retrieved_docs, references=answers, batch_size=4)
```

Metrics collection:

```python
from metrics import MetricsCollector

metrics = MetricsCollector(enabled=True)
with metrics.timed("my_stage"):
    do_work()
metrics.save("metrics.json")
```

Set `enabled=False` to disable all collection with zero overhead.
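The timing interface above can be approximated in a few lines with `contextlib`. This is a sketch of the pattern, not Athena's actual `metrics.py`:

```python
import json
import time
from contextlib import contextmanager

class TimingCollector:
    """Minimal stand-in for the MetricsCollector timing pattern (illustrative)."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.timings = {}

    @contextmanager
    def timed(self, stage: str):
        if not self.enabled:
            yield  # disabled: no clock reads, no bookkeeping
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[stage] = time.perf_counter() - start

    def save(self, path: str):
        if self.enabled:
            with open(path, "w") as f:
                json.dump(self.timings, f, indent=2)
```

The `try/finally` ensures a stage's duration is recorded even if the timed block raises.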
GPU power draw (NVIDIA):

```
MODEL=infly/INF-retriever-v1-1.5b DURATION_SEC=60 python profile/gpu_power.py
```

CPU profiling (AMD):

```
./profile/uprof_script.sh <output_dir>
```

Citation:

```bibtex
@INPROCEEDINGS{11241995,
  author={Liang, Ning and Wenz, Fabian and Giceva, Jana and Wills, Lisa Wu},
  booktitle={2025 IEEE International Symposium on Workload Characterization (IISWC)},
  title={Athena: A Plug-and-Play Advisor for Retrieval-Augmented Generation using VectorDB},
  year={2025},
  pages={28-41},
  doi={10.1109/IISWC66894.2025.00013}
}
```