Athena

Athena is an end-to-end framework for benchmarking Retrieval-Augmented Generation (RAG) pipelines. It provides modular interfaces for embedding generation, vector database search, and LLM evaluation, wired together by a single configuration file and an orchestration script that collects timing and accuracy metrics.

Published at IISWC 2025: Athena: A Plug-and-Play Advisor for Retrieval-Augmented Generation using VectorDB.


Repository Layout

.
├── config.yaml              # Single config file for the entire pipeline
├── run_pipeline.py          # End-to-end orchestration script
├── metrics.py               # MetricsCollector (timing + accuracy)
│
├── emb/
│   ├── emb_server_vllm.py   # Launch the vLLM embedding server
│   └── emb_query.py         # Client: fetch_embeddings()
│
├── llm/
│   ├── llm_server_vllm.py   # Launch a vLLM completion server
│   ├── llm_server_ollama.py # Launch an Ollama server
│   └── llm_eval.py          # LLM class + bulk_eval()
│
├── milvus/
│   └── milvus_tasks_backup.py  # MilvusInterface
│
├── postgres/
│   └── postgres_tasks_backup_v2.py  # PostgresInterface
│
└── profile/
    ├── gpu_power.py         # nvidia-smi GPU power logger
    └── uprof_script.sh      # AMD uProf CPU profiler

Quick Start

1. Install dependencies

pip install pymilvus vllm requests numpy pyyaml pandas evaluate rouge_score

Milvus itself should be run via Docker:

# follow https://milvus.io/docs/install_standalone-docker.md
docker compose up -d

2. Start the servers

Embedding server (vLLM, port 8000):

python emb/emb_server_vllm.py

LLM server — pick one:

python llm/llm_server_vllm.py   # vLLM, port 8001
python llm/llm_server_ollama.py # Ollama, port 11434

Both server scripts read embedding_model / llm_model from config.yaml.

3. Configure

Edit config.yaml to point at your collection, models, and input files. See Configuration Reference below or the full docs.

4. Run the pipeline

python run_pipeline.py                    # uses ./config.yaml
python run_pipeline.py --config my.yaml  # custom config

The pipeline will:

  1. Embed all questions via the embedding server
  2. Search Milvus for relevant documents
  3. Generate answers with the LLM and compute ROUGE scores
  4. Write results to output_file (CSV) and optionally metrics_output (JSON)
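The four stages map onto a simple orchestration loop. The sketch below is illustrative only (the stage callables, signatures, and metric key names are assumptions, not the actual run_pipeline.py code), but it shows how per-stage timings like those in metrics.json can be collected:

```python
import time

def run_stages(questions, embed, search, generate, score):
    """Run embed -> search -> generate -> score, timing each stage.

    embed/search/generate/score are stand-in callables; the real
    pipeline wires in the project modules described below.
    """
    metrics = {}
    t_start = time.perf_counter()

    t0 = time.perf_counter()
    embeddings = embed(questions)                      # stage 1: embed questions
    metrics["embed_latency"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    docs = search(embeddings)                          # stage 2: vector search
    metrics["search_latency"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    answers = generate(questions, docs)                # stage 3: LLM generation
    metrics["llm_batch_latency"] = time.perf_counter() - t0

    metrics["avg_rouge1"] = score(answers)             # stage 4: accuracy scoring
    metrics["end_to_end_latency"] = time.perf_counter() - t_start
    return answers, metrics
```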

Configuration Reference

# Embedding
embedding_model: "infly/INF-retriever-v1-1.5b"
emb_api_url: "http://localhost:8000/v1/embeddings"

# Milvus
milvus_uri: "http://localhost:19530"
milvus_token: "root:Milvus"
collection_name: "my_collection"
vector_field: "embedding"
search_limit: 5
search_params: {"metric_type": "COSINE", "params": {"ef": 64}}
output_fields: ["id", "text"]

# LLM
llm_model: "llama3"
llm_provider: "ollama"   # "ollama" or "vllm"
batch_size: 4

# Input/output
input_file: "prompts.json"       # [{"question": "..."}, ...]
groundtruth_file: "answers.json" # ["answer1", "answer2", ...]
output_file: "results.csv"

# Metrics
collect_metrics: true
metrics_output: "metrics.json"

Full field descriptions are in the docs.
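After loading the YAML, a quick sanity check for required keys can catch typos before a long run. This is a minimal sketch: the key set below is inferred from the reference above, and run_pipeline.py may validate differently.

```python
# Keys the pipeline cannot run without (inferred from the config
# reference; adjust to match the fields your run actually uses).
REQUIRED_KEYS = {
    "embedding_model", "emb_api_url",
    "milvus_uri", "collection_name",
    "llm_model", "llm_provider",
    "input_file", "output_file",
}

def missing_config_keys(cfg: dict) -> list:
    """Return a sorted list of required keys absent from cfg."""
    return sorted(REQUIRED_KEYS - cfg.keys())
```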


Input Format

prompts.json — array of objects, each with a "question" key:

[
  {"question": "What is the capital of France?"},
  {"question": "Who wrote Hamlet?"}
]

answers.json — array of ground-truth answer strings, one per question:

["Paris", "Shakespeare"]

Output Format

results.csv — one row per question:

question, prediction, reference, rouge-1, rouge-L, latency_first_token

metrics.json — pipeline timing and accuracy summary:

{
  "embed_latency": 1.23,
  "search_latency": 0.45,
  "llm_batch_latency": 8.91,
  "end_to_end_latency": 10.59,
  "avg_rouge1": 0.61,
  "avg_rougeL": 0.58
}
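For intuition about the accuracy numbers: ROUGE-1 is essentially unigram-overlap F1 between a prediction and its reference. A simplified sketch follows (the pipeline itself uses the evaluate/rouge_score packages, which also apply stemming and other normalization):

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between prediction and reference (simplified)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```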

Modules

emb/emb_query.py — fetch_embeddings()

from emb.emb_query import fetch_embeddings

embeddings = fetch_embeddings(
    ["Text 1", "Text 2"],
    model="infly/INF-retriever-v1-1.5b",
    api_url="http://localhost:8000/v1/embeddings",
)  # returns np.ndarray of shape (2, dim)

milvus/milvus_tasks_backup.py — MilvusInterface

from milvus.milvus_tasks_backup import MilvusInterface

db = MilvusInterface(uri="http://localhost:19530", token="root:Milvus")
results = db.search(
    collection_name="my_collection",
    anns_field_name="embedding",
    emb_in=embeddings.tolist(),
    limit=5,
    search_params={"metric_type": "COSINE", "params": {"ef": 64}},
    output_fields=["id", "text"],
)

llm/llm_eval.py — LLM + bulk_eval()

from llm.llm_eval import LLM, bulk_eval

llm = LLM(model="llama3", provider="ollama")
df = bulk_eval(llm, questions, retrieved_docs, references=answers, batch_size=4)

metrics.py — MetricsCollector

from metrics import MetricsCollector

metrics = MetricsCollector(enabled=True)
with metrics.timed("my_stage"):
    do_work()
metrics.save("metrics.json")

Set enabled=False to disable all collection with zero overhead.
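A collector with this interface can be sketched as a context-manager wrapper around wall-clock timing. This is an illustration based only on the usage shown above; the shipped metrics.py may record more than stage timings.

```python
import json
import time
from contextlib import contextmanager

class TimingCollector:
    """Minimal stand-in for a MetricsCollector-style timer."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self.timings = {}

    @contextmanager
    def timed(self, name):
        """Time the enclosed block and record it under `name`."""
        if not self.enabled:
            yield          # disabled: no clock reads, no recording
            return
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - t0

    def save(self, path):
        """Dump recorded timings as JSON (no-op when disabled)."""
        if self.enabled:
            with open(path, "w") as f:
                json.dump(self.timings, f, indent=2)
```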


Profiling

GPU power draw (NVIDIA):

MODEL=infly/INF-retriever-v1-1.5b DURATION_SEC=60 python profile/gpu_power.py

CPU profiling (AMD):

./profile/uprof_script.sh <output_dir>

Open-Source Datasets


Citation

@INPROCEEDINGS{11241995,
  author={Liang, Ning and Wenz, Fabian and Giceva, Jana and Wills, Lisa Wu},
  booktitle={2025 IEEE International Symposium on Workload Characterization (IISWC)},
  title={Athena: A Plug-and-Play Advisor for Retrieval-Augmented Generation using VectorDB},
  year={2025},
  pages={28-41},
  doi={10.1109/IISWC66894.2025.00013}
}
