- Terminal Bench – Benchmark for LLMs on complex terminal tasks. [paper]
- SkillsBench – Evaluating how well skills work and how effective agents are at using them. [paper]
- LMCache – The fastest KV cache layer for LLMs. [paper]
- OT Agent – Open-source terminal agent from the Open Thoughts team. [blog]
- ClawsBench – Benchmark for claw-like agents. [paper]
- Harbor – Agent evaluation framework and RL environment toolkit. Contributor.
- lmcache-agent-trace – Agent application, benchmark, and workload traces for LLM serving research.
- claude-code-tracing – Tracing tooling for Claude Code agent runs. [blog]
- vLLM / production-stack – High-throughput LLM inference engine and its K8s-native serving stack. Contributor.
- inference-engine-arena – A Postman- and Chatbot-Arena-style tool for inference benchmarking. (Open-sourced ~3 months before SemiAnalysisAI/InferenceX.)
- cacheserve – KV-cache-aware serving experiments. [paper]
- lmcache-trace-analysis / mooncake-trace-replayer – Trace analysis & replay for inference workloads.
- Continuum – Multi-turn LLM agent scheduling with KV-cache time-to-live for efficient serving. Contributor. [paper]
- VidGen – Diffusion + autoregressive models for interactive video/game generation (Diffusive AI).
- LAG – Research experiments.
- citation-verifier – Verifying citations produced by LLM agents (TypeScript).