REST API - Centralized Ollama service with VLM support (dual format), batch processing, and image compression.
Key Features:
- REST API: FastAPI-based service (port 8000) that manages Ollama internally
- Native Ollama Format: Simple, efficient, direct integration
- OpenAI-Compatible Format: For Docling and other OpenAI-compatible clients
- Automatic Management: REST API automatically starts and manages Ollama (no manual setup needed)
- Embeddings API: Generate vector embeddings for semantic search and RAG systems
- Response Caching: Intelligent caching with semantic similarity matching for faster responses
- Model Management: Create, copy, and manage custom models via API
- Automatic Memory Management: Background service automatically unloads idle models
- Fine-tuning Helpers: Scripts and tools for local model customization
- Agent System: Support for Ollama 0.13.5+ agent framework
📚 Documentation: See docs/README.md for complete documentation index.
🛠️ Stability Plan: See docs/STABILITY_PLAN.md for the hardening roadmap.
- Runtime: `fastapi` 0.122.0 + `uvicorn[standard]` 0.38.0 for the HTTP layer; `gunicorn` 23.0.0 for production process supervision; `psutil` 7.1.3, `tenacity` 9.1.2, and `cachetools` 6.2.2 for process control, retries, and caching; `Pillow` 12.0.0 for VLM image handling
- Tooling: `pytest` 9.0.1, `pytest-asyncio` 1.3.0, `pytest-cov` 7.0.0, `ruff` 0.14.6
See pyproject.toml and constraints.txt for the authoritative list of pinned versions.
This service provides a REST API (port 8000) that manages Ollama internally and makes it accessible to all projects:
- Architecture snapshot: see `docs/ARCHITECTURE.md` for diagrams, request flow, and runtime environments.
- Clean Architecture: see `docs/CLEAN_ARCHITECTURE_REFACTORING.md` for layer structure and dependency rules.
- Testing strategy: see `docs/TESTING_PLAN.md` for the comprehensive testing approach and reusable components.
- Scaling playbooks: see `docs/SCALING_AND_LOAD_TESTING.md` for concurrency tuning and load-testing guidance.

Projects using the service:

- Knowledge Machine
- Course Intelligence Compiler
- Story Machine
- Docling_Machine
Get up and running with the Shared Ollama Service in 5 minutes:
```bash
# Start REST API (automatically manages Ollama)
./scripts/core/start.sh
```

The service will:

- Auto-detect your hardware and configure optimal settings
- Start Ollama with MPS/Metal GPU acceleration (Apple Silicon)
- Launch the REST API on port 8000
```bash
# Health check
curl http://0.0.0.0:8000/api/v1/health

# List available models
curl http://0.0.0.0:8000/api/v1/models

# Inspect the active hardware profile and model recommendations
curl http://0.0.0.0:8000/api/v1/system/model-profile
```

Expected health-check response:

```json
{"status": "healthy", "ollama_status": "running"}
```

Text Chat (Native Ollama Format):
```bash
curl -X POST http://0.0.0.0:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b-q4_K_M",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence"}
    ]
  }'
```

Text Chat (OpenAI-Compatible Format):
```bash
curl -X POST http://0.0.0.0:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b-q4_K_M",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence"}
    ]
  }'
```

Vision-Language Model (with Image):
```bash
# Encode an image to base64 (replace with your image path)
IMAGE_DATA=$(python3 -c "import base64; print('data:image/jpeg;base64,' + base64.b64encode(open('photo.jpg', 'rb').read()).decode())")

curl -X POST http://0.0.0.0:8000/api/v1/vlm \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"qwen3-vl:8b-instruct-q4_K_M\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What's in this image?\"}],
    \"images\": [\"$IMAGE_DATA\"]
  }"
```

Generate Embeddings:
```bash
curl -X POST http://0.0.0.0:8000/api/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b-q4_K_M",
    "prompt": "What is machine learning?"
  }'
```

Model Management:
```bash
# List running models
curl http://0.0.0.0:8000/api/v1/models/ps

# Get detailed model information
curl http://0.0.0.0:8000/api/v1/models/qwen3:14b-q4_K_M/show

# Create a custom model from Modelfile
curl -X POST http://0.0.0.0:8000/api/v1/models/create \
  -H "Content-Type: application/json" \
  -d '{
    "name": "custom-model",
    "modelfile": "FROM qwen3:14b-q4_K_M\nSYSTEM \"You are a helpful assistant.\""
  }'
```

Open in your browser:
http://0.0.0.0:8000/api/docs
- Client Examples: See docs/CLIENT_GUIDE.md for Python, TypeScript, and Go examples
- VLM Guide: See docs/VLM_GUIDE.md for complete vision-language model documentation
- API Reference: See docs/API_REFERENCE.md for complete endpoint documentation
- Integration: See docs/INTEGRATION_GUIDE.md for project integration examples
The service is multi-tenant and backed by finite Ollama workers. Follow these guardrails so your workloads stay fast and predictable:
- Max prompt tokens: 4,096 (matches the current Ollama profile). Trim history or summarize before sending. Requests exceeding the limit are rejected with `400` / `code=prompt_too_large`.
- Max request body: 1.5 MiB (after base64). Compress or split large images.
- Timeouts: Text chat endpoints hard-stop at 120 s, VLM at 150 s. When a timeout hits you will receive `503` with `code=request_timeout`.
- Per-tenant concurrency: 6 in-flight text jobs, 3 VLM jobs. Extra requests receive `429` with `Retry-After` headers; respect them.
- Queue depth: When the shared queue is full we fail fast with `503` / `code=queue_full`. Back off exponentially (≥ 2 s with jitter) before retrying.
- Streaming encouraged: Request streamed responses (`stream=true` in the OpenAI format) so you can stop when you have enough tokens and free capacity.
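For the `429`/`503` cases above, a client-side retry loop with exponential backoff and jitter might look like the following sketch. The `send` callable is an assumption standing in for whatever HTTP call your client makes; only the `Retry-After` header comes from the service contract above:

```python
import random
import time


def backoff_delays(base: float = 2.0, cap: float = 60.0, attempts: int = 5):
    """Yield exponentially growing delays (>= 2 s) with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick anywhere in [base, ceiling] so retries spread out.
        yield random.uniform(base, max(base, ceiling))


def call_with_retries(send, max_attempts: int = 5):
    """Call `send()` until it succeeds, backing off on 429/503 responses."""
    for delay in backoff_delays(attempts=max_attempts):
        response = send()
        if response.status_code not in (429, 503):
            return response
        # Prefer the server's Retry-After hint when present.
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
    return send()  # final attempt, surfaced to the caller
```

Dropping the fixed-delay retry in favor of jitter avoids synchronized retry storms when many tenants hit `queue_full` at once.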
- Include only the minimal retrieved chunks needed for the answer. Use your RAG retriever to deduplicate and summarize source docs.
- Drop verbose system prompts; reuse the shared templates from `examples/`.
- For multi-turn chats keep the last 4–6 turns, summarize older context in your app, and prepend the summary instead of raw history.
- Propagate `X-Shared-Ollama-Request-Id` into your logs for support.
- Implement cancellation hooks: if the caller disconnects, cancel the HTTP request so the server frees the slot immediately.
- Handle structured errors: every error response includes `code` and `retry_after` (if applicable). Use those fields rather than guessing.
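The history guidance above can be sketched as a small helper that keeps the last few turns and prepends an app-side summary. How the summary is produced is up to your application; the function name here is illustrative:

```python
def trim_history(messages, summary=None, keep_turns=6):
    """Keep the system prompt, an optional summary, and the last N non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    prefix = system[:1]  # keep at most one system prompt
    if summary:
        prefix.append({"role": "user",
                       "content": f"Summary of earlier conversation: {summary}"})
    return prefix + rest[-keep_turns:]
```

Pass the result as the `messages` field of your chat request; the payload stays well under the 4,096-token prompt limit for typical turns.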
- Capture `X-Shared-Ollama-Request-Id` and search for it in `logs/api.log`.
- `prompt_too_large` → shrink the payload; call `/api/v1/system/model-profile` for live limits.
- `queue_full` or `request_timeout` → respect `Retry-After`, stagger retries, and consider lowering client-side concurrency.
- If issues persist, capture logs plus request metadata and open a ticket.
Tip: docs/STABILITY_PLAN.md tracks planned limit changes. Watch that file (or release notes) when upgrading clients.
- `prompt_too_large` (400): prompt history exceeded ~4,096 tokens. Trim or summarize before retrying.
- `request_too_large` (413): JSON body exceeded 1.5 MiB. Chunk or compress.
- `queue_full` (503): shared queue saturated. Honor `Retry-After` and stagger retries.
- `request_timeout` (503): you waited 120 s (text) / 150 s (VLM) for a slot. Reduce concurrency or payload size.
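These structured fields make client handling mechanical. A sketch of dispatching on `code` and `retry_after`, assuming the error body is flat JSON with those keys (the action labels are illustrative):

```python
def classify_error(status: int, body: dict):
    """Map a structured error response to a (action, retry_delay) pair."""
    code = body.get("code", "")
    retry_after = body.get("retry_after")
    if code == "prompt_too_large":
        return ("shrink_prompt", None)   # trim/summarize history, then retry
    if code == "request_too_large":
        return ("shrink_payload", None)  # compress or chunk images
    if code in ("queue_full", "request_timeout") or status == 429:
        return ("retry_later", retry_after or 2.0)  # honor Retry-After, else >= 2 s
    return ("give_up", None)
```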
Note: Models are loaded on-demand. Up to 3 models can be loaded simultaneously based on available RAM.
- Know your active profile: call `GET /api/v1/system/model-profile` to see which hardware profile was selected, which models are preloaded/warmed, and what RAM assumptions were used after auto-detection.
- Primary: `qwen3-vl:8b-instruct-q4_K_M` (8B parameters, quantized vision-language model) ⭐ VLM SUPPORTED
  - Optimized for laptops: Quantized Q4_K_M build keeps RAM usage around ~6 GB while retaining Qwen 3 multimodal features
  - Full multimodal capabilities: Images + text, OCR, chart/table understanding, spatial reasoning
  - Dual format support: Native Ollama format + OpenAI-compatible requests (Docling ready)
  - 128K context window: Plenty of headroom for long document/image conversations
  - Fast load + low power: Smaller footprint = faster cold starts on 32 GB MacBook Pros
- Secondary: `qwen3:14b-q4_K_M` (14B parameters, quantized dense text model)
  - Hybrid reasoning: Qwen 3 "thinking vs. fast" modes for better latency control
  - 128K context window: Handles long chat histories and RAG prompts
  - High-quality responses: 14B dense backbone with a 36T-token training run
  - Fits comfortably: ~8 GB RAM when loaded; an ideal default text model for this hardware
- High-memory profile (≥ 33 GB): `qwen3-vl:32b` (VLM) + `qwen3:30b` (text)
  - Automatically selected when `config/models.yaml` resolves to the high-memory profile
  - Full-precision multimodal reasoning with 128K+ context and hybrid thinking
  - Ideal for workstation/desktop servers running agentic or heavy RAG workloads
Models remain in memory for 5 minutes after last use (configurable via idle_timeout), then are automatically unloaded to free memory by the background cleanup service. The cleanup service also monitors system memory and aggressively unloads models when memory usage exceeds 85% to prevent memory exhaustion. Switching between models requires a brief load time (~2-3 seconds).
The service fully supports vision-language models with both native Ollama and OpenAI-compatible formats:
- Native Ollama Format (`/api/v1/vlm`): Simple, efficient, direct integration
- OpenAI-Compatible Format (`/api/v1/vlm/openai`): For Docling and other OpenAI-compatible clients
Quick Example:
```python
import base64

import requests

with open("photo.jpg", "rb") as f:
    img_data = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://0.0.0.0:8000/api/v1/vlm",
    json={
        "model": "qwen3-vl:8b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "What's in this image?"}],
        "images": [f"data:image/jpeg;base64,{img_data}"],
    },
)
print(response.json()["message"]["content"])
```

📖 Complete VLM Guide: See docs/VLM_GUIDE.md for detailed examples, batch processing, streaming, and best practices.
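Because base64 inflates images by roughly a third and the service rejects bodies over 1.5 MiB, it can help to check the encoded size before sending. A minimal helper (the function name is illustrative; re-compressing or downscaling is left to your image library):

```python
import base64

MAX_BODY_BYTES = int(1.5 * 1024 * 1024)  # service-side request body limit


def encode_image(raw: bytes, mime: str = "image/jpeg") -> str:
    """Return a data URL, or raise if the encoded form would bust the body limit."""
    encoded = base64.b64encode(raw).decode()
    data_url = f"data:{mime};base64,{encoded}"
    if len(data_url) > MAX_BODY_BYTES:
        raise ValueError(
            f"Encoded image is {len(data_url)} bytes; "
            f"compress or downscale below {MAX_BODY_BYTES}."
        )
    return data_url
```

Failing fast client-side saves a round trip that would otherwise end in a `413` / `request_too_large`.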
Generate vector embeddings for semantic search, RAG systems, and similarity matching:
```bash
curl -X POST http://0.0.0.0:8000/api/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b-q4_K_M",
    "prompt": "What is machine learning?"
  }'
```

Full CRUD operations for managing models:
- List running models: `GET /api/v1/models/ps`
- Get model details: `GET /api/v1/models/{name}/show`
- Create custom models: `POST /api/v1/models/create`
- Copy models: `POST /api/v1/models/{name}/copy`
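Once you have vectors from `/api/v1/embeddings`, semantic search usually reduces to cosine-similarity ranking on the client. A dependency-free sketch (function names are illustrative):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=3):
    """Rank (doc_id, vector) pairs by similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

For large corpora you would swap this linear scan for a vector index, but the ranking logic is the same.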
Scripts for local model customization:
```bash
# Create a Modelfile
python scripts/maintenance/fine_tune_helper.py create-modelfile \
  --base-model qwen3:14b-q4_K_M \
  --system-prompt "You are a helpful coding assistant" \
  --output Modelfile

# Create model via API
python scripts/maintenance/fine_tune_helper.py create-model \
  --name custom-assistant \
  --modelfile Modelfile
```

Background service automatically:
- Unloads idle models after 5 minutes of inactivity
- Monitors system memory and unloads models when memory > 85%
- Runs cleanup checks every 60 seconds
- Prevents memory exhaustion on single-machine setups
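The cleanup policy above boils down to a periodic decision over per-model last-use timestamps plus a memory backstop. A simplified sketch (thresholds mirror the defaults listed above; function and constant names are illustrative, not the service's internals):

```python
import time

IDLE_TIMEOUT_S = 5 * 60        # unload after 5 minutes without use
MEMORY_PRESSURE_PCT = 85.0     # unload aggressively above this usage


def models_to_unload(last_used, memory_used_pct, now=None):
    """Return model names a cleanup pass should unload.

    `last_used` maps model name -> UNIX timestamp of the last request.
    Under memory pressure every loaded model is freed, regardless of timeout.
    """
    now = now if now is not None else time.time()
    if memory_used_pct > MEMORY_PRESSURE_PCT:
        return sorted(last_used)  # aggressive mode: free everything loaded
    return sorted(name for name, ts in last_used.items()
                  if now - ts > IDLE_TIMEOUT_S)
```

Running this check every 60 seconds, as the service does, bounds how long an idle model can hold RAM.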
Support for Ollama's agent framework:
- List agents: `GET /api/v1/agents`
- Run agent: `POST /api/v1/agents/{name}/run`
- Create agent: `POST /api/v1/agents/create`
Intelligent caching system with:
- Semantic similarity matching (95% threshold)
- LRU eviction policy
- Configurable TTL (default: 1 hour)
- Thread-safe operations
- Cache statistics tracking
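Conceptually, the cache combines an LRU map and TTL expiry with a similarity check against stored prompt embeddings. A simplified, thread-safe sketch; the real implementation lives in the service, and the class and parameter names here are illustrative:

```python
import threading
import time
from collections import OrderedDict


class SemanticCache:
    """LRU + TTL cache that also returns hits for near-identical prompts."""

    def __init__(self, max_size=128, ttl_s=3600, threshold=0.95, similarity=None):
        self._data = OrderedDict()  # key -> (vector, response, stored_at)
        self._lock = threading.Lock()
        self.max_size, self.ttl_s, self.threshold = max_size, ttl_s, threshold
        self.similarity = similarity  # callable(vec_a, vec_b) -> float in [0, 1]

    def get(self, key, vector=None):
        with self._lock:
            now = time.time()
            hit = self._data.get(key)
            if hit and now - hit[2] <= self.ttl_s:
                self._data.move_to_end(key)  # refresh LRU position
                return hit[1]
            if vector is not None and self.similarity:
                for vec, resp, stored_at in self._data.values():
                    if (now - stored_at <= self.ttl_s
                            and self.similarity(vector, vec) >= self.threshold):
                        return resp  # semantic hit on a near-duplicate prompt
            return None

    def put(self, key, vector, response):
        with self._lock:
            self._data[key] = (vector, response, time.time())
            self._data.move_to_end(key)
            while len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict least recently used
```

The 95% threshold trades recall for safety: lower it and more paraphrased prompts hit the cache, at the risk of returning answers to subtly different questions.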
Complete documentation is available in the docs/ directory:
- Client Guide - Quick start examples for curl, Python, TypeScript, and Go
- VLM Guide - Complete vision-language model guide with examples
- Integration Guide - How to integrate the service into your projects
- API Reference - Complete API documentation
- POML Guide - Prompt Orchestration Markup Language support
- LiteLLM Guide - LiteLLM integration guide
- Embeddings API - Generate vector embeddings for semantic search (`/api/v1/embeddings`)
- Model Management - Create, copy, and inspect models (`/api/v1/models/*`)
- Fine-tuning Helpers - Scripts for local model customization (`scripts/maintenance/fine_tune_helper.py`)
- Agent System - Ollama 0.13.5+ agent framework support (`/api/v1/agents/*`)
- Response Caching - Intelligent caching with semantic similarity matching
- Automatic Memory Management - Background service for model cleanup
- Operations Guide - Service operations, warm-up, and pre-loading
- Monitoring Guide - Monitoring, metrics, and observability
- Resource Management - Memory usage and performance tuning
- Troubleshooting Guide - Common issues and solutions
- Configuration Guide - Complete configuration reference
- Architecture - System architecture and design
- Development Guide - Development setup and guidelines
- Stability Plan - Hardening roadmap
📚 Full Documentation Index: See docs/README.md for the complete documentation index.
```bash
# Install Ollama (native, optimized for Apple Silicon)
./scripts/install_native.sh

# Start the service
./scripts/core/start.sh

# Verify it's running
curl http://0.0.0.0:8000/api/v1/health
```

```bash
# Pre-download all required models
./scripts/preload_models.sh
```

📖 Complete Installation Guide: See docs/CONFIGURATION.md for detailed installation and configuration instructions.
When adding new models or modifying the service:
- Update installation scripts if needed
- Update preload/warmup scripts with new models
- Update this README
- Test with all projects
- Document model size and use cases
📖 Development Guide: See docs/DEVELOPMENT.md for development setup, testing, and contribution guidelines.
- `POST /api/v1/generate` - Text generation
- `POST /api/v1/chat` - Chat completion (native format)
- `POST /api/v1/vlm` - Vision-language model (native format)
- `POST /api/v1/vlm/openai` - Vision-language model (OpenAI format)
- `POST /api/v1/embeddings` - Generate embeddings ⭐ NEW
- `POST /api/v1/batch/chat` - Batch chat processing
- `POST /api/v1/batch/vlm` - Batch VLM processing

- `GET /api/v1/models` - List all available models
- `GET /api/v1/models/ps` - List running models ⭐ NEW
- `GET /api/v1/models/{name}/show` - Get model details ⭐ NEW
- `POST /api/v1/models/create` - Create custom model ⭐ NEW
- `POST /api/v1/models/{name}/copy` - Copy model ⭐ NEW

- `GET /api/v1/agents` - List agents ⭐ NEW
- `POST /api/v1/agents/{name}/run` - Run agent ⭐ NEW
- `POST /api/v1/agents/create` - Create agent ⭐ NEW

- `GET /api/v1/health` - Health check
- `GET /api/v1/queue/stats` - Queue statistics
- `GET /api/v1/metrics` - Service metrics
- `GET /api/v1/performance/stats` - Performance statistics
- `GET /api/v1/analytics` - Analytics report
- `GET /api/v1/system/model-profile` - Hardware profile and model recommendations
MIT
For issues or questions:
- Check logs: `tail -f logs/api.log`
- Run health check: `./scripts/diagnostics/health_check.sh`
- Verify service: `curl http://0.0.0.0:8000/api/v1/health`
- See docs/TROUBLESHOOTING.md for common issues and solutions
- Open issue in project repository