A multi-tool video understanding agent built with LangGraph for long-form video question answering.
VideoAgent is a modular agentic framework that employs a large language model (LLM) as a central controller for perception, decision-making, and action execution. The system adopts a ReAct-style workflow for iterative, query-driven evidence gathering, enabling efficient and accurate analysis over long videos without relying on predefined workflows or heavy pre-computation.
- Iterative Reasoning: ReAct-based workflow that progressively refines hypotheses through temporally grounded evidence gathering
- Flexible Tool Orchestration: Dynamic coordination of specialized vision experts via a unified interface for problem-oriented temporal localization and visual perception
- Hierarchical Memory: Structured memory organization (Task Context → Video Memory → Tool History → Reasoning State) for coherent long-term reasoning
- Multi-GPU Support: Centralized tool server with GPU-aware resource management for parallel processing
- Model-Aware Caching: Efficient caching for captions and descriptions to reduce redundant computation
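As a rough illustration of model-aware caching, each cached result can be keyed on the tool name, backing model, and inputs, so cached captions are invalidated automatically when the model changes. This is a minimal sketch; `cache_key` and its fields are hypothetical, not the actual implementation in `tool_cache.py`:

```python
import hashlib
import json

def cache_key(tool_name: str, model_id: str, video_id: str, params: dict) -> str:
    """Build a deterministic cache key that includes the backing model,
    so entries computed by a different model are never reused."""
    payload = json.dumps(
        {"tool": tool_name, "model": model_id, "video": video_id, "params": params},
        sort_keys=True,  # stable ordering -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Identical inputs map to the same key; changing the model changes the key.
k1 = cache_key("caption_image", "omni-captioner", "vid_001", {"frame": 12})
k2 = cache_key("caption_image", "gpt-4o-mini", "vid_001", {"frame": 12})
```

The same idea extends to any tool output worth caching (captions, detections, descriptions).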
Evaluated on the EgoSchema benchmark (500 egocentric videos, ~3 minutes each):
| Method | Accuracy | Avg. Frames |
|---|---|---|
| GPT-4V | 63.5% | - |
| InternVideo2.5 | 63.5% | 128 |
| Tarsier | 68.6% | 128 |
| VideoAgent (Ours) | 70.8% | 22.5 |
+-----------------------------------------------------------------------------+
| VideoAgent |
+-----------------------------------------------------------------------------+
| |
| +-----------------------------------------------------------------------+ |
| | LangGraph Agent (ReAct) | |
| | +-----------+ +-----------+ +--------------------------+ | |
| | | Agent | ---> | Tools | ---> | Force Answer (if needed) | | |
| | | Node | <--- | Node | | | | |
| | +-----------+ +-----------+ +--------------------------+ | |
| +-----------------------------------------------------------------------+ |
| | |
| +-------------------------------v---------------------------------------+ |
| | Hierarchical Memory | |
| | +--------------+ +--------------+ +------------+ +---------------+ | |
| | | Task Context | | Video Memory | |Tool History| |Reasoning State| | |
| | | (Q + Choices)| |(Frames+Caps) | | (Q&A Log) | | (Hypotheses) | | |
| | +--------------+ +--------------+ +------------+ +---------------+ | |
| +-----------------------------------------------------------------------+ |
| | |
| +-------------------------------v---------------------------------------+ |
| | Tool Manager | |
| | +--------+ +----------+ +-----------+ +----------+ +--------+ | |
| | | Q&A | | Retrieval| |Observation| | Detection| | ... | | |
| | | Tools | | Tools | | Tools | | Tools | | | | |
| | +--------+ +----------+ +-----------+ +----------+ +--------+ | |
| +-----------------------------------------------------------------------+ |
| |
+-----------------------------------------------------------------------------+
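The Agent → Tools → Force Answer cycle in the diagram can be sketched as a plain loop, independent of the actual LangGraph graph. All function names here are hypothetical placeholders, not the real API:

```python
def react_loop(llm_step, run_tools, force_answer, max_tool_calls=10):
    """Conceptual sketch of the ReAct cycle above (not the LangGraph code):
    iterate until the LLM answers or the tool-call budget runs out."""
    memory = []  # stands in for the hierarchical memory
    for _ in range(max_tool_calls):
        action = llm_step(memory)  # Agent node: decide to answer or call tools
        if action["type"] == "answer":
            return action["answer"]
        # Tools node: execute the requested tools and record observations
        memory.append(run_tools(action["tool_calls"]))
    # Budget exhausted: the Force Answer node picks a best guess from memory
    return force_answer(memory)
```

In the real system, `llm_step` is the LLM controller, `run_tools` dispatches through the Tool Manager, and the memory entries are structured rather than a flat list.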
VideoAgent/
├── video_agent_tools/ # Main agent package
│ ├── cli.py # Command-line interface
│ ├── evaluation.py # Batch evaluation framework
│ ├── graph.py # LangGraph agent (ReAct workflow)
│ ├── prompts.py # Agent prompts and templates
│ ├── state.py # State & memory definitions
│ ├── tools.py # Tool manager
│ ├── resource_management/ # Multi-GPU resource management
│ │ ├── gpu_manager.py # GPU allocation & scheduling
│ │ ├── tool_server.py # Centralized tool server
│ │ └── tool_client.py # Worker tool client
│ └── utils/
│ ├── logging.py # Structured logging
│ ├── tool_cache.py # Model-aware caching
│ └── video.py # Video processing utilities
│
├── tools/ # Tool interface layer
│ ├── interface_base.py # Base Interface class
│ ├── interface/ # Tool interfaces (see below)
│ └── models/ # Model weights (gitignored)
│
├── configs/ # Configuration files
├── scripts/template/eval.sh # Evaluation script template
├── data/EgoSchema_test/ # Dataset directory
├── requirements.txt
└── .env.example
git clone https://github.com/yuanyunchen/VideoAgent.git
cd VideoAgent

conda create -n videoagent python=3.10
conda activate videoagent

pip install -r requirements.txt

# Install PyTorch with CUDA (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# For local tools (optional, see Model Setup below)
pip install transformers accelerate ultralytics

cp .env.example .env
# Edit .env:
# AIML_API_KEY=your_api_key_here

# Download EgoSchema videos from https://egoschema.github.io/
# Place videos in data/EgoSchema_test/videos/

VideoAgent uses a decoupled architecture separating the Interface Layer from the Model Layer. The agent interacts with abstract interfaces, allowing seamless model updates without changing agent logic.
┌────────────────────────────────────────────────────────────────────┐
│ Agent Layer │
│ (Sees only tool descriptions, input schemas, formatted output) │
└────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Interface Layer (tools/interface/) │
│   Each tool interface exposes: AGENT_NAME, AGENT_DESCRIPTION,      │
│   AGENT_INPUT_SCHEMA, and format_output()                          │
└────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Model Layer (tools/models/) │
│ InternVideo2.5 | VideoTree | TStar | YOLO-World | DAM | ... │
└────────────────────────────────────────────────────────────────────┘
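The decoupling can be sketched as follows: the agent only ever touches the interface contract, so the backend in the Model Layer can be swapped freely. `DummyDetector` and its output fields are hypothetical stand-ins, not the real interfaces:

```python
class Interface:
    """Minimal stand-in for the base Interface class (hypothetical sketch)."""
    AGENT_NAME = ""
    AGENT_DESCRIPTION = ""

    def __call__(self, video, **kwargs):
        raise NotImplementedError

    @classmethod
    def format_output_for_agent(cls, output):
        raise NotImplementedError


class DummyDetector(Interface):
    AGENT_NAME = "detect_objects"
    AGENT_DESCRIPTION = "Detect objects in a frame"

    def __call__(self, video, query=None, **kwargs):
        # A real interface would forward to a backend model (e.g. YOLO-World)
        return {"boxes": [[0, 0, 10, 10]], "labels": ["cup"]}

    @classmethod
    def format_output_for_agent(cls, output):
        return f"Detected {len(output['labels'])} object(s): {', '.join(output['labels'])}"


# The agent sees only the formatted text, never the backend model,
# so the Model Layer can change without touching agent logic.
tool = DummyDetector()
text = DummyDetector.format_output_for_agent(tool("video.mp4"))
```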
| Category | Tool | Interface | Backend Model |
|---|---|---|---|
| Q&A | internvideo_general_qa | InternVideoGeneralQA | InternVideo2.5-Chat-8B |
| | internvideo_description | InternVideoDescription | InternVideo2.5-Chat-8B |
| | general_vqa | GeneralVQA | API-based MLLM |
| | temporal_spatial_qa | TStarTemporalSpatialQA | TStar + LLM |
| Retrieval | temporal_sample_frames | VideoTreeSampling | VideoTree (CLIP) |
| | temporal_spatial_sample_frames | TStarSampling | TStar (MobileCLIP) |
| Observation | view_frame | ViewFrame | - |
| | caption_image | OmniCaptionerCaptioning | OmniCaptioner |
| | detailed_captioning | APICaptioning | API-based MLLM |
| | describe_region | DAMDescription | DAM (Describe Anything) |
| Detection | detect_objects | YOLOWorldDetection | YOLO-World |
| | detect_all_objects | YOLOEPromptFreeDetection | YOLOE |
Local tools require downloading model weights to tools/models/. Each tool interface specifies its required model.

# InternVideo2.5: download from HuggingFace
cd tools/models
git lfs install
git clone https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B

# OmniCaptioner
cd tools/models
git clone https://huggingface.co/U4R/OmniCaptioner

VideoTree uses CLIP embeddings. The interface automatically downloads CLIP weights on first use.

# TStar
cd tools/models
git clone https://github.com/TStar-Labs/TStar
# Download MobileCLIP weights
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt

# YOLO-World / YOLOE
pip install ultralytics
# Weights are downloaded automatically on first use

# DAM (Describe Anything)
cd tools/models
git clone https://github.com/tsinghua-fib-lab/Describe-Anything-Model describe-anything
# Follow DAM installation instructions in its README

If you don't want to set up local models, you can use API-only tools:

TOOLS="general_vqa,view_frame,detailed_captioning"
CAPTIONER="gpt-4o-mini"  # Use API for captioning

# Copy template
cp scripts/template/eval.sh scripts/my_experiment.sh
# Edit configuration
vim scripts/my_experiment.sh
# Key settings:
AGENT_MODEL="x-ai/grok-4-1-fast-reasoning"
TOOLS="internvideo_general_qa,temporal_sample_frames,view_frame,detect_objects"
COUNT=100 # Number of videos (0 = all)
MAX_TOOL_CALLS=20 # Max iterations
INITIAL_FRAMES=5 # Initial context
# Run
chmod +x scripts/my_experiment.sh
./scripts/my_experiment.sh

python -m video_agent_tools.cli \
--model "gpt-4o-mini" \
--tools "temporal_sample_frames,view_frame,general_vqa" \
--max-tool-calls 15 \
--max-videos 10 \
--annotation-file data/EgoSchema_test/annotations.json \
--video-dir data/EgoSchema_test/videos \
  --experiment-name "test_run"

| Option | Description | Default |
|---|---|---|
| --model | LLM model for agent reasoning | gpt-4o-mini |
| --tools | Comma-separated tool list | See eval.sh |
| --max-tool-calls | Max tool calls per video | 10 |
| --max-parallel-tools | Max tools per turn | 3 |
| --initial-frames | Frames to caption initially | 5 |
| --captioner | Captioner (omni-captioner or API model) | omni-captioner |
| --num-workers | Parallel workers | 1 |
| --max-videos | Number of videos (-1 = all) | -1 |
results/<experiment_name>__<model>_videos_<count>_<date>/
├── logging.log # Full evaluation log
├── result.json # Complete results
├── metrics.csv # Performance metrics
├── summary.txt # Human-readable summary
├── accuracy.txt # Quick accuracy
├── experiment_config.yaml
└── videos/
└── <video_id>/
├── frames/ # Sampled frames (PNG)
├── llm.log # Full LLM interaction log
└── result.json # Per-video result
- Create an interface class in tools/interface/:

from tools.interface_base import Interface, InterfaceCategory

class MyNewTool(Interface):
    NAME = "my_new_tool"
    CATEGORY = InterfaceCategory.DETECTION
    FUNCTIONALITY = "What this tool does"

    # Agent-facing metadata
    AGENT_NAME = "my_new_tool"
    AGENT_DESCRIPTION = "Description shown to agent"
    AGENT_INPUT_SCHEMA = {
        "query": {"type": "str", "required": True, "description": "Input query"},
        "num_results": {"type": "int", "required": False, "default": 5},
    }

    def initialize(self):
        # Load model weights
        self.model = load_model("tools/models/my_model")

    def __call__(self, video, query, num_results=5, **kwargs):
        # Execute tool
        result = self.model.process(video, query)
        return {"result": result, "count": len(result)}

    @classmethod
    def format_output_for_agent(cls, output):
        # Format output as text for agent consumption
        return f"Found {output['count']} results: {output['result']}"

- Register it in tools/interface/__init__.py:

from tools.interface.my_tool import MyNewTool

INTERFACE_MAPPING["my_new_tool"] = MyNewTool

- Enable it in evaluation:

TOOLS="temporal_sample_frames,my_new_tool"

# .env file
AIML_API_KEY=your_api_key # Required: API key for LLM
AIML_BASE_URL=https://api.aimlapi.com/v1 # Optional: API endpoint
# Or use OpenAI directly
OPENAI_API_KEY=your_openai_key

| Model | Provider | Notes |
|---|---|---|
| gpt-4o-mini | OpenAI | Good value |
| gpt-4o | OpenAI | Best quality |
| x-ai/grok-4-1-fast-reasoning | xAI | Fast reasoning |
| anthropic/claude-4-sonnet | Anthropic | Strong reasoning |
| google/gemini-2.5-flash | Google | 1M context |
# Check your CUDA version
nvcc --version
# Install matching PyTorch version
# For CUDA 11.8:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

- InternVideo2.5: Requires ~16GB VRAM. Use --num-workers 1 to limit parallel processing
- Multiple tools: Reduce --max-parallel-tools to 1-2
- Large videos: The system automatically samples frames; if OOM persists, reduce --initial-frames
# Memory-constrained configuration
python -m video_agent_tools.cli \
--num-workers 1 \
--max-parallel-tools 1 \
--initial-frames 3 \
  ...

# Ensure git-lfs is installed for large model files
git lfs install
git lfs pull
# Verify model checksums
cd tools/models/InternVideo2_5_Chat_8B
git lfs fsck

# Check .env file exists and is formatted correctly
cat .env
# Should show: AIML_API_KEY=your_key_here
# Verify environment variable is loaded
python -c "import os; print(os.getenv('AIML_API_KEY', 'NOT SET'))"

- The system implements automatic retry with exponential backoff
- For high-volume evaluation, consider:
  - Reducing --num-workers
  - Using a higher-tier API plan
  - Implementing request batching
API calls use default timeouts from the OpenAI client. If experiencing timeouts, consider:
- Using a more responsive model endpoint
- Reducing --max-parallel-tools to lower concurrent API calls
- Checking network connectivity to the API endpoint
# Verify video directory structure
ls data/EgoSchema_test/videos/
# Should contain .mp4 files
# Check annotation file references correct paths
head -5 data/EgoSchema_test/annotations.json

# Test individual tools
python -c "from tools.interface import INTERFACE_MAPPING; print(list(INTERFACE_MAPPING.keys()))"
# Initialize specific tool for debugging
python -c "
from tools.interface import INTERFACE_MAPPING
tool = INTERFACE_MAPPING['internvideo_general_qa']()
tool.initialize()
print('Tool initialized successfully')
"

Q: Can I run without GPU?
A: API-only tools (general_vqa, view_frame, detailed_captioning) work on CPU. Local vision models require CUDA.
Q: How do I resume a failed evaluation?
A: Use --restore-path pointing to the previous result directory:
python -m video_agent_tools.cli --restore-path results/previous_run/ ...

Completed videos are loaded and skipped automatically.
Q: Why is the first run slow?
A: Model weights are loaded on first use. Subsequent runs use cached models.
Q: How do I use a different LLM provider?
A: Set OPENAI_API_KEY for OpenAI, or configure AIML_BASE_URL for other OpenAI-compatible APIs.
- EgoSchema - Benchmark dataset
- LangGraph - Agent framework
- InternVideo2.5 - Video understanding
- VideoTree - Frame sampling
- TStar - Temporal-spatial understanding
- YOLO-World - Open-vocabulary detection
- DAM - Region description
This project is for research purposes.