A multi-tool video understanding agent built with LangGraph for long-form video question answering.
VideoAgent is a modular agentic framework that employs a large language model (LLM) as a central controller for perception, decision-making, and action execution. The system adopts a ReAct-style workflow for iterative, query-driven evidence gathering, enabling efficient and accurate analysis over long videos without relying on predefined workflows or heavy pre-computation.
- Iterative Reasoning: ReAct-based workflow that progressively refines hypotheses through temporally grounded evidence gathering
- Flexible Tool Orchestration: Dynamic coordination of specialized vision experts via a unified interface for problem-oriented temporal localization and visual perception
- Hierarchical Memory: Structured memory organization (Task Context → Video Memory → Tool History → Reasoning State) for coherent long-term reasoning
- Multi-GPU Support: Centralized tool server with GPU-aware resource management for parallel processing
- Model-Aware Caching: Efficient caching for captions and descriptions to reduce redundant computation
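As a rough illustration of model-aware caching, each cached result can be keyed on the tool name, backing model, and inputs, so cached captions are invalidated automatically when the model changes. This is a minimal sketch; `cache_key` and its fields are hypothetical, not the actual implementation in `tool_cache.py`:

```python
import hashlib
import json

def cache_key(tool_name: str, model_id: str, video_id: str, params: dict) -> str:
    """Build a deterministic cache key that includes the backing model,
    so entries computed by a different model are never reused."""
    payload = json.dumps(
        {"tool": tool_name, "model": model_id, "video": video_id, "params": params},
        sort_keys=True,  # stable ordering -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Identical inputs map to the same key; changing the model changes the key.
k1 = cache_key("caption_image", "omni-captioner", "vid_001", {"frame": 12})
k2 = cache_key("caption_image", "gpt-4o-mini", "vid_001", {"frame": 12})
```

The same idea extends to any tool output worth caching (captions, detections, descriptions).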
Evaluated on the EgoSchema benchmark (500 egocentric videos, ~3 minutes each):
| Method | Accuracy | Avg. Frames |
|---|---|---|
| GPT-4V | 63.5% | - |
| InternVideo2.5 | 63.5% | 128 |
| Tarsier | 68.6% | 128 |
| VideoAgent (Ours) | 70.8% | 22.5 |
+-----------------------------------------------------------------------------+
| VideoAgent |
+-----------------------------------------------------------------------------+
| |
| +-----------------------------------------------------------------------+ |
| | LangGraph Agent (ReAct) | |
| | +-----------+ +-----------+ +--------------------------+ | |
| | | Agent | ---> | Tools | ---> | Force Answer (if needed) | | |
| | | Node | <--- | Node | | | | |
| | +-----------+ +-----------+ +--------------------------+ | |
| +-----------------------------------------------------------------------+ |
| | |
| +-------------------------------v---------------------------------------+ |
| | Hierarchical Memory | |
| | +--------------+ +--------------+ +------------+ +---------------+ | |
| | | Task Context | | Video Memory | |Tool History| |Reasoning State| | |
| | | (Q + Choices)| |(Frames+Caps) | | (Q&A Log) | | (Hypotheses) | | |
| | +--------------+ +--------------+ +------------+ +---------------+ | |
| +-----------------------------------------------------------------------+ |
| | |
| +-------------------------------v---------------------------------------+ |
| | Tool Manager | |
| | +--------+ +----------+ +-----------+ +----------+ +--------+ | |
| | | Q&A | | Retrieval| |Observation| | Detection| | ... | | |
| | | Tools | | Tools | | Tools | | Tools | | | | |
| | +--------+ +----------+ +-----------+ +----------+ +--------+ | |
| +-----------------------------------------------------------------------+ |
| |
+-----------------------------------------------------------------------------+
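The Agent → Tools → Force Answer cycle in the diagram can be sketched as a plain loop, independent of the actual LangGraph graph. All function names here are hypothetical placeholders, not the real API:

```python
def react_loop(llm_step, run_tools, force_answer, max_tool_calls=10):
    """Conceptual sketch of the ReAct cycle above (not the LangGraph code):
    iterate until the LLM answers or the tool-call budget runs out."""
    memory = []  # stands in for the hierarchical memory
    for _ in range(max_tool_calls):
        action = llm_step(memory)  # Agent node: decide to answer or call tools
        if action["type"] == "answer":
            return action["answer"]
        # Tools node: execute the requested tools and record observations
        memory.append(run_tools(action["tool_calls"]))
    # Budget exhausted: the Force Answer node picks a best guess from memory
    return force_answer(memory)
```

In the real system, `llm_step` is the LLM controller, `run_tools` dispatches through the Tool Manager, and the memory entries are structured rather than a flat list.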
VideoAgent/
├── video_agent_tools/ # Main agent package
│ ├── cli.py # Command-line interface
│ ├── evaluation.py # Batch evaluation framework
│ ├── graph.py # LangGraph agent (ReAct workflow)
│ ├── prompts.py # Agent prompts and templates
│ ├── state.py # State & memory definitions
│ ├── tools.py # Tool manager
│ ├── resource_management/ # Multi-GPU resource management
│ │ ├── gpu_manager.py # GPU allocation & scheduling
│ │ ├── tool_server.py # Centralized tool server
│ │ └── tool_client.py # Worker tool client
│ └── utils/
│ ├── logging.py # Structured logging
│ ├── tool_cache.py # Model-aware caching
│ └── video.py # Video processing utilities
│
├── tools/ # Tool interface layer
│ ├── interface_base.py # Base Interface class
│ ├── interface/ # Tool interfaces (see below)
│ └── models/ # Model weights (gitignored)
│
├── configs/ # Configuration files
├── scripts/template/eval.sh # Evaluation script template
├── data/EgoSchema_test/ # Dataset directory
├── requirements.txt
└── .env.example
git clone https://github.com/yuanyunchen/VideoAgent.git
cd VideoAgent

conda create -n videoagent python=3.10
conda activate videoagent

pip install -r requirements.txt

# Install PyTorch with CUDA (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# For local tools (optional, see Model Setup below)
pip install transformers accelerate ultralytics

cp .env.example .env
# Edit .env:
# AIML_API_KEY=your_api_key_here

# Download EgoSchema videos from https://egoschema.github.io/
# Place videos in data/EgoSchema_test/videos/

VideoAgent uses a decoupled architecture separating the Interface Layer from the Model Layer. The agent interacts with abstract interfaces, allowing seamless model updates without changing agent logic.
┌────────────────────────────────────────────────────────────────────┐
│ Agent Layer │
│ (Sees only tool descriptions, input schemas, formatted output) │
└────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Interface Layer (tools/interface/) │
│   Each tool interface exposes: AGENT_NAME, AGENT_DESCRIPTION,      │
│   AGENT_INPUT_SCHEMA, and format_output()                          │
└────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Model Layer (tools/models/) │
│ InternVideo2.5 | VideoTree | TStar | YOLO-World | DAM | ... │
└────────────────────────────────────────────────────────────────────┘
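The decoupling can be sketched as follows: the agent only ever touches the interface contract, so the backend in the Model Layer can be swapped freely. `DummyDetector` and its output fields are hypothetical stand-ins, not the real interfaces:

```python
class Interface:
    """Minimal stand-in for the base Interface class (hypothetical sketch)."""
    AGENT_NAME = ""
    AGENT_DESCRIPTION = ""

    def __call__(self, video, **kwargs):
        raise NotImplementedError

    @classmethod
    def format_output_for_agent(cls, output):
        raise NotImplementedError


class DummyDetector(Interface):
    AGENT_NAME = "detect_objects"
    AGENT_DESCRIPTION = "Detect objects in a frame"

    def __call__(self, video, query=None, **kwargs):
        # A real interface would forward to a backend model (e.g. YOLO-World)
        return {"boxes": [[0, 0, 10, 10]], "labels": ["cup"]}

    @classmethod
    def format_output_for_agent(cls, output):
        return f"Detected {len(output['labels'])} object(s): {', '.join(output['labels'])}"


# The agent sees only the formatted text, never the backend model,
# so the Model Layer can change without touching agent logic.
tool = DummyDetector()
text = DummyDetector.format_output_for_agent(tool("video.mp4"))
```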
| Category | Tool | Interface | Backend Model |
|---|---|---|---|
| Q&A | internvideo_general_qa | InternVideoGeneralQA | InternVideo2.5-Chat-8B |
| | internvideo_description | InternVideoDescription | InternVideo2.5-Chat-8B |
| | general_vqa | GeneralVQA | API-based MLLM |
| | temporal_spatial_qa | TStarTemporalSpatialQA | TStar + LLM |
| Retrieval | temporal_sample_frames | VideoTreeSampling | VideoTree (CLIP) |
| | temporal_spatial_sample_frames | TStarSampling | TStar (MobileCLIP) |
| Observation | view_frame | ViewFrame | - |
| | caption_image | OmniCaptionerCaptioning | OmniCaptioner |
| | detailed_captioning | APICaptioning | API-based MLLM |
| | describe_region | DAMDescription | DAM (Describe Anything) |
| Detection | detect_objects | YOLOWorldDetection | YOLO-World |
| | detect_all_objects | YOLOEPromptFreeDetection | YOLOE |
Local tools require downloading model weights to tools/models/. Each tool interface specifies its required model.

# InternVideo2.5: download from HuggingFace
cd tools/models
git lfs install
git clone https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B

# OmniCaptioner
cd tools/models
git clone https://huggingface.co/U4R/OmniCaptioner

VideoTree uses CLIP embeddings. The interface automatically downloads CLIP weights on first use.

# TStar
cd tools/models
git clone https://github.com/TStar-Labs/TStar
# Download MobileCLIP weights
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt

# YOLO-World / YOLOE
pip install ultralytics
# Weights are downloaded automatically on first use

# DAM (Describe Anything)
cd tools/models
git clone https://github.com/tsinghua-fib-lab/Describe-Anything-Model describe-anything
# Follow DAM installation instructions in its README

If you don't want to set up local models, you can use API-only tools:

TOOLS="general_vqa,view_frame,detailed_captioning"
CAPTIONER="gpt-4o-mini"  # Use API for captioning

# Copy template
cp scripts/template/eval.sh scripts/my_experiment.sh
# Edit configuration
vim scripts/my_experiment.sh
# Key settings:
AGENT_MODEL="x-ai/grok-4-1-fast-reasoning"
TOOLS="internvideo_general_qa,temporal_sample_frames,view_frame,detect_objects"
COUNT=100 # Number of videos (0 = all)
MAX_TOOL_CALLS=20 # Max iterations
INITIAL_FRAMES=5 # Initial context
# Run
chmod +x scripts/my_experiment.sh
./scripts/my_experiment.sh

python -m video_agent_tools.cli \
--model "gpt-4o-mini" \
--tools "temporal_sample_frames,view_frame,general_vqa" \
--max-tool-calls 15 \
--max-videos 10 \
--annotation-file data/EgoSchema_test/annotations.json \
--video-dir data/EgoSchema_test/videos \
  --experiment-name "test_run"

| Option | Description | Default |
|---|---|---|
| --model | LLM model for agent reasoning | gpt-4o-mini |
| --tools | Comma-separated tool list | See eval.sh |
| --max-tool-calls | Max tool calls per video | 10 |
| --max-parallel-tools | Max tools per turn | 3 |
| --initial-frames | Frames to caption initially | 5 |
| --captioner | Captioner (omni-captioner or API model) | omni-captioner |
| --num-workers | Parallel workers | 1 |
| --max-videos | Number of videos (-1 = all) | -1 |
results/<experiment_name>__<model>_videos_<count>_<date>/
├── logging.log # Full evaluation log
├── result.json # Complete results
├── metrics.csv # Performance metrics
├── summary.txt # Human-readable summary
├── accuracy.txt # Quick accuracy
├── experiment_config.yaml
└── videos/
└── <video_id>/
├── frames/ # Sampled frames (PNG)
├── llm.log # Full LLM interaction log
└── result.json # Per-video result
- Create an interface class in tools/interface/:

from tools.interface_base import Interface, InterfaceCategory

class MyNewTool(Interface):
    NAME = "my_new_tool"
    CATEGORY = InterfaceCategory.DETECTION
    FUNCTIONALITY = "What this tool does"

    # Agent-facing metadata
    AGENT_NAME = "my_new_tool"
    AGENT_DESCRIPTION = "Description shown to agent"
    AGENT_INPUT_SCHEMA = {
        "query": {"type": "str", "required": True, "description": "Input query"},
        "num_results": {"type": "int", "required": False, "default": 5},
    }

    def initialize(self):
        # Load model weights
        self.model = load_model("tools/models/my_model")

    def __call__(self, video, query, num_results=5, **kwargs):
        # Execute tool
        result = self.model.process(video, query)
        return {"result": result, "count": len(result)}

    @classmethod
    def format_output_for_agent(cls, output):
        # Format output as text for agent consumption
        return f"Found {output['count']} results: {output['result']}"

- Register it in tools/interface/__init__.py:

from tools.interface.my_tool import MyNewTool

INTERFACE_MAPPING["my_new_tool"] = MyNewTool

- Enable it in evaluation:

TOOLS="temporal_sample_frames,my_new_tool"

# .env file
AIML_API_KEY=your_api_key # Required: API key for LLM
AIML_BASE_URL=https://api.aimlapi.com/v1 # Optional: API endpoint
# Or use OpenAI directly
OPENAI_API_KEY=your_openai_key

| Model | Provider | Notes |
|---|---|---|
| gpt-4o-mini | OpenAI | Good value |
| gpt-4o | OpenAI | Best quality |
| x-ai/grok-4-1-fast-reasoning | xAI | Fast reasoning |
| anthropic/claude-4-sonnet | Anthropic | Strong reasoning |
| google/gemini-2.5-flash | Google | 1M context |
# Check your CUDA version
nvcc --version
# Install matching PyTorch version
# For CUDA 11.8:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

- InternVideo2.5: Requires ~16GB VRAM. Use --num-workers 1 to limit parallel processing
- Multiple tools: Reduce --max-parallel-tools to 1-2
- Large videos: The system automatically samples frames; if OOM persists, reduce --initial-frames
# Memory-constrained configuration
python -m video_agent_tools.cli \
--num-workers 1 \
--max-parallel-tools 1 \
--initial-frames 3 \
  ...

# Ensure git-lfs is installed for large model files
git lfs install
git lfs pull
# Verify model checksums
cd tools/models/InternVideo2_5_Chat_8B
git lfs fsck

# Check .env file exists and is formatted correctly
cat .env
# Should show: AIML_API_KEY=your_key_here
# Verify environment variable is loaded
python -c "import os; print(os.getenv('AIML_API_KEY', 'NOT SET'))"

- The system implements automatic retry with exponential backoff
- For high-volume evaluation, consider:
  - Reducing --num-workers
  - Using a higher-tier API plan
  - Implementing request batching
API calls use default timeouts from the OpenAI client. If experiencing timeouts, consider:
- Using a more responsive model endpoint
- Reducing --max-parallel-tools to lower concurrent API calls
- Checking network connectivity to the API endpoint
# Verify video directory structure
ls data/EgoSchema_test/videos/
# Should contain .mp4 files
# Check annotation file references correct paths
head -5 data/EgoSchema_test/annotations.json

# Test individual tools
python -c "from tools.interface import INTERFACE_MAPPING; print(list(INTERFACE_MAPPING.keys()))"
# Initialize specific tool for debugging
python -c "
from tools.interface import INTERFACE_MAPPING
tool = INTERFACE_MAPPING['internvideo_general_qa']()
tool.initialize()
print('Tool initialized successfully')
"

Q: Can I run without GPU?
A: API-only tools (general_vqa, view_frame, detailed_captioning) work on CPU. Local vision models require CUDA.
Q: How do I resume a failed evaluation?
A: Use --restore-path pointing to the previous result directory:
python -m video_agent_tools.cli --restore-path results/previous_run/ ...

Completed videos are loaded and skipped automatically.
Q: Why is the first run slow?
A: Model weights are loaded on first use. Subsequent runs use cached models.
Q: How do I use a different LLM provider?
A: Set OPENAI_API_KEY for OpenAI, or configure AIML_BASE_URL for other OpenAI-compatible APIs.
- EgoSchema - Benchmark dataset
- LangGraph - Agent framework
- InternVideo2.5 - Video understanding
- VideoTree - Frame sampling
- TStar - Temporal-spatial understanding
- YOLO-World - Open-vocabulary detection
- DAM - Region description
This project is for research purposes.