VideoAgent Tutorial

A multi-tool video understanding agent built with LangGraph for long-form video question answering.

Overview

VideoAgent is a modular agentic framework that employs a large language model (LLM) as a central controller for perception, decision-making, and action execution. The system adopts a ReAct-style workflow for iterative, query-driven evidence gathering, enabling efficient and accurate analysis over long videos without relying on predefined workflows or heavy pre-computation.

Key Features

  • Iterative Reasoning: ReAct-based workflow that progressively refines hypotheses through temporally grounded evidence gathering
  • Flexible Tool Orchestration: Dynamic coordination of specialized vision experts via a unified interface for problem-oriented temporal localization and visual perception
  • Hierarchical Memory: Structured memory organization (Task Context → Video Memory → Tool History → Reasoning State) for coherent long-term reasoning
  • Multi-GPU Support: Centralized tool server with GPU-aware resource management for parallel processing
  • Model-Aware Caching: Efficient caching for captions and descriptions to reduce redundant computation
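To illustrate the model-aware caching idea, here is a minimal sketch (not the repo's actual `tool_cache.py` implementation): results are keyed on the model identifier together with the inputs, so switching captioners can never return stale output produced by a different model. The class and method names are hypothetical.

```python
import hashlib
import json

class ModelAwareCache:
    """Sketch of a model-aware cache: the cache key includes the model
    name as well as the inputs, so two different captioners never share
    (or clobber) each other's cached results."""

    def __init__(self):
        self._store = {}  # in-memory; a real cache would persist to disk

    def _key(self, model_name, video_id, frame_idx, prompt):
        payload = json.dumps(
            {"model": model_name, "video": video_id,
             "frame": frame_idx, "prompt": prompt},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, model_name, video_id, frame_idx, prompt, compute):
        key = self._key(model_name, video_id, frame_idx, prompt)
        if key not in self._store:
            self._store[key] = compute()  # runs only on a cache miss
        return self._store[key]
```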

Performance

Evaluated on the EgoSchema benchmark (500 egocentric videos, ~3 minutes each):

| Method | Accuracy | Avg. Frames |
|---|---|---|
| GPT-4V | 63.5% | - |
| InternVideo2.5 | 63.5% | 128 |
| Tarsier | 68.6% | 128 |
| VideoAgent (Ours) | 70.8% | 22.5 |

Architecture

+-----------------------------------------------------------------------------+
|                               VideoAgent                                    |
+-----------------------------------------------------------------------------+
|                                                                             |
|  +-----------------------------------------------------------------------+  |
|  |                      LangGraph Agent (ReAct)                          |  |
|  |  +-----------+      +-----------+      +--------------------------+   |  |
|  |  |   Agent   | ---> |   Tools   | ---> | Force Answer (if needed) |   |  |
|  |  |    Node   | <--- |    Node   |      |                          |   |  |
|  |  +-----------+      +-----------+      +--------------------------+   |  |
|  +-----------------------------------------------------------------------+  |
|                                  |                                          |
|  +-------------------------------v---------------------------------------+  |
|  |                       Hierarchical Memory                             |  |
|  |  +--------------+ +--------------+ +------------+ +---------------+   |  |
|  |  | Task Context | | Video Memory | |Tool History| |Reasoning State|   |  |
|  |  | (Q + Choices)| |(Frames+Caps) | | (Q&A Log)  | | (Hypotheses)  |   |  |
|  |  +--------------+ +--------------+ +------------+ +---------------+   |  |
|  +-----------------------------------------------------------------------+  |
|                                  |                                          |
|  +-------------------------------v---------------------------------------+  |
|  |                          Tool Manager                                 |  |
|  |  +--------+ +----------+ +-----------+ +----------+ +--------+        |  |
|  |  |  Q&A   | | Retrieval| |Observation| | Detection| |  ...   |        |  |
|  |  | Tools  | |  Tools   | |   Tools   | |  Tools   | |        |        |  |
|  |  +--------+ +----------+ +-----------+ +----------+ +--------+        |  |
|  +-----------------------------------------------------------------------+  |
|                                                                             |
+-----------------------------------------------------------------------------+
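The Agent → Tools → Force Answer control flow in the diagram can be sketched as a plain loop. This is an illustration of the ReAct pattern, not the repo's LangGraph code; `llm_step`, `execute_tools`, and `force_answer` are placeholder callables standing in for the real graph nodes.

```python
def run_react_loop(llm_step, execute_tools, force_answer, max_tool_calls=20):
    """Sketch of the ReAct control flow: the agent alternates between
    reasoning (llm_step) and acting (execute_tools) until it commits to
    an answer or the tool-call budget runs out, at which point a final
    answer is forced from the evidence gathered so far.

    llm_step(observations) returns ("answer", text) or ("tools", calls);
    execute_tools(calls) returns a list of observations;
    force_answer(observations) produces a best-effort final answer.
    """
    observations = []
    for _ in range(max_tool_calls):
        kind, payload = llm_step(observations)
        if kind == "answer":  # agent is confident -> stop early
            return payload
        observations.extend(execute_tools(payload))  # gather evidence
    # budget exhausted: force a best-effort answer from partial evidence
    return force_answer(observations)
```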

Project Structure

VideoAgent/
├── video_agent_tools/              # Main agent package
│   ├── cli.py                      # Command-line interface
│   ├── evaluation.py               # Batch evaluation framework
│   ├── graph.py                    # LangGraph agent (ReAct workflow)
│   ├── prompts.py                  # Agent prompts and templates
│   ├── state.py                    # State & memory definitions
│   ├── tools.py                    # Tool manager
│   ├── resource_management/        # Multi-GPU resource management
│   │   ├── gpu_manager.py          # GPU allocation & scheduling
│   │   ├── tool_server.py          # Centralized tool server
│   │   └── tool_client.py          # Worker tool client
│   └── utils/
│       ├── logging.py              # Structured logging
│       ├── tool_cache.py           # Model-aware caching
│       └── video.py                # Video processing utilities
│
├── tools/                          # Tool interface layer
│   ├── interface_base.py           # Base Interface class
│   ├── interface/                  # Tool interfaces (see below)
│   └── models/                     # Model weights (gitignored)
│
├── configs/                        # Configuration files
├── scripts/template/eval.sh        # Evaluation script template
├── data/EgoSchema_test/            # Dataset directory
├── requirements.txt
└── .env.example

Installation

1. Clone Repository

git clone https://github.com/yuanyunchen/VideoAgent.git
cd VideoAgent

2. Create Environment

conda create -n videoagent python=3.10
conda activate videoagent

3. Install Dependencies

pip install -r requirements.txt

# Install PyTorch with CUDA (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# For local tools (optional, see Model Setup below)
pip install transformers accelerate ultralytics

4. Configure API Key

cp .env.example .env
# Edit .env:
# AIML_API_KEY=your_api_key_here

5. Prepare Dataset

# Download EgoSchema videos from https://egoschema.github.io/
# Place videos in data/EgoSchema_test/videos/

Tool Interface System

VideoAgent uses a decoupled architecture separating the Interface Layer from the Model Layer. The agent interacts with abstract interfaces, allowing seamless model updates without changing agent logic.

Interface Architecture

┌────────────────────────────────────────────────────────────────────┐
│                          Agent Layer                               │
│    (Sees only tool descriptions, input schemas, formatted output)  │
└────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────────┐
│                   Interface Layer (tools/interface/)               │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  │
│  │    AGENT_NAME    │  │ AGENT_DESCRIPTION│  │ AGENT_INPUT_     │  │
│  │ AGENT_DESCRIPTION│  │  AGENT_INPUT_    │  │     SCHEMA       │  │
│  │  format_output() │  │     SCHEMA       │  │  format_output() │  │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────────┐
│                    Model Layer (tools/models/)                     │
│   InternVideo2.5 | VideoTree | TStar | YOLO-World | DAM | ...      │
└────────────────────────────────────────────────────────────────────┘

Available Tools

| Category | Tool | Interface | Backend Model |
|---|---|---|---|
| Q&A | internvideo_general_qa | InternVideoGeneralQA | InternVideo2.5-Chat-8B |
| Q&A | internvideo_description | InternVideoDescription | InternVideo2.5-Chat-8B |
| Q&A | general_vqa | GeneralVQA | API-based MLLM |
| Q&A | temporal_spatial_qa | TStarTemporalSpatialQA | TStar + LLM |
| Retrieval | temporal_sample_frames | VideoTreeSampling | VideoTree (CLIP) |
| Retrieval | temporal_spatial_sample_frames | TStarSampling | TStar (MobileCLIP) |
| Observation | view_frame | ViewFrame | - |
| Observation | caption_image | OmniCaptionerCaptioning | OmniCaptioner |
| Observation | detailed_captioning | APICaptioning | API-based MLLM |
| Observation | describe_region | DAMDescription | DAM (Describe Anything) |
| Detection | detect_objects | YOLOWorldDetection | YOLO-World |
| Detection | detect_all_objects | YOLOEPromptFreeDetection | YOLOE |

Model Setup

Local tools require downloading model weights to tools/models/. Each tool interface specifies its required model.

Required Models

InternVideo2.5 (for internvideo_general_qa, internvideo_description)

# Download from HuggingFace
cd tools/models
git lfs install
git clone https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B

OmniCaptioner (for caption_image)

cd tools/models
git clone https://huggingface.co/U4R/OmniCaptioner

VideoTree (for temporal_sample_frames)

VideoTree uses CLIP embeddings. The interface automatically downloads CLIP weights on first use.

TStar (for temporal_spatial_sample_frames, temporal_spatial_qa)

cd tools/models
git clone https://github.com/TStar-Labs/TStar

# Download MobileCLIP weights
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt

YOLO-World (for detect_objects)

pip install ultralytics
# Weights are downloaded automatically on first use

DAM (for describe_region)

cd tools/models
git clone https://github.com/tsinghua-fib-lab/Describe-Anything-Model describe-anything

# Follow DAM installation instructions in its README

API-Only Mode

If you don't want to set up local models, you can use API-only tools:

TOOLS="general_vqa,view_frame,detailed_captioning"
CAPTIONER="gpt-4o-mini"  # Use API for captioning

Running Experiments

Using Evaluation Script (Recommended)

# Copy template
cp scripts/template/eval.sh scripts/my_experiment.sh

# Edit configuration
vim scripts/my_experiment.sh

# Key settings:
AGENT_MODEL="x-ai/grok-4-1-fast-reasoning"
TOOLS="internvideo_general_qa,temporal_sample_frames,view_frame,detect_objects"
COUNT=100               # Number of videos (0 = all)
MAX_TOOL_CALLS=20       # Max iterations
INITIAL_FRAMES=5        # Initial context

# Run
chmod +x scripts/my_experiment.sh
./scripts/my_experiment.sh

Using CLI

python -m video_agent_tools.cli \
    --model "gpt-4o-mini" \
    --tools "temporal_sample_frames,view_frame,general_vqa" \
    --max-tool-calls 15 \
    --max-videos 10 \
    --annotation-file data/EgoSchema_test/annotations.json \
    --video-dir data/EgoSchema_test/videos \
    --experiment-name "test_run"

CLI Options

| Option | Description | Default |
|---|---|---|
| --model | LLM model for agent reasoning | gpt-4o-mini |
| --tools | Comma-separated tool list | See eval.sh |
| --max-tool-calls | Max tool calls per video | 10 |
| --max-parallel-tools | Max tools per turn | 3 |
| --initial-frames | Frames to caption initially | 5 |
| --captioner | Captioner (omni-captioner or API model) | omni-captioner |
| --num-workers | Parallel workers | 1 |
| --max-videos | Number of videos (-1 = all) | -1 |

Output Structure

results/<experiment_name>__<model>_videos_<count>_<date>/
├── logging.log           # Full evaluation log
├── result.json           # Complete results
├── metrics.csv           # Performance metrics
├── summary.txt           # Human-readable summary
├── accuracy.txt          # Quick accuracy
├── experiment_config.yaml
└── videos/
    └── <video_id>/
        ├── frames/       # Sampled frames (PNG)
        ├── llm.log       # Full LLM interaction log
        └── result.json   # Per-video result
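Given this layout, overall accuracy can be recomputed from the per-video files. The sketch below assumes each per-video `result.json` contains a boolean `"correct"` field; the actual schema may differ, so treat the field name as an assumption.

```python
import json
from pathlib import Path

def aggregate_accuracy(results_dir):
    """Walk <results_dir>/videos/<video_id>/result.json files and compute
    overall accuracy. Assumes each per-video result.json has a boolean
    "correct" field (an assumption about the schema, not confirmed)."""
    records = []
    for path in sorted(Path(results_dir).glob("videos/*/result.json")):
        with open(path) as f:
            records.append(json.load(f))
    if not records:
        return 0.0, 0  # no completed videos yet
    correct = sum(1 for r in records if r.get("correct"))
    return correct / len(records), len(records)
```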

Extending VideoAgent

Adding a New Tool

  1. Create an interface class in tools/interface/:
from tools.interface_base import Interface, InterfaceCategory

class MyNewTool(Interface):
    NAME = "my_new_tool"
    CATEGORY = InterfaceCategory.DETECTION
    FUNCTIONALITY = "What this tool does"
    
    # Agent-facing metadata
    AGENT_NAME = "my_new_tool"
    AGENT_DESCRIPTION = "Description shown to agent"
    AGENT_INPUT_SCHEMA = {
        "query": {"type": "str", "required": True, "description": "Input query"},
        "num_results": {"type": "int", "required": False, "default": 5}
    }
    
    def initialize(self):
        # Load model weights
        self.model = load_model("tools/models/my_model")
    
    def __call__(self, video, query, num_results=5, **kwargs):
        # Execute tool
        result = self.model.process(video, query)
        return {"result": result, "count": len(result)}
    
    @classmethod
    def format_output_for_agent(cls, output):
        # Format output as text for agent consumption
        return f"Found {output['count']} results: {output['result']}"
  2. Register in tools/interface/__init__.py:
from tools.interface.my_tool import MyNewTool

INTERFACE_MAPPING["my_new_tool"] = MyNewTool
  3. Enable in evaluation:
TOOLS="temporal_sample_frames,my_new_tool"

Configuration

Environment Variables

# .env file
AIML_API_KEY=your_api_key      # Required: API key for LLM
AIML_BASE_URL=https://api.aimlapi.com/v1  # Optional: API endpoint

# Or use OpenAI directly
OPENAI_API_KEY=your_openai_key
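For reference, loading a `.env` file like the one above takes only a few lines of standard-library Python. Projects typically use python-dotenv's `load_dotenv()` for this; the function below is just an illustrative stand-in showing what that step does.

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv(): parse
    KEY=VALUE lines from a .env file into os.environ, skipping blank
    lines and comments. Existing variables are not overwritten."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # ignore comments, blanks, malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```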

Supported LLM Models

| Model | Provider | Notes |
|---|---|---|
| gpt-4o-mini | OpenAI | Good value |
| gpt-4o | OpenAI | Best quality |
| x-ai/grok-4-1-fast-reasoning | xAI | Fast reasoning |
| anthropic/claude-4-sonnet | Anthropic | Strong reasoning |
| google/gemini-2.5-flash | Google | 1M context |

Troubleshooting

Installation Issues

CUDA Version Mismatch

# Check your CUDA version
nvcc --version

# Install matching PyTorch version
# For CUDA 11.8:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Out of Memory (OOM) Errors

  • InternVideo2.5: Requires ~16GB VRAM. Use --num-workers 1 to limit parallel processing
  • Multiple tools: Reduce --max-parallel-tools to 1-2
  • Large videos: The system automatically samples frames; if OOM persists, reduce --initial-frames
# Memory-constrained configuration
python -m video_agent_tools.cli \
    --num-workers 1 \
    --max-parallel-tools 1 \
    --initial-frames 3 \
    ...

Model Loading Failures

# Ensure git-lfs is installed for large model files
git lfs install
git lfs pull

# Verify model checksums
cd tools/models/InternVideo2_5_Chat_8B
git lfs fsck

API Issues

API Key Not Found

# Check .env file exists and is formatted correctly
cat .env
# Should show: AIML_API_KEY=your_key_here

# Verify environment variable is loaded
python -c "import os; print(os.getenv('AIML_API_KEY', 'NOT SET'))"

Rate Limiting

  • The system implements automatic retry with exponential backoff
  • For high-volume evaluation, consider:
    • Reducing --num-workers
    • Using a higher-tier API plan
    • Implementing request batching
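The retry-with-exponential-backoff behavior described above can be sketched as a small wrapper. This is an illustration of the pattern, not the repo's actual retry code; the function name and parameters are hypothetical.

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Wrap fn so that failures are retried with exponential backoff:
    wait base_delay * 2**attempt seconds (plus a little jitter) between
    attempts, re-raising the last error once retries are exhausted."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except retry_on:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the error
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
                time.sleep(delay)
    return wrapped
```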

Timeout Errors

API calls use the OpenAI client's default timeouts. If you experience timeouts, consider:

  • Using a more responsive model endpoint
  • Reducing --max-parallel-tools to lower concurrent API calls
  • Checking network connectivity to API endpoint

Runtime Issues

Video Not Found

# Verify video directory structure
ls data/EgoSchema_test/videos/
# Should contain .mp4 files

# Check annotation file references correct paths
head -5 data/EgoSchema_test/annotations.json

Tool Initialization Failures

# Test individual tools
python -c "from tools.interface import INTERFACE_MAPPING; print(list(INTERFACE_MAPPING.keys()))"

# Initialize specific tool for debugging
python -c "
from tools.interface import INTERFACE_MAPPING
tool = INTERFACE_MAPPING['internvideo_general_qa']()
tool.initialize()
print('Tool initialized successfully')
"

FAQ

Q: Can I run without GPU? A: API-only tools (general_vqa, view_frame, detailed_captioning) work on CPU. Local vision models require CUDA.

Q: How do I resume a failed evaluation? A: Use --restore-path pointing to the previous result directory:

python -m video_agent_tools.cli --restore-path results/previous_run/ ...

Completed videos are loaded and skipped automatically.

Q: Why is the first run slow? A: Model weights are loaded on first use. Subsequent runs use cached models.

Q: How do I use a different LLM provider? A: Set OPENAI_API_KEY for OpenAI, or configure AIML_BASE_URL for other OpenAI-compatible APIs.


License

This project is for research purposes.
