A comprehensive benchmark comparing how different large language models tackle "vibe coding" — creating a fully functional website for Przeprogramowani.pl in a single attempt, without iterative refinement.
This repository evaluates the practical capabilities of various state-of-the-art LLMs by having each model create a website implementation based on the same prompt and content specifications. The results provide insights into each model's ability to understand requirements, generate code, and produce production-ready web solutions.
Vibe coding represents a one-shot approach to web development where an LLM must:
- Understand the complete project requirements from a single prompt
- Extract and properly format content specifications
- Generate a functional, well-structured website
- Produce clean, maintainable code without iterative debugging
| Path | Purpose |
|---|---|
| `./website/` | Astro-based results dashboard (displays all benchmark results) |
| `./scripts/process-results.ts` | TypeScript script that processes CSV results and generates dashboard data |
| `./eval-attempts/` | Model implementations (one-shot attempts) |
| `./eval-results/` | Processed evaluation result files |
Each model's implementation is stored in a dedicated directory under `./eval-attempts/`.

Each `eval-results/{model-name}-attempt-{number}` directory contains `eval-results.csv` with criterion-by-criterion evaluation scores. Multiple attempt directories per model indicate iterative benchmark runs.
- Prompt: Each model receives the same input prompt (see 10x-bench-eval)
- Content: Reference content and specifications are maintained in 10x-bench-eval
- Implementation: Models generate website code in their respective attempt directories under `./eval-attempts/`
- Evaluation: All implementations are assessed using the criteria and tooling from 10x-bench-eval
- Results Processing: The `scripts/process-results.ts` script parses evaluation CSV files and generates data for the dashboard
- Results Dashboard: An Astro-based static website (in `./website/`) displays comparative results with interactive tables and summaries
The benchmark evaluates implementations across multiple dimensions:
- Technical Stack: Framework choices, code organization, and architecture
- Page Structure: Proper implementation of all required pages and routes
- Content Accuracy: Correct use of provided copy and content
- SEO & Metadata: Proper handling of titles, descriptions, and semantic HTML
- Responsive Design: Mobile-friendliness and responsive layout implementation
- Code Quality: Readability, maintainability, and best practices
- Functionality: Working features and user interactions
For detailed criteria, see `./benchmark/criteria.md`.
Benchmark results are displayed in an interactive Astro-based static website:
The `./website/` directory contains the results dashboard, featuring:

- Overview page showing all attempts sorted by performance
- Interactive results table with sticky headers and a frozen first column
- Model family averages
- Benchmark details page displaying the prompt and evaluation criteria
- Data automatically processed from CSV evaluation files via `scripts/process-results.ts`
```bash
# Install dependencies
npm install

# Build and start development server (processes results and runs Astro)
npm run dev

# Open http://localhost:3000 in your browser
```

Benchmark prompt, evaluation criteria, and reference content are maintained in the companion repo: 10x-bench-eval.
To explore model implementations: `ls -la ./eval-attempts/`
```bash
npm run build
```

This processes all evaluation results and generates a static, production-ready site in `./website/dist/`.
The benchmark uses an automated data pipeline to convert raw evaluation results into the interactive dashboard:
- Input: Each attempt directory contains `eval-result.csv` with criterion scores
- Processing: `scripts/process-results.ts` parses CSV files and calculates:
  - Total score for each attempt (excluding "Task completion time")
  - Percentage score relative to the maximum possible score
  - Model family averages across all attempts
- Output: Generates `website/src/data/results.json`
- Display: The Astro website statically renders the dashboard using the JSON data
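The scoring step above can be sketched roughly as follows. Note that the types, the `totalScore` function name, and the example rows are illustrative assumptions for this README, not the actual code in `scripts/process-results.ts`:

```typescript
// Hypothetical row shape; the real script's internal types may differ.
interface CriterionRow {
  criterion: string;
  score: number;
  max: number;
}

// Sum scores and maxima, excluding the "Task completion time" row,
// then derive the percentage relative to the maximum possible score.
function totalScore(rows: CriterionRow[]): { total: number; max: number; percent: number } {
  const scored = rows.filter((r) => r.criterion !== "Task completion time");
  const total = scored.reduce((sum, r) => sum + r.score, 0);
  const max = scored.reduce((sum, r) => sum + r.max, 0);
  return { total, max, percent: max > 0 ? (total / max) * 100 : 0 };
}

// Made-up example rows for illustration only.
const example: CriterionRow[] = [
  { criterion: "Responsive Design", score: 2, max: 3 },
  { criterion: "SEO & Metadata", score: 1, max: 1 },
  { criterion: "Task completion time", score: 5, max: 5 }, // excluded from totals
];

console.log(totalScore(example)); // → { total: 3, max: 4, percent: 75 }
```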
The script supports two CSV formats:
- New format: `Criterion,Score,Max,Notes`
- Legacy format: `Criterion,Score,Notes` (assumes Max=1)
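A minimal sketch of how the two formats could be distinguished by column count. `parseRow` and its types are hypothetical (not taken from the actual script), and the naive `split(",")` used here would mishandle quoted commas that a real CSV parser handles:

```typescript
// Hypothetical parsed-row shape for illustration.
interface ParsedRow {
  criterion: string;
  score: number;
  max: number;
  notes: string;
}

// Four or more columns means the new Criterion,Score,Max,Notes format;
// three columns means the legacy format, where Max defaults to 1.
function parseRow(line: string): ParsedRow {
  const fields = line.split(",");
  if (fields.length >= 4) {
    const [criterion, score, max, ...notes] = fields;
    return { criterion, score: Number(score), max: Number(max), notes: notes.join(",") };
  }
  const [criterion, score, ...notes] = fields;
  return { criterion, score: Number(score), max: 1, notes: notes.join(",") };
}

parseRow("Content Accuracy,1,Looks good"); // legacy row: max defaults to 1
```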
This benchmark serves as a practical evaluation tool for:
- Understanding LLM capabilities in web development
- Assessing code generation quality across different models
- Identifying strengths and weaknesses in one-shot implementation scenarios
- Informing technology choices for AI-assisted development workflows
| Repository | Purpose |
|---|---|
| 10x-bench (this repo) | Model implementations, results dashboard, data processing, and the /run-eval skill |
| 10x-bench-eval | Evaluation criteria, scoring methodology, benchmark prompt, reference content |
Note: Each attempt represents a completely independent, one-shot effort with no iterative refinement or human intervention during implementation.