A comprehensive benchmark comparing how different large language models tackle "vibe coding" — creating a fully functional website for Przeprogramowani.pl in a single attempt, without iterative refinement.
This repository evaluates the practical capabilities of various state-of-the-art LLMs by having each model create a website implementation based on the same prompt and content specifications. The results provide insights into each model's ability to understand requirements, generate code, and produce production-ready web solutions.
Vibe coding represents a one-shot approach to web development where an LLM must:
- Understand the complete project requirements from a single prompt
- Extract and properly format content specifications
- Generate a functional, well-structured website
- Produce clean, maintainable code without iterative debugging
| Path | Purpose |
|---|---|
| `./website/` | Astro-based results dashboard (displays all benchmark results) |
| `./scripts/process-results.ts` | TypeScript script that processes CSV results and generates dashboard data |
| `./eval-attempts/` | Model implementations (one-shot attempts) |
| `./eval-results/` | Processed evaluation result files |
Each model's implementation is stored in a dedicated directory under `./eval-attempts/`.

Each `eval-results/{model-name}-attempt-{number}` directory contains `eval-results.csv` with criterion-by-criterion evaluation scores. Multiple attempt directories per model indicate iterative benchmark runs.
- Prompt: Each model receives the same input prompt (see 10x-bench-eval)
- Content: Reference content and specifications are maintained in 10x-bench-eval
- Implementation: Models generate website code in their respective attempt directories under `./eval-attempts/`
- Evaluation: All implementations are assessed using the criteria and tooling from 10x-bench-eval
- Results Processing: The `scripts/process-results.ts` script parses evaluation CSV files and generates data for the dashboard
- Results Dashboard: An Astro-based static website (in `./website/`) displays comparative results with interactive tables and summaries
The benchmark evaluates implementations across multiple dimensions:
- Technical Stack: Framework choices, code organization, and architecture
- Page Structure: Proper implementation of all required pages and routes
- Content Accuracy: Correct use of provided copy and content
- SEO & Metadata: Proper handling of titles, descriptions, and semantic HTML
- Responsive Design: Mobile-friendliness and responsive layout implementation
- Code Quality: Readability, maintainability, and best practices
- Functionality: Working features and user interactions
For detailed criteria, see `./benchmark/criteria.md`.
Benchmark results are displayed in an interactive Astro-based static website:
The `./website/` directory contains the results dashboard, featuring:

- Overview page showing all attempts sorted by performance
- Interactive results table with sticky headers and a frozen first column
- Model family averages
- Benchmark details page displaying the prompt and evaluation criteria
- Data automatically processed from CSV evaluation files via `scripts/process-results.ts`
```bash
# Install dependencies
npm install

# Build and start development server (processes results and runs Astro)
npm run dev

# Open http://localhost:3000 in your browser
```

Benchmark prompt, evaluation criteria, and reference content are maintained in the companion repo: 10x-bench-eval.
To explore model implementations: `ls -la ./eval-attempts/`
```bash
npm run build
```

This processes all evaluation results and generates a static, production-ready site in `./website/dist/`.
The benchmark uses an automated data pipeline to convert raw evaluation results into the interactive dashboard:
- Input: Each attempt directory contains `eval-result.csv` with criterion scores
- Processing: `scripts/process-results.ts` parses CSV files and calculates:
  - Total score for each attempt (excluding "Task completion time")
  - Percentage score relative to the maximum possible score
  - Model family averages across all attempts
- Output: Generates `website/src/data/results.json`
- Display: The Astro website statically renders the dashboard using the JSON data
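The scoring step above can be sketched roughly as follows. Note that the types, the `totalScore` function name, and the example rows are illustrative assumptions for this README, not the actual code in `scripts/process-results.ts`:

```typescript
// Hypothetical row shape; the real script's internal types may differ.
interface CriterionRow {
  criterion: string;
  score: number;
  max: number;
}

// Sum scores and maxima, excluding the "Task completion time" row,
// then derive the percentage relative to the maximum possible score.
function totalScore(rows: CriterionRow[]): { total: number; max: number; percent: number } {
  const scored = rows.filter((r) => r.criterion !== "Task completion time");
  const total = scored.reduce((sum, r) => sum + r.score, 0);
  const max = scored.reduce((sum, r) => sum + r.max, 0);
  return { total, max, percent: max > 0 ? (total / max) * 100 : 0 };
}

// Made-up example rows for illustration only.
const example: CriterionRow[] = [
  { criterion: "Responsive Design", score: 2, max: 3 },
  { criterion: "SEO & Metadata", score: 1, max: 1 },
  { criterion: "Task completion time", score: 5, max: 5 }, // excluded from totals
];

console.log(totalScore(example)); // → { total: 3, max: 4, percent: 75 }
```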
The script supports two CSV formats:
- New format: `Criterion,Score,Max,Notes`
- Legacy format: `Criterion,Score,Notes` (assumes Max=1)
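A minimal sketch of how the two formats could be distinguished by column count. `parseRow` and its types are hypothetical (not taken from the actual script), and the naive `split(",")` used here would mishandle quoted commas that a real CSV parser handles:

```typescript
// Hypothetical parsed-row shape for illustration.
interface ParsedRow {
  criterion: string;
  score: number;
  max: number;
  notes: string;
}

// Four or more columns means the new Criterion,Score,Max,Notes format;
// three columns means the legacy format, where Max defaults to 1.
function parseRow(line: string): ParsedRow {
  const fields = line.split(",");
  if (fields.length >= 4) {
    const [criterion, score, max, ...notes] = fields;
    return { criterion, score: Number(score), max: Number(max), notes: notes.join(",") };
  }
  const [criterion, score, ...notes] = fields;
  return { criterion, score: Number(score), max: 1, notes: notes.join(",") };
}

parseRow("Content Accuracy,1,Looks good"); // legacy row: max defaults to 1
```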
This benchmark serves as a practical evaluation tool for:
- Understanding LLM capabilities in web development
- Assessing code generation quality across different models
- Identifying strengths and weaknesses in one-shot implementation scenarios
- Informing technology choices for AI-assisted development workflows
| Repository | Purpose |
|---|---|
| 10x-bench (this repo) | Model implementations, results dashboard, data processing, and the /run-eval skill |
| 10x-bench-eval | Evaluation criteria, scoring methodology, benchmark prompt, reference content |
Note: Each attempt represents a completely independent, one-shot effort with no iterative refinement or human intervention during implementation.