kekePower

Having fun scripting and hacking Linux and Mageia

One Prompt, Dozens of AIs: The Great Single-File Webpage Bake-Off

Alright, let's talk AI. It feels like every week there's a new large language model (LLM) popping up, promising to revolutionize everything from writing emails to coding complex applications. As tech enthusiasts, we're constantly bombarded with benchmarks and claims, but sometimes you just want to see how these things perform on a simple, practical task. No complex frameworks, no iterative refinement – just one prompt, one shot, and see what happens.

That's exactly what I decided to do. I was curious about how the current landscape of AI models would handle a straightforward web development request: create a basic, modern-looking webpage in a single HTML file. The idea wasn't necessarily to find the best model overall, but rather to get a raw, unfiltered look at how different AIs interpret instructions and generate code right out of the box. Think of it like an AI bake-off, where each contestant gets the same recipe (the prompt) and we get to see the sometimes beautiful, sometimes baffling results.

You can find the experiment setup and the full results page here.

The Challenge: Crafting a Webpage from a Single Wish

To keep things fair and comparable, I used the exact same prompt for every single AI model tested:

Create a beautiful web page design that includes a fixed navbar
at the top (make sure the content scrolls under it) with a logo on the
left and 5 links that are on the right. The max width of the page is
1000px. Create a beautiful and modern color palette. Use 1 or 2 nice and
readable fonts. You can include Font Awesome for icons and other
elements. Use HTML, CSS and JS. Give me everything in one single HTML
file.

The key constraints and requests here were:

  1. Single File Output: Everything (HTML structure, CSS styling, any basic JS) needed to be bundled into one .html file. This tests the AI's ability to manage embedded or inline styles and scripts effectively.
  2. Fixed Navbar: A common UI pattern, requiring specific CSS (position: fixed, plus handling potential overlap issues).
  3. Layout Constraints: Logo left, links right, and a crucial max-width: 1000px for the page content.
  4. Aesthetics: Subjective, of course, but the prompt asked for a beautiful, modern color palette and nice, readable fonts. This tests the AI's design sense.
  5. Potential Enhancement: Suggestion to use Font Awesome, testing if the AI can incorporate external libraries via CDN.
  6. The Golden Rule: Only one attempt. No asking for changes, no clarifications, no "make it prettier". What you see is the AI's first interpretation.

This prompt, while seemingly simple, actually touches upon several core front-end concepts: layout, positioning, styling, typography, responsiveness (implicitly via max-width), and resource management (single file).
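To make those constraints concrete, here is a rough sketch of the kind of skeleton the prompt is fishing for: one HTML file with a fixed navbar, the content offset so it scrolls underneath the bar instead of hiding behind it, flexbox pushing the logo left and the links right, and a wrapper capped at 1000px. This is a minimal hand-written illustration, not any model's actual output, and the colors, class names, and link labels are just placeholders.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sketch</title>
  <!-- Font Awesome could be pulled in here via a CDN <link> if icons are wanted -->
  <style>
    body { margin: 0; font-family: Georgia, serif; }
    /* Fixed navbar: stays pinned to the top while the page scrolls under it */
    nav {
      position: fixed;
      top: 0; left: 0; right: 0;
      height: 60px;
      display: flex;
      justify-content: space-between;  /* logo on the left, links on the right */
      align-items: center;
      padding: 0 1rem;
      background: #1f2937;
      color: #f9fafb;
    }
    nav ul { display: flex; gap: 1rem; list-style: none; margin: 0; padding: 0; }
    /* Offset the content so the first heading isn't hidden behind the navbar */
    main {
      max-width: 1000px;       /* the constraint several models ignored */
      margin: 80px auto 0;     /* 60px navbar height plus breathing room */
      padding: 0 1rem;
    }
  </style>
</head>
<body>
  <nav>
    <div class="logo">MySite</div>
    <ul>
      <li><a href="#">Home</a></li>
      <li><a href="#">About</a></li>
      <li><a href="#">Services</a></li>
      <li><a href="#">Blog</a></li>
      <li><a href="#">Contact</a></li>
    </ul>
  </nav>
  <main>
    <h1>Hello</h1>
    <p>Page content scrolls underneath the fixed navbar.</p>
  </main>
</body>
</html>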

The Contenders: An Army of AI Models Answered the Call

And here's where things got wild. I didn't just test the usual suspects. I threw this prompt at a huge range of models available as of mid-2025. We're talking dozens of contestants, including:

  • OpenAI: Various GPT-4 flavors plus the o1, o3, and o4 series, in different sizes (mini, nano).
  • Google: Gemini (2.0 Flash, 2.5 Pro, 2.5 Flash) and Gemma (3-12B, 3-27B).
  • Meta: Llama 4 (specifically the 17B Maverick variant, accessed via Novita.ai).
  • Anthropic: Claude 3.5 and 3.7 Sonnet (including a Reasoning variant).
  • xAI: Grok models (2, 3, mini, Thinking).
  • Industry Models: IBM Granite 3.3, Microsoft Phi 4 (and mini).
  • Specialized/Other Models: DeepSeek (V3, R1 Turbo), DeepCoder (14B), Cogito (from 3B up to 32B).
  • UI-Focused Tool: Vercel V0.

The sheer number of models involved turned this simple test into a fascinating survey of the AI landscape. Having access to models like Llama 4 17B Maverick, DeepSeek V3, and Gemma 3 27B (thanks to platforms like Novita.ai) alongside the big players provided a really broad comparison.

Observations from the Front Lines: The Good, The Bad, and The Baffling

Running the same prompt through so many models yielded exactly what I half-expected: massive variability. Here are some key takeaways from the process and the results:

  1. The Process Was (Mostly) Painless: For the most part, getting the code was straightforward. Paste the prompt, wait a bit, and copy the HTML. It was actually quite fun seeing the different interpretations materialize. The single-shot nature meant no tedious back-and-forth.
  2. Performance Spectrum: HUGE Variation: This was the biggest takeaway. Some models absolutely nailed the request, delivering clean, functional, and visually appealing pages. Others... not so much. The range went from "Wow, that's genuinely usable!" to "Did you even read the prompt?" Some outputs were borderline unusable or failed to generate anything resembling a complete webpage. It really felt like an LLM lottery sometimes.
  3. Attention to Detail? Hit or Miss: Remember that max-width: 1000px constraint? Quite a few models simply ignored it, letting the content stretch to the full viewport width. This highlights a common issue: LLMs can sometimes miss or misinterpret specific constraints, especially when juggling multiple instructions. It shows the need for careful verification, even with seemingly simple requests.
  4. Speed and Resources Matter: While cloud-based models were generally quick, some of the models I ran locally (like the larger 32B parameter models, possibly the Cogito ones) took noticeably longer. This is a practical consideration – generation time can vary significantly based on the model size and the hardware running it.
  5. Design Sense Varies Wildly: The request for a beautiful and modern design is subjective, of course. But the results showed a vast difference in aesthetic capabilities. Some models produced genuinely pleasant layouts with good color choices and typography. Others defaulted to very basic, uninspired designs or made questionable stylistic choices.
  6. Code Quality Hints: Although I don't have the detailed code analysis here (check the original page for the raw HTML!), my initial impressions (and previous experiences comparing models like Gemini and Claude) suggest variations in code practices. Some models might lean heavily on inline styles (easier to generate in one go, but harder to maintain), while others might generate more structured embedded CSS within <style> tags, which is generally preferred. This single-file constraint likely pushed many models towards one of these two approaches; there's a quick illustration of the difference right after this list.
  7. The Standouts (My Personal Picks): Based purely on the visual output and adherence to the prompt, a few models really shone. I quite liked the designs produced by OpenAI's GPT-4.1, GPT-4.1-mini, and ChatGPT o3. Vercel's V0, being specifically designed for UI generation, also produced a standout result. While comparing, I also felt Google's Gemini (though I can't recall the exact version from memory right now) did a solid job overall, striking a good balance.
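For point 6, the inline-versus-embedded distinction looks roughly like this. It's a contrived fragment, not lifted from any model's output: inline styles repeat the same declarations on every element, while a single rule in an embedded <style> block covers them all, which matters when everything has to live in one file.

<!-- Inline styles: quick to emit in one pass, painful to maintain -->
<a href="#" style="color: #2563eb; text-decoration: none; font-weight: 600;">Home</a>
<a href="#" style="color: #2563eb; text-decoration: none; font-weight: 600;">About</a>

<!-- Embedded CSS: one rule in a <style> block styles every nav link -->
<style>
  nav a { color: #2563eb; text-decoration: none; font-weight: 600; }
</style>
<nav>
  <a href="#">Home</a>
  <a href="#">About</a>
</nav>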

The Bigger Picture: What Does This Massive Test Tell Us?

Stepping back from the individual results, what does this grand experiment reveal about using AI for web development in 2025?

  • AI as a Raw, Unpredictable Tool: The sheer variety underscores that LLMs are powerful but often raw tools. They can generate code incredibly quickly, but the quality, accuracy, and adherence to specifics can be inconsistent. Don't expect polished, production-ready code from a single prompt without review.
  • It's a Model Lottery: You can't assume all LLMs are created equal, even within the same family (like different GPT or Gemini versions). The performance on a specific task can vary significantly. If you're using AI for coding, it pays to try your prompt on a couple of different models or versions.
  • Baseline Capability Exposed: This single-shot approach reveals the baseline capability of each model. It shows what they produce without human guidance or iterative refinement. It's a measure of their raw understanding and generation power for this specific task. More complex tasks would likely require much more interaction.
  • Need for Human Oversight: The ignored constraints, variable code quality, and sometimes nonsensical outputs clearly demonstrate that human developers aren't going anywhere soon. AI can be a fantastic assistant, a pair programmer, or a rapid prototyper, but it needs direction, verification, and often significant correction.
  • Specialization Matters (Maybe): While hard to definitively conclude without deeper code analysis, it's plausible that models explicitly fine-tuned for code generation or UI design (like Vercel V0) might have an edge in consistency or quality for relevant tasks, though generalist models are clearly catching up fast.

A Fun Experiment, A Sobering Reality Check

This whole exercise started as a simple test born out of curiosity (and maybe a little frustration with initial results from a specific model). It ballooned into a comparison across dozens of AIs, offering a unique snapshot of the current landscape.

The results were fascinating and fun to explore. The key takeaway? Variability reigns supreme. While some LLMs can generate impressive results even from a single prompt, the output quality is far from uniform across the board. We saw everything from elegant designs to broken pages, highlighting both the incredible potential and the current limitations of AI in coding tasks.

For fellow tech enthusiasts and developers dabbling with AI, experiments like these are valuable. They ground the hype in reality and show us where these tools excel and where they still stumble. AI is undoubtedly a powerful force multiplier, but it's still a tool that requires skill, judgment, and a healthy dose of verification to wield effectively.

What are your experiences with AI code generation? Have you run similar tests? Let me know your thoughts!