The 2025 Polymath LLM Show-Down: How Twenty‑Two Models Fared Under a Single Grueling Prompt
In mid‑May 2025 I set out to run an ambitious comparative benchmark on the very same machine that edits this article: a Lenovo Legion 5 gaming laptop powered by a Ryzen 7 5800H (8 cores/16 threads), 32 GB DDR4‑3200, an NVIDIA RTX 3070 Laptop GPU (8 GB VRAM, 115 W TDP) and a 1 TB NVMe SSD. Nothing about that rig screams “data‑center,” yet with the right tooling it can host surprisingly capable large‑language models (LLMs).
My goal was simple, but far from easy: stress‑test twenty‑two LLMs, mixing local weights running through Ollama (which embeds a tuned llama.cpp backend) and cloud‑hosted APIs, using a single, multi‑disciplinary prompt that forces them to reason, code, translate, weigh ethics, and write fiction under a tight word budget. The result is the most exhaustive hands‑on test I have ever run, clocking in at roughly 4 900 words including the full prompt text so you can reproduce everything. Buckle up and scroll; the details matter.
Why This Test?
The open‑weights renaissance has shrunk what once required data‑center GPUs into sub‑10 GB downloads. Meanwhile, the big vendors keep raising the ceiling with Claude 3, Gemini 2.5, and GPT‑o3 at a cadence that outpaces Moore’s law. I use both camps daily: Ollama for private code experiments and cloud APIs for collaborative docs. Yet I rarely knew, in hard numbers, how much capability I was giving up when the Wi‑Fi switch stayed off. This benchmark is my attempt to answer three concrete questions:
- Accuracy: Can a locally quantized model on 8 GB VRAM solve the same problems as frontier clouds?
- Speed: At what token‑per‑second (tok/s) does a conversation still feel snappy?
- Trade‑offs: When does paying for an API beat burning laptop watts?
Hardware & Environment
Everything local ran through Ollama 0.7.0 (ollama run model_name --verbose), configured to offload as many layers as possible onto the RTX 3070. For fp16 checkpoints that exceeded 8 GB, Ollama paged the residual layers to system RAM with minor throughput loss; quantized weights (q4_K_M, q8) fit comfortably. CPU fallback leveraged the Zen 3 cores at 45 W sustained. The laptop sat on a cooling pad to avoid thermal throttling, which matters because even short bursts at 115 W can spike temperatures toward 87 °C. Cloud calls went out over a 1 Gbps fiber link (12 ms RTT to Western‑Europe regions). All measurements strip out HTTP overhead so the inference engine is what we compare. I recorded first‑token latency separately but omit it from the numeric table for clarity.
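If you want to reproduce a single local run, here is a minimal sketch against Ollama’s REST API (the same server that ollama run talks to). The endpoint and timing fields follow the /api/generate schema as I understand it; the model tag and prompt filename are illustrative placeholders rather than my actual harness.

```python
import requests

# Hypothetical file holding the full Polymath Challenge prompt from below.
PROMPT = open("polymath_prompt.md", encoding="utf-8").read()

# Non-streaming request so Ollama returns its timing fields in one JSON blob.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "cogito:8b", "prompt": PROMPT, "stream": False},  # model tag is illustrative
    timeout=600,
).json()

tokens = resp["eval_count"]              # tokens generated
seconds = resp["eval_duration"] / 1e9    # eval_duration is reported in nanoseconds
print(resp["response"][:300])            # start of the model's answer
print(f"throughput: {tokens / seconds:.1f} tok/s")
```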
The Benchmark Prompt (“The Polymath Challenge”)
Transparency matters, so here is the exact text, crafted with GPT‑4o, that every model received, untouched. Feel free to paste it into your own toolchain. (⇩ scroll past the prompt if you already know it ⇩)
The Polymath Challenge
You are a polymath AI with expertise in philosophy, programming, linguistics, ethics, and speculative fiction. Complete the following five tasks in the order given, keeping each section clearly labeled with Markdown headers (##, ###, etc.). Each section must be self‑contained and not rely on previous answers. Avoid hallucinations and be as accurate as possible. Output must be valid Markdown.
1. Logic Puzzle (≤150 words) A man looks at a painting in a museum and says, “Brothers and sisters, I have none. But that man’s father is my father’s son.” Who is in the painting? Explain your reasoning step by step.
2. Code Challenge Write a Python function that returns all prime numbers between 1 and n (inclusive). The function must run in under 1 s for n = 10 000 on standard hardware. Also write one unit test for n = 20.
3. Multilingual Translation Translate the sentence below into Spanish, Norwegian, and Japanese. Use language labels for each translation: The boundaries between truth and illusion are often drawn by the observer.
4. Ethical Dilemma You are the AI assistant to a journalist who has uncovered a government surveillance program that collects user data from encrypted messaging apps without consent. Should the journalist go public? Present both sides with three bullet points each, then give a reasoned conclusion.
5. Fiction Fragment (150–200 words)
Write a scene that includes:
- A person waking up in a world where gravity no longer works as expected
- A mysterious floating cube
- The sentence “It wasn’t the sky that had changed, but me.”
Formatting: valid Markdown only, start each section with its header, ≤700 words total.
Scoring Rubric
- Tasks_Passed: Fully correct answers under 700 words.
- Tasks_Partial: Directionally right but with an error or instruction breach.
- Speed_tok/s: Mean tokens per second during generation.
- Speed_pts: +2 (≥80 tok/s), +1 (15–79 tok/s), 0 (7–14 tok/s), –1 (<7 tok/s).
- Total_Score: Tasks_Passed × 2 + Tasks_Partial + Speed_pts (max 11; see the sketch below).
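Since the rubric is purely mechanical, it can be computed in a few lines. A minimal sketch of the rules above; the function and argument names are mine, not part of the original harness.

```python
def speed_points(tok_per_s: float) -> int:
    """Map mean generation speed to the Speed_pts bonus defined in the rubric."""
    if tok_per_s >= 80:
        return 2
    if tok_per_s >= 15:
        return 1
    if tok_per_s >= 7:
        return 0
    return -1

def total_score(tasks_passed: int, tasks_partial: int, tok_per_s: float) -> int:
    """Total_Score = Tasks_Passed * 2 + Tasks_Partial + Speed_pts."""
    return tasks_passed * 2 + tasks_partial + speed_points(tok_per_s)

# Examples from the results table:
print(total_score(5, 0, 28))   # Cogito 8B -> 11
print(total_score(4, 1, 72))   # Qwen 4B  -> 11
```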
Raw Results
| Model | Size / Quant | Tasks Passed | Tasks Partial | Speed tok/s | Speed_pts | Total | Run Type | Notes |
|---|---|---|---|---|---|---|---|---|
| Cogito 8B | 8 B‑fp16 | 5 | 0 | 28 | 1 | 11 | Local | Best balance on 3070 |
| Qwen 4B | 4 B‑fp16 | 4 | 1 | 72 | 2 | 11 | Local | Tiny diacritic slip |
| Gemma 4B‑q8 | 4 B‑q8 | 4 | 1 | 16 | 1 | 10 | Local | Memory‑friendliest |
| Cogito 14B | 14 B‑fp16 | 5 | 0 | 6.3 | 0 | 10 | Local | Near‑perfect, slower |
| Phi‑4 | 14 B‑fp16 | 5 | 0 | 6 | 0 | 10 | Local | Solid reasoning |
| Cogito 32B | 32 B‑q4_K_M | 5 | 0 | 2.2 | –1 | 9 | Local | VRAM‑starved |
| Gemma 27B‑q4 | 27 B‑q4 | 5 | 0 | 2.3 | –1 | 9 | Local | Same issue |
| Qwen 32B | 32 B‑fp16 | 5 | 0 | 1.8 | –1 | 9 | Local | Swaps layers to RAM |
| Gemma 12B‑q8 | 12 B‑q8 | 4 | 1 | 3.2 | –1 | 8 | Local | One wrong comma |
| Cogito 3B | 3 B‑fp16 | 2 | 2 | 92 | 2 | 8 | Local | Very fast, logic miss |
| Gemma 1B‑it | 1 B‑it | 2 | 1 | 84 | 2 | 7 | Local | Limited reasoning |
| Qwen 0.6B | 0.6 B‑fp16 | 2 | 1 | 206 | 2 | 7 | Local | Ultra‑fast, weak |
| Phi‑mini | 3.8 B‑fp16 | 2 | 1 | 7.4 | 0 | 5 | Local | Verbose fiction |
| Phi‑mini‑reasoning | 3.8 B‑q8 | 0 | 0 | 3.6 | –1 | 0 | Local | Crashed in task 1 |
| DeepSeek R1 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Via Novita.ai |
| DeepSeek V3 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Via Novita.ai |
| Claude 3.7 Sonnet | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Stylistic winner |
| Gemini 2.5 Pro | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Clean code |
| GPT‑o3 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Terse, elegant |
| Grok 3 Thinking | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Dark humor |
First Glance: What Leaps Off the Page?
- Perfect 5s Are Plentiful: Fourteen models cleared the full prompt without error.
- Speed Is the Bottleneck: The RTX 3070 pushes 28 tok/s on Cogito 8B (a usable chat speed) but only ~2 tok/s on the 32 B giants.
- Partial Credit Tracks Model Size: Sub‑4 B models stumble on the logic puzzle and multilingual nuances.
- Cloud Models Omit Speed: API latency (~900 ms to first token, plus streaming time) is felt by the user but not captured in the table; still, their reasoning remains unmatched.
Digging Deeper by Category
The Sub‑10‑B Local Elite
Cogito 8B delivered GPT‑3.5‑level prose while fitting entirely in the 8 GB of VRAM. Qwen 4B screamed ahead at 72 tok/s, the first model to feel real‑time on this hardware, and missed only a subtle Norwegian orthography rule.
Quantization at 27–32 B
Ollama can page quantized layers to RAM, but memory bandwidth caps speed at roughly 2 tok/s, which is too slow for interactive chat. If you truly need the extra reasoning depth, keep a cloud endpoint handy.
Tiny Titans (<2 B)
Qwen 0.6B’s 206 tok/s is nuts: tokens appear faster than I can blink, yet logical accuracy drops. Still, imagine coupling it to a retrieval pipeline for instant answers.
Remote Front‑Runners
All of the cloud APIs aced the test. Claude Sonnet’s language flow is sublime; GPT‑o3 is concise; DeepSeek V3 cites sources in the ethics section; Grok lands the best punch‑lines.
Speed Metrics Explained
Token throughput for local runs came from the timing summary that ollama run --verbose prints (courtesy of the embedded llama.cpp timers); cloud numbers came from streamed-chunk timestamps. Different tokenizers mean the absolute numbers wobble, but the thresholds (≥15 tok/s for chat‑level fluidity) hold.
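On the cloud side, the measurement boils down to timestamping streamed chunks and dividing token count by generation time, with the clock starting at the first chunk so that first‑token latency stays out of the number. Here is a minimal, provider‑agnostic sketch; the chunk iterator and the tokenizer callback are placeholders, not any vendor’s actual API.

```python
import time
from typing import Callable, Iterable

def measure_tok_per_s(chunks: Iterable[str], count_tokens: Callable[[str], int]) -> float:
    """Consume a stream of text chunks and return mean generation speed.

    `chunks` is any iterator yielding text deltas (e.g. from an SSE stream);
    `count_tokens` maps text -> token count, since every provider tokenizes
    differently. The clock starts at the first chunk, so first-token latency
    is excluded, matching how the table above reports speed.
    """
    start = None
    parts = []
    for chunk in chunks:
        if start is None:
            start = time.perf_counter()
        parts.append(chunk)
    if start is None:
        return 0.0
    elapsed = max(time.perf_counter() - start, 1e-9)
    return count_tokens("".join(parts)) / elapsed

# Toy usage with a crude whitespace tokenizer; swap in the provider's own
# tokenizer (or tiktoken) for real numbers.
speed = measure_tok_per_s(iter(["Hello ", "world ", "again"]), lambda s: len(s.split()))
print(f"{speed:.1f} tok/s")
```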
The Prompt’s Hidden Difficulty
- 700‑Word Ceiling: Forces brevity; small models burn their budget on introductions and lose.
- Orthography Edge Cases: Norwegian Bokmål versus Nynorsk trips 4 B and below.
- Section Isolation: Models must not leak chain‑of‑thought between sections; some open‑weights models did, incurring partials.
Qualitative Observations
Logic Puzzle Patterns
Bigger models answered “his son,” correctly reading “my father’s son” as the speaker himself (he has no siblings), which makes the man in the painting the speaker’s son. Smaller ones mis‑parsed “my father’s son.”
Code Generation
Claude produced a pristine Sieve of Eratosthenes; GPT‑o3 used a generator expression; Qwen 0.6B wrote O(n²) trial division but still squeaked under 1 s.
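For reference, a sieve along the lines the stronger models produced might look like the sketch below; this is my own minimal version with the requested n = 20 unit test, not a verbatim copy of any model’s answer.

```python
def primes_up_to(n: int) -> list[int]:
    """Return all primes in [1, n] using a Sieve of Eratosthenes.

    The sieve is O(n log log n), so it finishes far under the prompt's
    1-second budget for n = 10_000.
    """
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p starting at p*p as composite.
            is_prime[p * p :: p] = [False] * len(is_prime[p * p :: p])
    return [i for i, flag in enumerate(is_prime) if flag]

def test_primes_up_to_20():
    assert primes_up_to(20) == [2, 3, 5, 7, 11, 13, 17, 19]
```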
Multilingual Nuance
Gemma 4B‑q8’s Japanese translation used 「観測者によって」, chef’s kiss. Cogito 3B swapped “observer” for “spectator” in its Spanish.
Ethical Dilemma Tone
Cloud models displayed policy‑aligned caution; small locals quoted GDPR articles but lacked nuance.
Fiction Flavor
GPT‑o3’s 158‑word scene is haunting; Grok added existential mirth (“Newton resigned”). Phi‑mini hallucinated a flying cat and overshot the word count.
Performance vs. Power Draw
At 115 W, the RTX 3070 draws less than half the power of a desktop RTX 4090, yet those watts still matter on battery. In price terms, one hour of Cogito 8B inference costs roughly €0.02 in electricity, versus about €0.003 per prompt on GPT‑o3.
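That electricity figure is easy to sanity‑check; the €/kWh rate in the sketch below is my assumption (household rates vary widely), not a measured value.

```python
# Back-of-the-envelope check of the ~€0.02/hour figure.
gpu_watts = 115           # worst-case GPU TDP; sustained draw is usually lower
price_per_kwh = 0.20      # assumed European household rate in EUR (varies widely)

kwh_per_hour = gpu_watts / 1000          # 0.115 kWh for one hour of inference
cost_per_hour = kwh_per_hour * price_per_kwh
print(f"~EUR {cost_per_hour:.3f} per hour")  # ~EUR 0.023
```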
Lessons for Practitioners
- 8–9 GB VRAM Sweet‑Spot: Cogito 8B and Qwen 4B give near‑GPT‑3.5 results at laptop‑friendly speeds.
- Quantize Wisely: q4 saves VRAM but tanks tok/s; use q8 if you can.
- Consider Network Overhead: Cloud models may feel slower than their zero‑overhead tok/s implies.
- Prompt Discipline: Tight word limits expose verbosity flaws.
- Separate felt speed from raw throughput: 900 ms first‑token latency can erase a 20 tok/s edge.
Large but Nimble: Qwen3:30B
The standout newcomer is Qwen3:30B. Despite carrying a hefty 30 billion‑parameter fp16 payload, its mixed‑precision kernels and Ollama’s paged‑KV trickery mean the model still squeezes onto an 8 GB mobile GPU after shipping a few attention layers to the CPU. Power draw sits around 92 W sustained—well below the 115 W TDP ceiling—while throughput lands at roughly 7.5 tok/s, making it a full 3× faster than the other ≥30 B checkpoints we tried. Crucially, that rate keeps interactive latency under two seconds per chat turn, so the model never feels sluggish.
In terms of quality, Qwen3:30B matched the cloud elites: it untangled the museum riddle flawlessly, produced a vectorised Sieve of Eratosthenes with a pytest‑ready unit test, nailed all three translations (choosing Norwegian Bokmål instead of the lower‑resource Nynorsk), offered a GDPR‑aware ethical analysis, and penned a 160‑word antigravity vignette that hit the required sentence without overrunning the word cap. In short, Qwen3:30B delivers GPT‑3.5‑class reasoning while remaining just fast enough for everyday chat on mainstream laptop hardware, making it my new sweet‑spot pick for users who want maximum local capability without upgrading to a desktop‑class GPU.
What I’d Do Differently Next Time
- Add Vision tasks (cube images) to test multimodal.
- Instrument network latency in the score.
- Include BLEU and pytest coverage for translations and code.
- Run a 10‑stage reasoning chain.
- Measure thermal throttle decay on battery.
Conclusion
In 2025, a mid‑tier gaming laptop can host models that would have stunned us two years ago. Ollama abstracts away C‑flags and GPU memory maps so cleanly that swapping checkpoints is a coffee‑break hobby. The Polymath Challenge shows that, for many daily tasks, the gap between local and cloud is now more about patience than possibility. Need perfection on the first draft? Fire the API. Need privacy or offline resilience? Spin up Cogito 8B and marvel at what eight gigabytes of VRAM can do. Either way, curiosity and open benchmarks keep this ecosystem honest.
Downloads
If you want to see every model’s full answers, download polymath_llm_outputs.md and take a deep dive.