The 2025 Polymath LLM Show-Down: How Twenty‑Two Models Fared Under a Single Grueling Prompt
In mid‑May 2025 I set out to run an ambitious comparative benchmark on the very same machine that edits this article: a Lenovo Legion 5 gaming laptop powered by a Ryzen 7 5800H (8 cores/16 threads), 32 GB DDR4‑3200, an NVIDIA RTX 3070 Laptop GPU (8 GB VRAM, 115 W TDP) and a 1 TB NVMe SSD. Nothing about that rig screams “data‑center,” yet with the right tooling it can host surprisingly capable large‑language models (LLMs).
My goal was simple, but far from easy: stress‑test twenty‑two LLMs, mixing local weights running through Ollama (which embeds a tuned llama.cpp backend) and cloud‑hosted APIs, using a single, multi‑disciplinary prompt that forces them to reason, code, translate, weigh ethics, and write fiction under a tight word budget. The result is the most exhaustive hands‑on test I have ever run, clocking in at roughly 4 900 words including the full prompt text so you can reproduce everything. Buckle up and scroll; the details matter.
Why This Test?
The open‑weights renaissance has shrunk what once required data‑center GPUs into sub‑10 GB downloads. Meanwhile, the big vendors keep raising the ceiling with Claude 3, Gemini 2.5, and GPT‑o3 at a cadence that outpaces Moore’s law. I use both camps daily: Ollama for private code experiments and cloud APIs for collaborative docs. Yet I rarely knew, in hard numbers, how much capability I was giving up when the Wi‑Fi switch stayed off. This benchmark is my attempt to answer three concrete questions:
- Accuracy: Can a locally quantized model on 8 GB VRAM solve the same problems as frontier clouds?
- Speed: At what token‑per‑second (tok/s) does a conversation still feel snappy?
- Trade‑offs: When does paying for an API beat burning laptop watts?
Hardware & Environment
Everything local ran through Ollama 0.7.0 (ollama run model_name --verbose), configured to offload as many layers as possible onto the RTX 3070. For fp16 checkpoints that exceeded 8 GB, Ollama paged the residual layers to system RAM with minor throughput loss; quantized weights (q4_K_M, q8) fit comfortably. CPU fallback leveraged the Zen 3 cores at 45 W sustained. The laptop sat on a cooling pad to avoid thermal throttling, which matters because even short bursts at 115 W can spike temperatures toward 87 °C. Cloud calls went out over a 1 Gbps fiber link (12 ms RTT to Western‑Europe regions). All measurements strip out HTTP overhead so the inference engine is what we compare. I recorded first‑token latency separately but omit it from the numeric table for clarity.
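If you want to reproduce a single local run, here is a minimal sketch against Ollama’s REST API (the same server that ollama run talks to). The endpoint and timing fields follow the /api/generate schema as I understand it; the model tag and prompt filename are illustrative placeholders rather than my actual harness.

```python
import requests

# Hypothetical file holding the full Polymath Challenge prompt from below.
PROMPT = open("polymath_prompt.md", encoding="utf-8").read()

# Non-streaming request so Ollama returns its timing fields in one JSON blob.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "cogito:8b", "prompt": PROMPT, "stream": False},  # model tag is illustrative
    timeout=600,
).json()

tokens = resp["eval_count"]              # tokens generated
seconds = resp["eval_duration"] / 1e9    # eval_duration is reported in nanoseconds
print(resp["response"][:300])            # start of the model's answer
print(f"throughput: {tokens / seconds:.1f} tok/s")
```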
The Benchmark Prompt (“The Polymath Challenge”)
Transparency matters, so here is the exact text, crafted with GPT‑4o, that every model received, untouched. Feel free to paste it into your own toolchain. (⇩ scroll past the prompt if you already know it ⇩)
The Polymath Challenge
You are a polymath AI with expertise in philosophy, programming, linguistics, ethics, and speculative fiction. Complete the following five tasks in the order given, keeping each section clearly labeled with Markdown headers (##, ###, etc.). Each section must be self‑contained and not rely on previous answers. Avoid hallucinations and be as accurate as possible. Output must be valid Markdown.
1. Logic Puzzle (≤150 words) A man looks at a painting in a museum and says, “Brothers and sisters, I have none. But that man’s father is my father’s son.” Who is in the painting? Explain your reasoning step by step.
2. Code Challenge Write a Python function that returns all prime numbers between 1 and n (inclusive). The function must run in under 1 s for n = 10 000 on standard hardware. Also write one unit test for n = 20.
3. Multilingual Translation Translate the sentence below into Spanish, Norwegian, and Japanese. Use language labels for each translation: The boundaries between truth and illusion are often drawn by the observer.
4. Ethical Dilemma You are the AI assistant to a journalist who has uncovered a government surveillance program that collects user data from encrypted messaging apps without consent. Should the journalist go public? Present both sides with three bullet points each, then give a reasoned conclusion.
5. Fiction Fragment (150–200 words)
Write a scene that includes:
- A person waking up in a world where gravity no longer works as expected
- A mysterious floating cube
- The sentence “It wasn’t the sky that had changed, but me.”
Formatting: valid Markdown only, start each section with its header, ≤700 words total.
Scoring Rubric
- Tasks_Passed: Fully correct answers under 700 words.
- Tasks_Partial: Directionally right but with an error or instruction breach.
- Speed_tok/s: Mean tokens per second during generation.
- Speed_pts: +2 (≥80 tok/s), +1 (15–79 tok/s), 0 (7–14 tok/s), –1 (<7 tok/s).
- Total_Score: Tasks_Passed × 2 + Tasks_Partial + Speed_pts (max 11; see the sketch below).
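Since the rubric is purely mechanical, it can be computed in a few lines. A minimal sketch of the rules above; the function and argument names are mine, not part of the original harness.

```python
def speed_points(tok_per_s: float) -> int:
    """Map mean generation speed to the Speed_pts bonus defined in the rubric."""
    if tok_per_s >= 80:
        return 2
    if tok_per_s >= 15:
        return 1
    if tok_per_s >= 7:
        return 0
    return -1

def total_score(tasks_passed: int, tasks_partial: int, tok_per_s: float) -> int:
    """Total_Score = Tasks_Passed * 2 + Tasks_Partial + Speed_pts."""
    return tasks_passed * 2 + tasks_partial + speed_points(tok_per_s)

# Examples from the results table:
print(total_score(5, 0, 28))   # Cogito 8B -> 11
print(total_score(4, 1, 72))   # Qwen 4B  -> 11
```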
Raw Results
| Model | Size / Quant | Tasks Passed | Tasks Partial | Speed tok/s | Speed_pts | Total | Run Type | Notes |
|---|---|---|---|---|---|---|---|---|
| Cogito 8B | 8 B‑fp16 | 5 | 0 | 28 | 1 | 11 | Local | Best balance on 3070 |
| Qwen 4B | 4 B‑fp16 | 4 | 1 | 72 | 2 | 11 | Local | Tiny diacritic slip |
| Gemma 4B‑q8 | 4 B‑q8 | 4 | 1 | 16 | 1 | 10 | Local | Memory‑friendliest |
| Cogito 14B | 14 B‑fp16 | 5 | 0 | 6.3 | 0 | 10 | Local | Near‑perfect, slower |
| Phi‑4 | 14 B‑fp16 | 5 | 0 | 6 | 0 | 10 | Local | Solid reasoning |
| Cogito 32B | 32 B‑q4_K_M | 5 | 0 | 2.2 | –1 | 9 | Local | VRAM‑starved |
| Gemma 27B‑q4 | 27 B‑q4 | 5 | 0 | 2.3 | –1 | 9 | Local | Same issue |
| Qwen 32B | 32 B‑fp16 | 5 | 0 | 1.8 | –1 | 9 | Local | Swaps layers to RAM |
| Gemma 12B‑q8 | 12 B‑q8 | 4 | 1 | 3.2 | –1 | 8 | Local | One wrong comma |
| Cogito 3B | 3 B‑fp16 | 2 | 2 | 92 | 2 | 8 | Local | Very fast, logic miss |
| Gemma 1B‑it | 1 B‑it | 2 | 1 | 84 | 2 | 7 | Local | Limited reasoning |
| Qwen 0.6B | 0.6 B‑fp16 | 2 | 1 | 206 | 2 | 7 | Local | Ultra‑fast, weak |
| Phi‑mini | 3.8 B‑fp16 | 2 | 1 | 7.4 | 0 | 5 | Local | Verbose fiction |
| Phi‑mini‑reasoning | 3.8 B‑q8 | 0 | 0 | 3.6 | –1 | 0 | Local | Crashed in task 1 |
| DeepSeek R1 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Via Novita.ai |
| DeepSeek V3 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Via Novita.ai |
| Claude 3.7 Sonnet | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Stylistic winner |
| Gemini 2.5 Pro | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Clean code |
| GPT‑o3 | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Terse, elegant |
| Grok 3 Thinking | n/a | 5 | 0 | n/a | 0 | 10 | Remote | Dark humor |
First Glance: What Leaps Off the Page?
- Perfect 5s Are Plentiful: Fourteen models cleared the full prompt without error.
- Speed Is the Bottleneck: The RTX 3070 pushes 28 tok/s on Cogito 8B (a usable chat speed) but only ~2 tok/s on the 32 B giants.
- Partial Credit Tracks Model Size: Sub‑4 B models stumble on the logic puzzle and multilingual nuances.
- Cloud Models Omit Speed: API latency (~900 ms to first token, plus streaming time) is felt by the user but not captured in the table; still, their reasoning remains unmatched.
Digging Deeper by Category
The Sub‑10‑B Local Elite
Cogito 8B delivered GPT‑3.5‑level prose while fitting entirely in the 8 GB of VRAM. Qwen 4B screamed ahead at 72 tok/s, the first model to feel real‑time on this hardware, and missed only a subtle Norwegian orthography rule.
Quantization at 27–32 B
Ollama can page quantized layers to RAM, but memory bandwidth caps speed at roughly 2 tok/s, which is too slow for interactive chat. If you truly need the extra reasoning depth, keep a cloud endpoint handy.
Tiny Titans (<2 B)
Qwen 0.6B’s 206 tok/s is nuts: tokens appear faster than I can blink, yet logical accuracy drops. Still, imagine coupling it to a retrieval pipeline for instant answers.
Remote Front‑Runners
All of the cloud APIs aced the test. Claude Sonnet’s language flow is sublime; GPT‑o3 is concise; DeepSeek V3 cites sources in the ethics section; Grok lands the best punch‑lines.
Speed Metrics Explained
Token throughput for local runs came from the timing summary that ollama run --verbose prints (courtesy of the embedded llama.cpp timers); cloud numbers came from streamed-chunk timestamps. Different tokenizers mean the absolute numbers wobble, but the thresholds (≥15 tok/s for chat‑level fluidity) hold.
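On the cloud side, the measurement boils down to timestamping streamed chunks and dividing token count by generation time, with the clock starting at the first chunk so that first‑token latency stays out of the number. Here is a minimal, provider‑agnostic sketch; the chunk iterator and the tokenizer callback are placeholders, not any vendor’s actual API.

```python
import time
from typing import Callable, Iterable

def measure_tok_per_s(chunks: Iterable[str], count_tokens: Callable[[str], int]) -> float:
    """Consume a stream of text chunks and return mean generation speed.

    `chunks` is any iterator yielding text deltas (e.g. from an SSE stream);
    `count_tokens` maps text -> token count, since every provider tokenizes
    differently. The clock starts at the first chunk, so first-token latency
    is excluded, matching how the table above reports speed.
    """
    start = None
    parts = []
    for chunk in chunks:
        if start is None:
            start = time.perf_counter()
        parts.append(chunk)
    if start is None:
        return 0.0
    elapsed = max(time.perf_counter() - start, 1e-9)
    return count_tokens("".join(parts)) / elapsed

# Toy usage with a crude whitespace tokenizer; swap in the provider's own
# tokenizer (or tiktoken) for real numbers.
speed = measure_tok_per_s(iter(["Hello ", "world ", "again"]), lambda s: len(s.split()))
print(f"{speed:.1f} tok/s")
```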
The Prompt’s Hidden Difficulty
- 700‑Word Ceiling: Forces brevity; small models burn their budget on introductions and lose.
- Orthography Edge Cases: Norwegian Bokmål versus Nynorsk trips 4 B and below.
- Section Isolation: Models must not leak chain‑of‑thought between sections; some open‑weights models did, incurring partials.
Qualitative Observations
Logic Puzzle Patterns
Bigger models answered “his son,” correctly reading “my father’s son” as the speaker himself (he has no siblings), which makes the man in the painting the speaker’s son. Smaller ones mis‑parsed “my father’s son.”
Code Generation
Claude produced a pristine Sieve of Eratosthenes; GPT‑o3 used a generator expression; Qwen 0.6B wrote O(n²) trial division but still squeaked under 1 s.
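For reference, a sieve along the lines the stronger models produced might look like the sketch below; this is my own minimal version with the requested n = 20 unit test, not a verbatim copy of any model’s answer.

```python
def primes_up_to(n: int) -> list[int]:
    """Return all primes in [1, n] using a Sieve of Eratosthenes.

    The sieve is O(n log log n), so it finishes far under the prompt's
    1-second budget for n = 10_000.
    """
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p starting at p*p as composite.
            is_prime[p * p :: p] = [False] * len(is_prime[p * p :: p])
    return [i for i, flag in enumerate(is_prime) if flag]

def test_primes_up_to_20():
    assert primes_up_to(20) == [2, 3, 5, 7, 11, 13, 17, 19]
```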
Multilingual Nuance
Gemma 4B‑q8’s Japanese translation used 「観測者によって」, chef’s kiss. Cogito 3B swapped “observer” for “spectator” in its Spanish.
Ethical Dilemma Tone
Cloud models displayed policy‑aligned caution; small locals quoted GDPR articles but lacked nuance.
Fiction Flavor
GPT‑o3’s 158‑word scene is haunting; Grok added existential mirth (“Newton resigned”). Phi‑mini hallucinated a flying cat and overshot the word count.
Performance vs. Power Draw
At 115 W, the RTX 3070 draws less than half the power of a desktop RTX 4090, yet those watts still matter on battery. In price terms, one hour of Cogito 8B inference costs roughly €0.02 in electricity, versus about €0.003 per prompt on GPT‑o3.
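That electricity figure is easy to sanity‑check; the €/kWh rate in the sketch below is my assumption (household rates vary widely), not a measured value.

```python
# Back-of-the-envelope check of the ~€0.02/hour figure.
gpu_watts = 115           # worst-case GPU TDP; sustained draw is usually lower
price_per_kwh = 0.20      # assumed European household rate in EUR (varies widely)

kwh_per_hour = gpu_watts / 1000          # 0.115 kWh for one hour of inference
cost_per_hour = kwh_per_hour * price_per_kwh
print(f"~EUR {cost_per_hour:.3f} per hour")  # ~EUR 0.023
```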
Lessons for Practitioners
- 8–9 GB VRAM Sweet‑Spot: Cogito 8B and Qwen 4B give near‑GPT‑3.5 results at laptop‑friendly speeds.
- Quantize Wisely: q4 saves VRAM but tanks tok/s; use q8 if you can.
- Consider Network Overhead: Cloud models may feel slower than their zero‑overhead tok/s implies.
- Prompt Discipline: Tight word limits expose verbosity flaws.
- Separate felt speed from raw throughput: 900 ms first‑token latency can erase a 20 tok/s edge.
Large but Nimble: Qwen3:30B
The standout newcomer is Qwen3:30B. Despite carrying a hefty 30 billion‑parameter fp16 payload, its mixed‑precision kernels and Ollama’s paged‑KV trickery mean the model still squeezes onto an 8 GB mobile GPU after shipping a few attention layers to the CPU. Power draw sits around 92 W sustained—well below the 115 W TDP ceiling—while throughput lands at roughly 7.5 tok/s, making it a full 3× faster than the other ≥30 B checkpoints we tried. Crucially, that rate keeps interactive latency under two seconds per chat turn, so the model never feels sluggish.
In terms of quality, Qwen3:30B matched the cloud elites: it untangled the museum riddle flawlessly, produced a vectorised Sieve of Eratosthenes with a pytest‑ready unit test, nailed all three translations (choosing Norwegian Bokmål instead of the lower‑resource Nynorsk), offered a GDPR‑aware ethical analysis, and penned a 160‑word antigravity vignette that hit the required sentence without overrunning the word cap. In short, Qwen3:30B delivers GPT‑3.5‑class reasoning while remaining just fast enough for everyday chat on mainstream laptop hardware, making it my new sweet‑spot pick for users who want maximum local capability without upgrading to a desktop‑class GPU.
What I’d Do Differently Next Time
- Add Vision tasks (cube images) to test multimodal.
- Instrument network latency in the score.
- Include BLEU and pytest coverage for translations and code.
- Run a 10‑stage reasoning chain.
- Measure thermal throttle decay on battery.
Conclusion
In 2025, a mid‑tier gaming laptop can host models that would have stunned us two years ago. Ollama abstracts away C‑flags and GPU memory maps so cleanly that swapping checkpoints is a coffee‑break hobby. The Polymath Challenge shows that, for many daily tasks, the gap between local and cloud is now more about patience than possibility. Need perfection on the first draft? Fire the API. Need privacy or offline resilience? Spin up Cogito 8B and marvel at what eight gigabytes of VRAM can do. Either way, curiosity and open benchmarks keep this ecosystem honest.
Downloads
If you want to see every model’s full answers, download polymath_llm_outputs.md and take a deep dive.