DeepSeek R1 and V3 vs Qwen3 - Why 671-Billion Parameters Still Miss the Mark on Instruction Fidelity

Spend any time on r/LocalLLM and you’ll notice a paradox: models that ace logic puzzles often fumble the simple act of doing exactly what you told them. My latest round of trials, sparked by the 2025 Polymath “single-prompt” showdown, drove that home. On paper, DeepSeek’s two flagship checkpoints, R1-671 B and V3-671 B (both served remotely on Novita.ai), achieved flawless task-completion scores. In practice, when I asked them to crank out long, tag-wrapped content for this very blog, they returned barely 30 % of the requested length and mangled the markup. By contrast, every Qwen3 tier, from the feather-weight 8 B to the remote 235 B, followed my instructions to the letter.
This article unpacks that gap, shares my raw data (successes and spectacular failures), and offers pragmatic work-arounds while we wait for the rumored DeepSeek V4 and R2 refreshes said to drop “within weeks.”

Table of Contents

  • TL;DR
  • Polymath Recap & Why It Isn’t the Whole Story
  • My 2 000-Word & 10 000-Word Stress Tests
  • How 671 B DeepSeek Models Fall Short
  • Qwen3’s Secret Sauce (and Its Warts)
  • Mitigation Tactics (Hypothetical & Proven)
  • Rumor Watch: DeepSeek V4/R2
  • Methodology & Hardware Notes
  • Key Take-Aways for Local-LLM Builders

TL;DR

  • DeepSeek R1-671 B and V3-671 B nail reasoning tasks but routinely ignore explicit format or length constraints.
  • Qwen3 (8 B → 235 B) obeys instructions out-of-the-box, even on a single RTX 3070, though the 30 B-A3B variant hallucinated once in a 10 000-word test (details below).
  • If your pipeline needs precise word counts or tag wrappers, use Qwen3 today; keep DeepSeek for creative ideation unless you’re ready to babysit it with chunked prompts or regex post-processing.
  • Rumor mill says DeepSeek V4 and R2 will land shortly; worth re-testing when they do.

Polymath Recap & Why It Isn’t the Whole Story

First, a quick refresher. The Polymath Show-Down challenged 22 models to solve five heterogeneous tasks (logic, coding, translation, ethics, micro-fiction) inside a 700-word budget. Speed bonuses and penalties encouraged efficiency:


Model                 Tasks Passed   Speed (t/s)   Total Score
Cogito 8 B            5              28            11
Qwen3 8 B             4              72            11
DeepSeek R1 671 B     5              0.7           10
DeepSeek V3 671 B     5              0.9           10
Qwen3 235 B           5              1.3           9

On that short, mixed prompt the DeepSeek giants looked stellar, losing points only to their glacial token-per-second rates. Yet Polymath stops scoring at 700 words. Most content creators, especially bloggers and fiction writers, need an order of magnitude more text, plus strict structure. That’s where things diverge.


My 2 000-Word & 10 000-Word Stress Tests

To see how models perform under realistic editorial constraints, I ran two single-shot prompts:

  • Medium-Form Test – “Write exactly 2 000 words explaining why local LLMs matter in 2025, using the QSTag syntax below.” (A rough illustration of the tag format follows this list.)
  • Long-Form Test – “Write a 10 000-word high-fantasy story in eight chapters, each headed by #H2…#EH2 tags, then embed a summary list.”
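
For readers who haven’t met QSTag before: it’s a set of #-prefixed markers that wrap document structure. I’m not reproducing the full schema from my prompts here, but judging from the tags that appear throughout this article (#ART, #BD, #H2…#EH2, #UL/#LI), a compliant skeleton looks roughly like the sketch below; treat the closing-tag names other than #EH2 as my own guess at the pattern.

    #ART
    #H2 Chapter One: The Ashen Road #EH2
    #BD
    Body paragraphs go here ...
    #UL
    #LI First summary point
    #LI Second summary point
    #EUL
    #EBD
    #EART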

Results – Medium-Form (2 000 Words)

  • DeepSeek R1 671 B: 537 words, stopped mid-section; printed “#BD” without a closing tag.
  • DeepSeek V3 671 B: 491 words, ended after a single paragraph; no closing #ART.
  • Qwen3 8 B: 2 018 words, perfect tag pairs, 0.9 % over target.
  • Qwen3 30 B-A3B (local): 1 997 words, flawless formatting.
  • Qwen3 235 B (remote): 2 003 words, minor diacritic glitch in a Spanish phrase.

Results – Long-Form (10 000 Words)

  • DeepSeek R1: 938 words, cut off in Chapter 3 with no error message.
  • DeepSeek V3: 814 words, stopped halfway through Chapter 2 and ended with “(continued)”.
  • Qwen3 8 B: 9 721 words, chapters intact, two bullet mismatches in summary.
  • Qwen3 30 B-A3B: ≈9 800 words, but one run entered a repetitive loop, rewriting the same paragraph until I killed the Ollama server (Open-WebUI v0.6.10). Subsequent runs behaved.
  • Qwen3 235 B: 10 123 words, full compliance.

The takeaway is stark: DeepSeek’s towering parameter counts do not translate to obedience on extended tasks, whereas Qwen3 behaves predictably even in its smallest form, though no model is entirely immune to edge-case hallucinations.


How 671 B DeepSeek Models Fall Short

  • Word-Count Myopia – Above ~700 tokens, both checkpoints self-truncate, ignoring explicit targets.
  • Tag Corruption – Leaving #BD unclosed, merging #UL/#LI lines into paragraphs, or escaping “#LT” and “#GT” incorrectly.
  • Section Blindness – Instructions like “H1 > H3 > UL > code” sometimes yield H1 followed by a blob of prose.
  • Silent Stop Tokens – The sampler quietly emits EOS with no apology; downstream scripts can’t detect failure.

Interestingly, none of these issues affect DeepSeek’s logical content. If you only need reasoning (say, generating test cases, SQL queries, or poetry without strict length constraints), they’re fine. The failure is purely instructional.


Qwen3’s Secret Sauce (and Its Warts)

Why does Qwen3 pass? Based on model-card notes, code comments, and my own diffing experiments, three factors stand out:

  • Alignment Data Bias – Qwen3’s supervised fine-tuning leans heavily on instruction-rich datasets (MailChimp templates, StackOverflow answers, textbook outlines).
  • Tag-Aware Tokenizer – Tokens like “#H2” become single IDs, reducing the risk of mid-token splits that break markup (a quick check appears after this list).
  • Token-Budget Heuristics – Qwen3 appears to estimate word targets, then self-monitor remaining budget before emitting EOS.
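
Curious about the tokenizer claim, here’s a minimal sketch for checking how these markers get split, assuming you have the Hugging Face transformers library installed; the repo id Qwen/Qwen3-8B and the tag list are my assumptions, not something lifted from the model card.

    # Minimal sketch: inspect how a tokenizer splits QSTag-style markers.
    # Assumes the transformers library and the Qwen/Qwen3-8B checkpoint.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    for tag in ["#H2", "#EH2", "#BD", "#UL", "#LI"]:
        ids = tok.encode(tag, add_special_tokens=False)
        # Fewer pieces means less chance of a mid-tag split corrupting the markup.
        print(tag, "->", tok.convert_ids_to_tokens(ids))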

Still, the one-off loop incident with Qwen3 30 B-A3B is a reminder that alignment isn’t perfection; always keep validation scripts running.


Mitigation Tactics (Hypothetical & Proven)

While I have not fine-tuned any model myself, here are tactics I did try, plus a few hypothetical upgrades enthusiasts might explore once they have suitable hardware:

  • System-Prompt Reinforcement (Proven) – Prepend: “You must output exactly 2 000 ± 50 words in the provided QSTag schema. Think step-by-step silently before answering.” Compliance rose from ~20 % to ~55 % on R1.
  • Chunk-by-Chunk Generation (Proven) – Ask DeepSeek for Chapters 1-2 first, then 3-4, etc. Smaller word budgets finish reliably, though you must stitch parts together.
  • Temperature & Top-p Tweaks (Proven) – Dropping temperature to 0.2 and top-p to 0.7 reduced premature EOS but also lowered creativity.
  • Regex Post-Checks (Proven) – Run a validator; if it fails, re-prompt with “You produced 517 words; please continue to reach 2 000.” DeepSeek usually complies on the second pass (see the sketch after this list).
  • LoRA Fine-Tuning (Hypothetical) – A small, targeted LoRA on QSTag examples could teach DeepSeek obedience; community experiments suggest success with as little as 8 GB VRAM per card.
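
To make the “proven” tactics concrete, here’s a minimal sketch that combines them: a reinforced system prompt, lowered temperature/top-p, a rough word count, and an automatic re-prompt when the draft falls short. It assumes an OpenAI-compatible endpoint (Ollama’s local /v1 route or a hosted API such as Novita.ai); the base URL, API key, model name, and tolerances are placeholders, and full tag validation is left to the scorer sketched in the Methodology section.

    # Hedged sketch: system-prompt reinforcement + low temperature + word-count check
    # + re-prompt loop. The endpoint, API key, and model name are placeholders.
    import re
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    SYSTEM = ("You must output exactly 2 000 ± 50 words in the provided QSTag schema. "
              "Think step-by-step silently before answering.")

    def word_count(text: str) -> int:
        # Strip #-style tags before counting, roughly mirroring wc -w on cleaned output.
        return len(re.sub(r"#\w+", " ", text).split())

    def generate(prompt: str, target: int = 2000, tol: int = 50,
                 max_passes: int = 3, model: str = "deepseek-r1") -> str:
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": prompt}]
        draft = ""
        for _ in range(max_passes):
            resp = client.chat.completions.create(model=model, messages=messages,
                                                  temperature=0.2, top_p=0.7)
            chunk = resp.choices[0].message.content
            draft += chunk
            n = word_count(draft)
            if abs(n - target) <= tol:
                break
            # Re-prompt with the running total, per the regex post-check tactic above.
            messages.append({"role": "assistant", "content": chunk})
            messages.append({"role": "user",
                             "content": f"You produced {n} words so far; "
                                        f"please continue to reach {target}."})
        return draft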

Rumor Watch: DeepSeek V4 & R2

Multiple Discord insiders claim that DeepSeek V4 and R2, both still at the 671 B scale, are slated for release “within the next few weeks.” Allegedly, the dev team has re-weighted instruction data and introduced longer context windows (128 k tokens mentioned). If true, we may soon see a DeepSeek that retains its reasoning prowess while finally respecting your word-count limits. I’ll rerun all tests the day those checkpoints drop and report back.


Methodology & Hardware Notes

  • Host Rig (Local): Lenovo Legion 5 – Ryzen 7 5800H, 32 GB RAM, RTX 3070 8 GB, Mageia Cauldron rolling release.
  • Frontend: Open-WebUI v0.6.10 for local experiments; Ollama for model serving.
  • Remote API: Novita.ai for DeepSeek R1/V3 671 B and Qwen3 235 B. Zero perceptible lag or rate limits during testing.
  • Local Models: Qwen3 8 B, 14 B, 30 B-A3B, 32 B fp16 via Ollama (2025-05-18 hashes).
  • Decoding Defaults: temp 0.7, top-p 0.95, top-k 40, presence 0, frequency 0. Where noted, I lowered temp/top-p for compliance tests.
  • Scoring Script: Word counts via wc -w on HTML-stripped output; tag validation with Python regex; pass when within ±2 % word target and all tags closed (sketched below).
  • Data Access: The 2025 Polymath LLM Show-Down: How Twenty‑Two Models Fared Under a Single Grueling Prompt
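
For completeness, here’s a minimal Python sketch of what that scoring step boils down to: tag-stripped word counting (standing in for wc -w), a naive opened-versus-closed tag check, and a pass verdict at ±2 % of target. The file layout and the assumption that every opening tag #X closes as #EX are mine, not part of the original harness.

    # Hedged sketch of the pass/fail scorer: word count on tag-stripped text,
    # regex tag validation, pass when within ±2 % of target and all tags closed.
    # Usage: python score.py output.txt 2000
    import pathlib
    import re
    import sys

    def score(path: pathlib.Path, target: int) -> bool:
        text = path.read_text()
        words = len(re.sub(r"#\w+", " ", text).split())   # strip QSTag markers, count words
        opened = re.findall(r"#(?!E)(\w+)", text)          # opening tags, e.g. #H2, #BD
        closed = re.findall(r"#E(\w+)", text)              # closing tags, e.g. #EH2
        within = abs(words - target) <= 0.02 * target
        balanced = sorted(opened) == sorted(closed)
        print(f"{path.name}: {words} words | within ±2 %: {within} | tags closed: {balanced}")
        return within and balanced

    if __name__ == "__main__":
        sys.exit(0 if score(pathlib.Path(sys.argv[1]), int(sys.argv[2])) else 1)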

Key Take-Aways for Local-LLM Builders

  • Correctness ≠ Compliance – A model can solve logic puzzles yet still ignore simple constraints.
  • Bigger Params Don’t Fix Obedience – Two 671 B giants lost to an 8 B sprite because alignment, not size, governs instruction following.
  • Prompt Engineering Matters, Up to a Point – You can nearly double DeepSeek’s compliance with stricter system prompts, but you’ll still trail Qwen3.
  • Automate Validation – Always run a post-generation script to count words, check tags, and detect silent truncation.
  • Stay Agile – With V4/R2 on the horizon, keep your benchmarking harness ready; today’s loser may be tomorrow’s champ.

Final Thoughts

DeepSeek R1 and V3 remind me of brilliant students who ace the hardest exam questions but ignore the essay word-limit box. Qwen3 is that meticulous classmate who reads the rubric twice and hits every checkpoint, though even the honor student had one late-night meltdown. Until DeepSeek’s rumored refresh ships and proves itself, Qwen3 remains my daily driver for anything where exact adherence matters. If your work leans more on open-ended creativity and raw reasoning, keep DeepSeek in your toolbox, but wrap its output in guardrails and checks. Happy prompting, and let me know on Reddit how the next generation turns out!