For Coding
Treat public benchmarks as a directional signal, not a verdict: your repo, tools, and constraints matter more. Prefer real‑world, repo‑level evals (SWE‑bench Verified, LiveCodeBench) over narrow function‑level tests.
Recommended Coding Models (with evidence)
- Anthropic Claude 3.5/3.7 Sonnet
  - Why: Strong on real‑world coding and agentic tool use.
  - Evidence: 49% on SWE‑bench Verified (SOTA at time of post) [1]; 93.7% on HumanEval (class‑leading at time of post) [2].
  - Use for: Complex refactors, issue resolution, code reviews (see the call sketch after this list).
- xAI Grok Code Fast 1
  - Why: Tuned for fast agentic coding with competitive quality.
  - Evidence: 70.8% on SWE‑bench Verified (internal harness, vendor‑reported) [3].
  - Use for: Fast prototyping, rapid iterations, CLI/code scaffolding.
- Qwen 2.5/3 Coder (family)
  - Why: Strong open‑source coder models across sizes; good latency/cost balance.
  - Evidence: Technical report shows gains on HumanEval/MBPP [4]; competitive open‑source standings on LiveCodeBench [5].
  - Use for: OSS‑first stacks, local workflows, budget‑conscious coding.
- OpenAI o4‑mini / o3
  - Why: Reasoning‑focused models that pair well with tools.
  - Evidence: OpenAI reports leading AIME 2024/2025 results and strong reasoning performance; positioned for coding and tool use [7].
  - Use for: Multi‑step coding plans, orchestration, complex analysis.
- Gemini 2.5 Pro
  - Why: Long‑context and multimodal strengths; improved coding focus.
  - Evidence: Google/DeepMind posts highlight state‑of‑the‑art reasoning and coding improvements [8].
  - Use for: Long doc/code summarization, design discussions, planning.
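Most of these models sit behind similar chat‑style vendor SDKs. As a concrete starting point, here is a minimal code‑review call sketch using the `anthropic` Python SDK; the model ID and prompt are illustrative, not a fixed recommendation.

```python
# Minimal code-review call sketch; the model ID is illustrative --
# check your provider's current model list before relying on it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative alias
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Review this diff for correctness and risk:\n"
                   + open("change.diff").read(),
    }],
)
print(message.content[0].text)
```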
Quick Selection by Use Case
| Task | Suggested Models | Notes |
|---|---|---|
| New feature (greenfield) | Qwen 3 Coder, Grok Code Fast 1 | Fast iterations; cost‑aware |
| Bug fix (targeted) | Claude 3.5/3.7 Sonnet, Qwen 3 Coder | Strong retrieval + precise edits |
| Refactor (no behavior change) | Claude 3.5/3.7 Sonnet, o4‑mini | Ask for tests; small diffs |
| Code review | Claude 3.5/3.7 Sonnet | Review narrative and risk analysis |
| Large context summarization | Gemini 2.5 Pro | Long docs/specs/designs |
| Agentic workflows (tools) | Claude 3.5/3.7 Sonnet, Grok Code Fast 1 | Pairs well with Cascade tools |
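If you drive selection programmatically, the matrix above reduces to a small routing map. A minimal sketch in Python, assuming placeholder model IDs (substitute the exact IDs your provider exposes):

```python
# Routing table mirroring the matrix above; model IDs are placeholders.
ROUTES: dict[str, list[str]] = {
    "greenfield": ["qwen3-coder", "grok-code-fast-1"],
    "bugfix": ["claude-3-7-sonnet", "qwen3-coder"],
    "refactor": ["claude-3-7-sonnet", "o4-mini"],
    "review": ["claude-3-7-sonnet"],
    "summarize": ["gemini-2.5-pro"],
    "agentic": ["claude-3-7-sonnet", "grok-code-fast-1"],
}

def pick_model(task: str, attempt: int = 0) -> str:
    """Return the preferred model for a task, falling back on retries."""
    candidates = ROUTES.get(task, ROUTES["bugfix"])
    return candidates[min(attempt, len(candidates) - 1)]
```

Keeping the fallback order explicit in one table makes the escalation path easy to audit and tune.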
Tips
- Benchmark fit: Prefer SWE‑bench Verified and LiveCodeBench for repo‑level coding; use HumanEval+/MBPP+ for function‑level generation [6].
- Iterate: Start with faster, cost‑efficient models; escalate to heavier reasoning only when needed (see the sketch after these tips).
- Guardrails: Always run tests/linters after edits; keep diffs small and revertible.
- Context: Use @‑mentions and Planning Mode to keep tasks scoped and reproducible.
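The "Iterate" and "Guardrails" tips compose into a simple escalation loop: ask the cheaper model first, verify with the test suite, and only reach for a heavier reasoner when the guardrail fails. A minimal sketch, assuming an OpenAI‑compatible chat API and pytest as the test runner; the model names and `apply_patch` hook are hypothetical placeholders:

```python
# Cost-aware escalation loop: cheap model first, verify, then escalate.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ESCALATION = ["fast-cheap-model", "heavy-reasoning-model"]  # hypothetical IDs

def propose_patch(model: str, task: str) -> str:
    """Ask the model for a minimal unified diff for the task."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Produce a minimal unified diff:\n{task}"}],
    )
    return resp.choices[0].message.content

def tests_pass() -> bool:
    """Guardrail: run the test suite after every applied edit."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def solve(task: str, apply_patch) -> bool:
    """apply_patch is your VCS glue (hypothetical); keep each diff revertible."""
    for model in ESCALATION:
        apply_patch(propose_patch(model, task))
        if tests_pass():
            return True
    return False  # revert and hand off to a human
```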
Sources
- [1] Anthropic: Claude 3.5 Sonnet — 49% on SWE‑bench Verified — https://www.anthropic.com/research/swe-bench-sonnet
- [2] Anthropic: Claude 3.5 Sonnet (HumanEval 93.7%, coding improvements) — https://www.anthropic.com/news/github-copilot
- [3] xAI: Grok Code Fast 1 — 70.8% on SWE‑bench Verified (internal harness) — https://x.ai/news/grok-code-fast-1
- [4] Qwen2.5‑Coder Technical Report — https://arxiv.org/pdf/2409.12186
- [5] LiveCodeBench Leaderboard — https://livecodebench.github.io/leaderboard.html
- [6] EvalPlus HumanEval+ Leaderboard — https://evalplus.github.io/leaderboard.html
- [7] OpenAI: Introducing o3 and o4‑mini (reasoning‑focused; AIME) — https://openai.com/index/introducing-o3-and-o4-mini/
- [8] Google/DeepMind: Gemini 2.5 Pro (reasoning and coding focus) — https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/