
For Coding

Treat public benchmarks as a directional signal only: results on your own repo, tools, and constraints are what matter. Prefer real‑world evals (SWE‑bench Verified, LiveCodeBench) over narrow function‑level tests.


  • Anthropic Claude 3.5/3.7 Sonnet

    • Why: Strong on real‑world coding and agentic tool use.
    • Evidence: 49% on SWE‑bench Verified (SOTA at time of post) [1]; 93.7% HumanEval (class‑leading at time of post) [2].
    • Use for: Complex refactors, issue resolution, code reviews.
  • xAI Grok Code Fast 1

    • Why: Tuned for fast agentic coding with competitive quality.
    • Evidence: 70.8% on SWE‑bench Verified (internal harness, vendor‑reported) [3].
    • Use for: Fast prototyping, rapid iterations, CLI/code scaffolding.
  • Qwen 2.5/3 Coder (family)

    • Why: Strong open‑source coder models across sizes; good latency/cost balance.
    • Evidence: The technical report shows improvements on HumanEval/MBPP [4]; the models rank competitively among open‑source entries on LiveCodeBench [5].
    • Use for: OSS‑first stacks, local workflows, budget‑conscious coding.
  • OpenAI o4‑mini / o3

    • Why: Reasoning‑focused models that pair well with tools.
    • Evidence: OpenAI reports top AIME 2024/2025 results and strong reasoning performance, and positions both models for coding and tool use [7].
    • Use for: Multi‑step coding plans, orchestration, complex analysis.
  • Gemini 2.5 Pro

    • Why: Long‑context and multimodal strengths; improved coding focus.
    • Evidence: Google/DeepMind posts highlight state‑of‑the‑art reasoning and coding improvements [8].
    • Use for: Long doc/code summarization, design discussions, planning.

Quick Selection by Use Case

| Task | Suggested Models | Notes |
|---|---|---|
| New feature (greenfield) | Qwen 3 Coder, Grok Code Fast 1 | Fast iterations; cost‑aware |
| Bug fix (targeted) | Claude 3.5/3.7 Sonnet, Qwen 3 Coder | Strong retrieval + precise edits |
| Refactor (no behavior change) | Claude 3.5/3.7 Sonnet, o4‑mini | Ask for tests; small diffs |
| Code review | Claude 3.5/3.7 Sonnet | Review narrative and risk analysis |
| Large context summarization | Gemini 2.5 Pro | Long docs/specs/designs |
| Agentic workflows (tools) | Claude 3.5/3.7 Sonnet, Grok Code Fast 1 | Pairs well with Cascade tools |
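
If you want these suggestions available to a script or team config, the table above can be encoded as a plain lookup. This is a minimal sketch only: the task keys (`new_feature`, `bug_fix`, and so on) and the helper name `suggest_models` are hypothetical, and the model names are display labels copied from the table, not API identifiers.

```python
# Illustrative lookup mirroring the "Quick Selection by Use Case" table.
# Keys and function name are hypothetical; values are display labels,
# not provider model IDs.
SUGGESTED_MODELS: dict[str, list[str]] = {
    "new_feature": ["Qwen 3 Coder", "Grok Code Fast 1"],
    "bug_fix": ["Claude 3.5/3.7 Sonnet", "Qwen 3 Coder"],
    "refactor": ["Claude 3.5/3.7 Sonnet", "o4-mini"],
    "code_review": ["Claude 3.5/3.7 Sonnet"],
    "large_context_summary": ["Gemini 2.5 Pro"],
    "agentic_workflow": ["Claude 3.5/3.7 Sonnet", "Grok Code Fast 1"],
}


def suggest_models(task: str) -> list[str]:
    """Return the suggested models for a task, first entry listed first."""
    if task not in SUGGESTED_MODELS:
        known = ", ".join(sorted(SUGGESTED_MODELS))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(SUGGESTED_MODELS[task])
```

For example, `suggest_models("refactor")` returns the two models listed in the refactor row; an unrecognized task raises `ValueError` rather than silently falling back to a default.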

Tips

  • Benchmark fit: Prefer SWE‑bench Verified and LiveCodeBench for repo‑level coding; use HumanEval+/MBPP+ for function‑level generation.
  • Iterate: Start with faster/cost‑efficient models; escalate to heavier reasoning only when needed.
  • Guardrails: Always run tests/linters after edits; keep diffs small and revertible.
  • Context: Use @‑mentions and Planning Mode to keep tasks scoped and reproducible.
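
The Iterate and Guardrails tips can be combined into one loop: try the cheaper model first, run your checks, revert on failure, and only then escalate. The sketch below assumes you supply your own callables. `apply_edit`, `checks_pass`, and `revert` are hypothetical hooks, not part of any real tool's API, and the default check command assumes pytest is installed in your environment.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Assumption: the project uses pytest; swap in your own test/lint commands.
DEFAULT_CHECKS: list[list[str]] = [[sys.executable, "-m", "pytest", "-q"]]


def run_checks(commands: Sequence[Sequence[str]]) -> bool:
    """Return True only if every command (tests, linters) exits with code 0."""
    for cmd in commands:
        if subprocess.run(list(cmd), capture_output=True).returncode != 0:
            return False
    return True


def escalate(
    models: Sequence[str],
    apply_edit: Callable[[str], None],
    checks_pass: Callable[[], bool],
    revert: Callable[[], None],
) -> Optional[str]:
    """Try models in order (cheapest first); return the first whose edit passes.

    After each failed attempt the edit is reverted, so the next model
    starts from a clean state. Returns None if no model's edit passes.
    """
    for model in models:
        apply_edit(model)   # hypothetical hook: ask `model` to make the change
        if checks_pass():   # e.g. lambda: run_checks(DEFAULT_CHECKS)
            return model
        revert()            # e.g. restore the working tree to the last commit
    return None
```

Keeping each attempt small and revertible is what makes this loop cheap: a failed attempt from the fast model costs one revert, not a manual cleanup.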

Sources