SWE Paper List

Benchmark

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    • The original SWE-bench paper and the starting point for this line of work

    • https://arxiv.org/pdf/2310.06770

  • SWE-bench Verified

    • https://www.swebench.com/verified.html

  • SWE-bench Lite

    • https://www.swebench.com/lite.html

  • Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

    • SWE-bench Multilingual: multilingual

    • https://arxiv.org/pdf/2504.02605

  • SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    • https://arxiv.org/pdf/2509.16941

Agent

  • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    • https://arxiv.org/pdf/2405.15793

  • AutoCodeRover: Autonomous Program Improvement

    • https://arxiv.org/pdf/2404.05427

  • Agentless: Demystifying LLM-based Software Engineering Agents

    • Shows that a simple "localize → repair → validate" pipeline is a strong baseline

    • https://arxiv.org/pdf/2407.01489

  • [ICLR 2025] OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    • https://arxiv.org/pdf/2407.16741

  • [ICML 2024] Executable Code Actions Elicit Better LLM Agents

    • CodeAct: unifies agent actions into executable Python code; important background for many later agent designs

    • https://arxiv.org/pdf/2402.01030
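
The CodeAct idea listed above (an agent action is a piece of executable Python, and the execution result is the observation) can be illustrated with a minimal sketch. The function name `run_code_action` and the shared `namespace` dict are hypothetical and not taken from the paper; a real system would run the code in a sandbox rather than with a bare `exec`.

```python
import contextlib
import io
import traceback


def run_code_action(code: str, namespace: dict) -> str:
    """Execute one code action and return its textual observation.

    `namespace` persists across calls, so variables defined by one
    action stay visible to later actions (hypothetical design choice).
    NOTE: bare exec() is unsafe; this is for illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Errors become part of the observation so the agent can react to them.
        buffer.write(traceback.format_exc())
    return buffer.getvalue()


if __name__ == "__main__":
    state: dict = {}
    print(run_code_action("files = ['a.py', 'b.py']\nprint(len(files))", state))
    print(run_code_action("print(files[0])", state))
    print(run_code_action("print(undefined_name)", state))
```

Returning the traceback as text, rather than raising, keeps every action's outcome in the same channel, which is one plausible way to let the model self-correct on the next turn.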

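Trajectory SFT / Data Generation / Verifier

  • [ICML 2025] Training Software Engineering Agents and Verifiers with SWE-Gym

    • Its core contribution is 2,438 real-world Python tasks with executable environments, unit tests, and natural-language task descriptions, plus the use of agent trajectories to train SWE agents and verifiers. The paper also reports gains on SWE-bench Verified/Lite from both fine-tuning and an inference-time verifier.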

    • https://arxiv.org/pdf/2412.21139

  • SWE-smith: Scaling Data for Software Engineering Agents

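    • Its value lies in generating data at scale: it builds an execution environment for any Python codebase, then automatically synthesizes tasks that break existing tests. The authors construct a dataset on the order of 50k instances from 128 GitHub repositories and use it to train SWE-agent-LM-32B.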

    • https://arxiv.org/pdf/2504.21798

  • R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents

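    • Addresses two key questions: how to procedurally construct more executable SWE environments, and how to do test-time scaling. It proposes SWEGEN, which generates tasks from commits via test generation and back-translation, and discusses how execution-based and execution-free verifiers complement each other.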

    • https://arxiv.org/pdf/2504.07164

  • SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

    • Emphasizes synthesized tests, scaled-up agent trajectories, and scaling at both training and inference time

    • https://arxiv.org/pdf/2505.16975

  • Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

    • Treats agentless workflow training as a source of "skill priors" such as localization, code editing, and self-reflection, then runs SFT on public trajectories to adapt the model to agent use

    • https://arxiv.org/pdf/2509.23045
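
Several entries above use a verifier at inference time to pick one patch out of many sampled candidates, and R2E-Gym discusses combining execution-based and execution-free signals. The sketch below shows one simple way such a combination could look: a weighted sum of a test pass rate and a model score. The `Candidate` fields, the weighted-sum rule, and the default `weight` are illustrative assumptions, not the method of any listed paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    patch_id: str
    tests_passed: int   # execution-based signal: tests that pass on this patch
    tests_total: int
    model_score: float  # execution-free signal in [0, 1], e.g. a learned verifier


def hybrid_score(c: Candidate, weight: float = 0.5) -> float:
    """Blend both signals; weight=1.0 uses only tests, 0.0 only the model."""
    pass_rate = c.tests_passed / c.tests_total if c.tests_total else 0.0
    return weight * pass_rate + (1.0 - weight) * c.model_score


def select_best(candidates: Sequence[Candidate], weight: float = 0.5) -> Candidate:
    """Best-of-n selection: return the candidate with the highest blended score."""
    if not candidates:
        raise ValueError("no candidates to rank")
    return max(candidates, key=lambda c: hybrid_score(c, weight))


if __name__ == "__main__":
    pool = [
        Candidate("a", 3, 4, 0.2),
        Candidate("b", 4, 4, 0.1),
        Candidate("c", 2, 4, 0.9),
    ]
    print(select_best(pool).patch_id)
    print(select_best(pool, weight=1.0).patch_id)
```

Varying `weight` makes the trade-off explicit: tests alone can be fooled by weak coverage, while a model score alone never observes the patch running.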

RL

  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

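    • Uses open-source software evolution data as the RL training source with lightweight rule-based rewards, such as the similarity between the generated solution and the ground-truth solution, to train the model to recover developers' reasoning and problem-solving process. The paper reports that Llama3-SWE-RL-70B reaches 41.0% on SWE-bench Verified.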

    • https://arxiv.org/pdf/2502.18449

  • DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL

    • An RL-only coding-agent training recipe that has drawn considerable attention from the community and industry

    • https://www.together.ai/blog/deepswe

  • SWE-RM: Execution-free Feedback For Software Engineering Agents

    • Focuses on an execution-free reward model, aiming to provide finer-grained feedback than unit tests in test-time scaling and RL

    • https://arxiv.org/pdf/2512.21919

  • Toward Training Superintelligent Software Agents through Self-Play SWE-RL

    • Attempts to collect RL experience without human-written issues or tests, by having the agent inject and repair bugs itself

    • https://arxiv.org/pdf/2512.18552

  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

    • Discusses internal and external TTC (test-time compute), trajectory synthesis, and development-process-based search

    • https://arxiv.org/pdf/2503.23803

  • SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

    • Attempts to reuse existing trajectories by branching at key intermediate steps, reducing the cost of repeated sampling

    • https://arxiv.org/pdf/2601.22129
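
The SWE-RL entry above mentions a lightweight rule-based reward built on the similarity between the generated solution and the ground-truth solution. A minimal sketch of such a reward follows. The use of `difflib.SequenceMatcher`, the `-1.0` penalty for missing output, and the function name `similarity_reward` are assumptions made for illustration; the paper should be consulted for its exact definition.

```python
import difflib
from typing import Optional


def similarity_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Return a reward in [0, 1] for a usable patch, or -1.0 if there is none.

    The -1.0 penalty for empty or missing output is an assumed convention.
    """
    if predicted_patch is None or not predicted_patch.strip():
        return -1.0
    matcher = difflib.SequenceMatcher(None, predicted_patch, oracle_patch)
    # ratio() is 2 * matches / total length, so identical strings score 1.0.
    return matcher.ratio()


if __name__ == "__main__":
    oracle = "-    return a - b\n+    return a + b\n"
    print(similarity_reward(oracle, oracle))
    print(similarity_reward("-    return a - b\n+    return b + a\n", oracle))
    print(similarity_reward(None, oracle))
```

A reward like this needs no test execution, which keeps it cheap to compute over large amounts of historical change data, at the cost of penalizing correct patches that differ textually from the reference.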

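Evaluation Reliability

Recommended reading here: SWE-Bench+, UTBoost (Rigorous Evaluation of Coding Agents on SWE-Bench), The SWE-Bench Illusion, and SPICE. UTBoost points out that the original tests can be too narrow, so a generated patch may pass them without actually resolving the issue; the work finds that a substantial share of entries on the SWE-bench Lite and Verified leaderboards change rank after test augmentation and parser fixes. The SWE-Bench Illusion uses a diagnostic task, predicting the buggy file path from the issue description alone, to examine memorization and contamination. SPICE (ASE 2025) attempts to automatically label issue clarity, test coverage, and effort estimation, turning data-quality assessment into a scalable process.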
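
To make the rank-change effect concrete, the sketch below recomputes a leaderboard after some patches that passed the original tests fail the augmented ones. The agent names and resolved counts are made-up toy numbers, and the helpers `rank` and `rank_changes` are hypothetical; nothing here reproduces UTBoost's data or tooling.

```python
from typing import Dict


def rank(resolved_counts: Dict[str, int]) -> Dict[str, int]:
    """Return 1-based ranks: more resolved instances first, ties broken by name."""
    ordered = sorted(resolved_counts, key=lambda name: (-resolved_counts[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_changes(original: Dict[str, int], augmented: Dict[str, int]) -> Dict[str, int]:
    """Map each agent whose rank moved to (old rank - new rank); positive = moved up.

    Assumes both dicts contain the same agent names.
    """
    before, after = rank(original), rank(augmented)
    return {
        name: before[name] - after[name]
        for name in before
        if before[name] != after[name]
    }


if __name__ == "__main__":
    # Toy numbers only: resolved counts under original vs. augmented tests.
    original = {"agent_a": 50, "agent_b": 48, "agent_c": 40}
    augmented = {"agent_a": 44, "agent_b": 46, "agent_c": 40}
    print(rank_changes(original, augmented))
```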

LICENSED UNDER CC BY-NC-SA 4.0