Benchmark
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The starting point of the SWE-bench line of work
https://arxiv.org/pdf/2310.06770
SWE-bench Verified
https://www.swebench.com/verified.html
SWE-bench Lite
https://www.swebench.com/lite.html
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
SWE-bench Multilingual: multiple programming languages
https://arxiv.org/pdf/2504.02605
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
https://arxiv.org/pdf/2509.16941
Agent
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
https://arxiv.org/pdf/2405.15793
AutoCodeRover: Autonomous Program Improvement
https://arxiv.org/pdf/2404.05427
Agentless: Demystifying LLM-based Software Engineering Agents
Shows that a simple "localize → repair → validate" pipeline performs strongly
https://arxiv.org/pdf/2407.01489
[ICLR 2025] OpenHands: An Open Platform for AI Software Developers as Generalist Agents
https://arxiv.org/pdf/2407.16741
[ICML 2024] Executable Code Actions Elicit Better LLM Agents
CodeAct: unifies agent actions into executable Python code actions; important background for many later agent designs
https://arxiv.org/pdf/2402.01030
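To make the CodeAct idea concrete, here is a minimal sketch, assuming a toy agent loop in which every action is a Python snippet executed in a shared namespace and the captured output is returned as the observation. The helper name `run_code_action` is hypothetical and not taken from the paper, and a real system would run the code in a sandbox rather than with a bare `exec`.

```python
import contextlib
import io
import traceback


def run_code_action(code: str, namespace: dict) -> str:
    """Execute one code action and return its text observation.

    Hypothetical helper for illustration; not the CodeAct implementation.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # unsandboxed: for illustration only
    except Exception:
        # Errors come back as observations so the agent can react to them.
        buffer.write(traceback.format_exc(limit=1))
    return buffer.getvalue()


# State persists across actions because the namespace is shared.
namespace = {}
obs1 = run_code_action("x = 2 + 3\nprint(x)", namespace)
obs2 = run_code_action("print(x * 10)", namespace)
obs3 = run_code_action("print(undefined_name)", namespace)
```

Because the action space is just Python, variables defined in one step (`x` above) remain available to later steps, and a failed action yields a traceback the agent can read instead of crashing the loop.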
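Trajectory SFT / Data Generation / Verifier
[ICML 2025] Training Software Engineering Agents and Verifiers with SWE-Gym
The core contribution is 2,438 real-world Python task instances, each with an executable environment, unit tests, and a natural-language task description, together with SWE agents and verifiers trained on agent trajectories; the paper also reports gains on SWE-bench Verified/Lite from fine-tuning and from inference-time verifiers.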
https://arxiv.org/pdf/2412.21139
SWE-smith: Scaling Data for Software Engineering Agents
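Its value lies in generating data at scale: it builds an execution environment for any Python codebase, then automatically synthesizes task instances that break existing tests; the authors build a dataset of about 50k instances from 128 GitHub repositories and train SWE-agent-LM-32B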
https://arxiv.org/pdf/2504.21798
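The "synthesize tasks that break existing tests" step can be illustrated with a toy filter. This is only a sketch under simplifying assumptions: the function under test, the test cases, and the two candidate mutations are all hypothetical, and the real pipeline operates on whole repositories rather than a single function.

```python
def add(a, b):
    """Reference implementation that the existing tests were written for."""
    return a + b


def run_tests(fn):
    """Run the existing test cases against fn; return one bool per case."""
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    return [fn(*args) == expected for args, expected in cases]


# Hypothetical candidate modifications of `add`.
candidate_mutants = {
    "swap_operator": lambda a, b: a - b,  # changes behavior
    "no_op_rewrite": lambda a, b: b + a,  # behavior-preserving
}


def select_task_instances(mutants):
    """Keep only the mutants that make at least one existing test fail."""
    kept = []
    for name, fn in mutants.items():
        if not all(run_tests(fn)):
            kept.append(name)
    return kept


assert all(run_tests(add))  # the original code passes its tests
tasks = select_task_instances(candidate_mutants)
```

Only mutations that turn a passing test into a failing one are kept, since those are the ones for which the existing tests can later verify a fix.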
R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
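It addresses two key questions: how to procedurally construct more executable SWE environments, and how to do test-time scaling. It proposes SWEGEN, which generates tasks from commits using test generation and back-translation, and discusses how execution-based and execution-free verifiers complement each other.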
https://arxiv.org/pdf/2504.07164
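One way to picture the complementarity of the two verifier types is a reranker that mixes both signals. The sketch below assumes each candidate patch already has an execution-based score and an execution-free score; the `Candidate` fields, the weighted-sum rule, and the numbers are illustrative assumptions, not the combination rule used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    patch_id: str
    exec_score: float       # assumed: fraction of generated tests passed
    exec_free_score: float  # assumed: verifier-model estimate of success


def hybrid_rank(candidates, weight=0.5):
    """Sort candidates by a weighted sum of the two scores (best first)."""
    def score(c):
        return weight * c.exec_score + (1.0 - weight) * c.exec_free_score
    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("a", exec_score=1.0, exec_free_score=0.2),
    Candidate("b", exec_score=1.0, exec_free_score=0.9),
    Candidate("c", exec_score=0.5, exec_free_score=0.95),
]
best = hybrid_rank(candidates)[0]
```

In this toy data, execution alone cannot separate patches `a` and `b` (both pass every test), while the execution-free score alone would prefer `c`, which fails half the tests; the mixed score selects `b`.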
SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Emphasizes synthesized tests, scaling up agent trajectories, and scaling both training and inference
https://arxiv.org/pdf/2505.16975
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
It treats agentless workflow training as a "skill prior" for localization, code editing, self-reflection, and so on, then uses SFT on public trajectories to adapt the model to agentic use
https://arxiv.org/pdf/2509.23045
RL
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
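It uses open-source software evolution data as the RL training source, with a lightweight rule-based reward (for example, the similarity between the generated solution and the ground-truth solution), to train the model to recover developers' reasoning and problem-solving process; the paper reports that Llama3-SWE-RL-70B reaches 41.0% on SWE-bench Verified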
https://arxiv.org/pdf/2502.18449
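A similarity-based rule reward of this kind can be sketched in a few lines. The version below assumes the similarity is a character-level sequence-matching ratio from Python's standard `difflib`, and that an unparseable output (represented here as `None`) receives a fixed penalty of -1; both choices, and the function name `similarity_reward`, are assumptions made for illustration.

```python
import difflib


def similarity_reward(predicted_patch, oracle_patch):
    """Rule-based reward: -1.0 for malformed output, else similarity in [0, 1].

    `predicted_patch` is None when the model output could not be parsed
    into a patch (an assumed convention for this sketch).
    """
    if predicted_patch is None:
        return -1.0
    matcher = difflib.SequenceMatcher(None, predicted_patch, oracle_patch)
    return matcher.ratio()


oracle = "-    return a - b\n+    return a + b\n"
exact = similarity_reward(oracle, oracle)
partial = similarity_reward("-    return a - b\n+    return b + a\n", oracle)
malformed = similarity_reward(None, oracle)
```

The appeal of such a reward is that it needs no executable environment: it can be computed for any pull request with a known fix, at the cost of penalizing correct patches that differ textually from the developer's.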
DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL
An RL-only coding agent training recipe that has drawn wide attention from the community and industry
https://www.together.ai/blog/deepswe
SWE-RM: Execution-free Feedback For Software Engineering Agents
Focuses on an execution-free reward model, aiming to provide finer-grained feedback than unit tests for test-time scaling and RL
https://arxiv.org/pdf/2512.21919
Toward Training Superintelligent Software Agents through Self-Play SWE-RL
Attempts to collect RL experience without relying on human-written issues or tests, by having the agent inject and fix bugs itself
https://arxiv.org/pdf/2512.18552
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute
Discusses internal/external TTC (test-time compute), trajectory synthesis, and development-process-based search
https://arxiv.org/pdf/2503.23803
SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
Attempts to reuse existing trajectories and branch at key intermediate steps, reducing the cost of repeated sampling
https://arxiv.org/pdf/2601.22129
Evaluation Reliability
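Recommended reading here: SWE-Bench+, UTBoost / Rigorous Evaluation of Coding Agents on SWE-Bench, The SWE-Bench Illusion, and SPICE. UTBoost points out that the original tests can be too narrow, so a generated patch may pass them without actually resolving the issue; the work finds that a substantial share of entries on the SWE-bench Lite and Verified leaderboards change rank once tests are augmented and the parser is fixed. The SWE-Bench Illusion uses a diagnostic task, guessing the buggy file path from the issue description alone, to examine memorization and contamination. SPICE (ASE 2025) attempts to automatically label issue clarity, test coverage, and effort estimation, turning data-quality assessment into a scalable process.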
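The "passes the tests but does not fix the issue" failure mode is easy to reproduce in miniature. The following toy example is entirely hypothetical (the patch, the test suites, and their names are invented for illustration) and only shows why augmenting a narrow test suite can flip a verdict.

```python
def patched_abs(x):
    """Hypothetical 'plausible' patch: handles only the input from the issue."""
    return -x if x == -1 else x


def original_tests(fn):
    """Narrow suite: covers the reported input and one positive value."""
    return fn(-1) == 1 and fn(3) == 3


def augmented_tests(fn):
    """Augmented suite: adds cases the original tests did not cover."""
    return original_tests(fn) and fn(-7) == 7 and fn(0) == 0


passes_original = original_tests(patched_abs)
passes_augmented = augmented_tests(patched_abs)
reference_ok = augmented_tests(abs)  # the built-in abs passes both suites
```

Here the patch is counted as resolved under the original suite and as unresolved under the augmented one, which is the kind of verdict change that can reorder a leaderboard.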