longhorizon.sh
Papers and resources on long-horizon AI agents
November 2024
Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
Nov 22, 2024 · METR Blog
RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
Nov 22, 2024 · arXiv
February 2025
Reinforcement Learning for Long-Horizon Interactive LLM Agents
Feb 2025 · alphaXiv
March 2025
Measuring AI Ability to Complete Long Tasks
Mar 19, 2025 · METR Blog
HCAST: Human-Calibrated Autonomy Software Tasks
Mar 21, 2025 · arXiv
July 2025
How Does Time Horizon Vary Across Domains?
Jul 14, 2025 · METR Blog
September 2025
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Sep 11, 2025 · alphaXiv
WebResearcher: Unleashing Unbounded Reasoning Capability in Long-Horizon Agents
Sep 16, 2025 · alphaXiv
December 2025
Long-Horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Dec 15, 2025 · Hugging Face
February 2026
OdysseyArena: Benchmarking LLMs for Long-Horizon, Active and Inductive Interactions
Feb 5, 2026 · arXiv
Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Feb 11, 2026 · Hugging Face
KLong: Training LLM Agent for Extremely Long-Horizon Tasks
Feb 19, 2026 · alphaXiv
March 2026
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
Mar 20, 2026 · Hugging Face
A Subgoal-Driven Framework for Improving Long-Horizon LLM Agents
Mar 20, 2026 · arXiv