Back to The Future: Evaluating AI Agents on Predicting Future Events
Published 7/17/2025
Authors
Federico Bianchi, Junlin Wang, Zain Hasan, Shang Zhu, Roy Yuan, Clémentine Fourrier, James Zou
Future of AI

Most current AI benchmarks focus on answering questions about the past, either by testing models on existing knowledge (statically, as in HLE or GPQA, or with augmentation, as in BrowseComp or GAIA) or on previously solved problems (like PaperBench, DABStep, or most coding evaluations). However, we believe that more valuable AI, and ultimately AGI, will be distinguished by its ability to use this past to forecast interesting aspects of the future, rather than merely reciting old facts.

Forecasting future events is a complex and holistic task: it requires sophisticated reasoning, synthesis, weighing of probabilities, and genuine understanding, rather than pattern matching against or searching existing information. Evaluating models on their ability to predict future outcomes, whether in science, economics, geopolitics, or technology, tests the kind of intelligence that creates real-world value.

Beyond its inherent importance, this forecasting-based approach also solves many methodological problems faced by current evaluations and benchmarks. Traditional benchmarks that measure accuracy on fixed test sets are inevitably affected by possible data contamination, and without access to the full reproducible training pipeline of a model, it's hard to trust the results. The most serious evaluation efforts now keep their test sets completely private, creating a frustrating arms race between evaluators and potential "gaming the leaderboard" mechanics (Singh et al., 2025).

Forecasting makes contamination impossible by design, as you can't train on data that doesn't yet exist! This creates a level playing field where success depends on reasoning capability rather than memorization.

Perhaps most importantly, predictions about the future are inherently verifiable. We can wait and see who was right, creating an objective, time-stamped measure of model performance.
We therefore propose evaluating agents on their ability to predict future events (Ye et al., 2024; Karger et al., 2025). FutureBench draws from real-world prediction markets and emerging news to create interesting prediction tasks grounded in actual future outcomes. We collect events from live news coverage and from prediction market platforms such as Manifold Markets, filtering them to focus on emerging events worth predicting. Using an agent-based approach, we curate scenarios that require genuine reasoning rather than simple pattern matching. Think geopolitical developments, market movements, or technology adoption trends: events where informed analysis actually matters.

Can Agents Predict Future Events?

This is the obvious question, and it's at the heart of what makes this benchmark interesting! We believe the answer cannot be a simple "yes" or "no", as it depends mostly on the actual questions; there are always important caveats to consider. Humans constantly use their ability to weigh current information to predict future events. Aren't most career moves, relationship choices, or even business strategies essentially bets on future outcomes?

Some predictions involve irreducible uncertainty (Will it rain on December 17th, 2027 at noon?), but many don't. When a skilled analyst predicts a company's quarterly earnings or a policy expert forecasts election outcomes, they're using available information to make informed decisions. This is precisely what we're asking AI agents to do with FutureBench! The task isn't to get agents to fortune-tell, but rather to synthesize information and reason under greater uncertainty than most other benchmarks require.

The agent's prediction quality directly reflects its ability to search for relevant information, synthesize complex data, and reason about cause-and-effect relationships. These are precisely the capabilities we want to measure in real-world applications. Tools like DeepResearch are already used for market analysis and strategic planning.
The quality of information collection strongly correlates with decision-making effectiveness. FutureBench is inspired by this evaluation process and tries to measure agent quality against objective, verifiable outcomes.

FutureBench

How We Generate Prediction Questions

Building a benchmark that tests real prediction capabilities requires a steady stream of meaningful questions. We've developed two complementary approaches that capture different types of future events:

1. News-Generated Questions: Finding Tomorrow's Headlines Today

Our first approach uses AI to mine current events for prediction opportunities. We deploy a smolagents-based agent to scrape a few major news websites, analyze front-page articles, and generate prediction questions about their likely outcomes. The agent reads through the front page, identifies interesting articles, and formulates specific, time-bound questions from their content, for example "Will the Federal Reserve cut interest rates by at least 0.25% by July 1st, 2025?". We guide this process with carefully crafted prompts that specify what makes a good prediction question: events that are meaningful, verifiable, and uncertain at extraction time.

Technical Stack:
- Model: DeepSeek-V3 for reasoning and question generation
- Scraping: Firecrawl for reliable content extraction
- Search: Tavily for additional context when needed
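To make the idea of a "specific, time-bound" question concrete, here is a minimal sketch of how one generated question could be represented and sanity-checked. This is not FutureBench's actual schema: the `PredictionQuestion` record, the `make_question` and `is_resolvable` helpers, the `horizon_days` parameter, and the `example.com` URL are all hypothetical names chosen for illustration. The seven-day default simply mirrors the one-week horizon the post describes for news-generated questions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PredictionQuestion:
    """Hypothetical record for one generated question (not FutureBench's schema)."""
    text: str
    source_url: str
    created_on: date   # when the question was extracted from the news
    resolve_by: date   # when we expect the outcome to be known


def make_question(text: str, source_url: str, created_on: date,
                  horizon_days: int = 7) -> PredictionQuestion:
    """Build a time-bound question, rejecting obviously malformed input."""
    cleaned = text.strip()
    if not cleaned.endswith("?"):
        raise ValueError("a prediction question should be phrased as a question")
    if horizon_days <= 0:
        raise ValueError("the resolution date must lie after the extraction date")
    resolve_by = created_on + timedelta(days=horizon_days)
    return PredictionQuestion(cleaned, source_url, created_on, resolve_by)


def is_resolvable(question: PredictionQuestion, today: date) -> bool:
    """A question can only be scored once its resolution date has been reached."""
    return today >= question.resolve_by


if __name__ == "__main__":
    q = make_question(
        "Will the Federal Reserve cut interest rates by at least 0.25% "
        "by July 1st, 2025?",
        "https://example.com/news",  # placeholder URL, assumption
        created_on=date(2025, 6, 24),
    )
    print(q.resolve_by, is_resolvable(q, date(2025, 6, 30)))
```

The checks here are purely syntactic. Whether a question is meaningful, verifiable, and genuinely uncertain at extraction time is a judgment the post delegates to the prompted agent, and this sketch does not attempt to automate it.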
The agent typically generates 5 questions per scraping session, each with a time horizon of one week: we assume the answer will be known seven days after the question is created. This gives us a natural pipeline of fresh evaluation material tied to real-world events.

2. Polymarket Integration: Leveraging…