Scaling Context to 10M+ Tokens with Recursive Language Models

How Recursive Language Models (RLMs) from MIT CSAIL rethink long-context AI: by treating prompts as external environments and spending inference-time compute, RLMs sidestep “context rot” and scale to 10 million+ tokens, outperforming standard large context windows on complex reasoning tasks.

For data scientists and AI engineers, the “context window” has long been a defining constraint. While we have seen windows grow from 4k to 128k and even 1M+ tokens, simply making the window larger hasn’t been a silver bullet. We still face “context rot”—the phenomenon where model performance degrades as the prompt length increases, even if the input technically fits within the window.

A recent paper from MIT CSAIL proposes a shift in perspective. Instead of trying to cram more tokens into the neural network’s immediate view, they introduce Recursive Language Models (RLMs). This is a general inference strategy that treats context management not as an architectural problem, but as an inference-time compute problem.

“It turns out that models can be far more powerful if you allow them to treat their own prompts as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs!”

That is how paper author Alex L. Zhang summarizes the key insight in a post on X.

Here is a breakdown of how RLMs work and why they outperform traditional long-context scaffolds.

The Core Concept: Out-of-Core Processing for LLMs

The authors draw an analogy to out-of-core algorithms in classical computing. When a dataset is too large to fit in main memory (RAM), we don’t buy infinite RAM; we write algorithms that fetch data from disk in manageable chunks only when needed.
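
As a rough illustration of that pattern (hypothetical code, not from the paper), an out-of-core scan keeps only one chunk in memory at a time:

```python
# Hypothetical out-of-core scan: the file never fits in RAM, so we stream it.
def count_matches(path: str, keyword: str, chunk_size: int = 1 << 20) -> int:
    total = 0
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)  # only ~1 MB of text is resident at once
            if not chunk:
                break
            # Matches split across chunk boundaries are ignored here for brevity.
            total += chunk.count(keyword)
    return total
```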

RLMs apply this logic to Large Language Models:

  1. The Prompt is the Environment: Instead of feeding a 500k-token prompt directly into the transformer, the RLM treats the prompt as an external artifact—specifically, a string variable stored in a Python REPL environment (Read-Eval-Print Loop, an interactive coding shell where the model can execute code and see results immediately).
  2. The LLM is the Controller: The LLM acts as an agent that can interact with this environment. It writes code to inspect the prompt (e.g., checking its length, reading specific snippets).
  3. Recursion is the Mechanism: Crucially, the LLM can write code to call itself. It can decompose a massive prompt into sub-problems and invoke a fresh instance of the LLM to process those specific chunks.
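
Concretely, the setup can be sketched like this. It is a minimal, single-turn simplification under illustrative names (llm_query, prompt_str, recursive_lm are not the paper’s actual API), and the real system interacts with the REPL over multiple turns:

```python
def llm_query(sub_prompt: str) -> str:
    """Hypothetical helper: send sub_prompt to a fresh LLM call and return its reply."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def recursive_lm(prompt: str, question: str) -> str:
    # 1. The prompt is the environment: a plain string variable, never shown in full.
    env = {"prompt_str": prompt, "llm_query": llm_query, "answer": None}

    # 2. The LLM is the controller: it sees only metadata and writes code against the variable.
    root_instructions = (
        f"A string variable `prompt_str` of length {len(prompt):,} is available. "
        f"Write Python that inspects it, calling llm_query() on manageable chunks, "
        f"and store the final answer in `answer`. Question: {question}"
    )

    # 3. Recursion is the mechanism: the code the root LLM emits may itself call llm_query().
    generated_code = llm_query(root_instructions)
    exec(generated_code, env)
    return env["answer"]
```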

How it Works Under the Hood

The architecture is surprisingly simple but creates complex behaviors. When an RLM receives a prompt P:

  1. Initialization: A Python REPL is started. The prompt P is loaded into memory as a variable (e.g., prompt_str).
  2. Interaction: The “Root” LLM is given instructions on how to use the REPL. It does not see the full prompt P. It sees metadata (e.g., “You have a string of length 1,000,000”).
  3. Execution & Recursion: The Root LLM generates Python code to manipulate prompt_str.
    • Example: It might split the string by “Chapter” headers.
    • Recursion: It iterates through these chapters and calls a function llm_query(sub_prompt). This spawns a sub-LLM (a recursive call) to process that specific chapter.
  4. Aggregation: The results from the sub-calls are stored in REPL variables, which the Root LLM can then synthesize into a final answer, as the sketch below illustrates.
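
Inside the REPL, the code emitted by the Root LLM might look something like this (an illustrative trajectory reusing the hypothetical prompt_str and llm_query names from the earlier sketch, not an actual transcript from the paper):

```python
# Step 3: decompose the huge string along a structural boundary.
chapters = prompt_str.split("Chapter ")

# Recursion: each chapter goes to a fresh sub-LLM that never sees the rest of the prompt.
notes = []
for i, chapter in enumerate(chapters):
    if not chapter.strip():
        continue
    notes.append(llm_query(
        f"Summarize the facts relevant to the question in chapter {i}:\n{chapter}"
    ))

# Step 4: aggregation happens in ordinary REPL variables, then one final synthesis call.
answer = llm_query("Combine these per-chapter notes into a single answer:\n" + "\n".join(notes))
```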

RLM vs. RAG and Summarization

Standard approaches to long context usually involve Context Compaction (summarizing old text to make room for new text) or RAG (retrieving top-k chunks).

The paper argues these are insufficient for “information-dense” tasks:

  • Summarization is lossy: You inevitably drop details that might be needed later.
  • RAG lacks global structure: Retrieving disjoint chunks breaks the narrative or logical flow required for complex reasoning.

RLMs differ because they allow the model to programmatically determine how to traverse the data. If the task requires linear complexity (reading every line), the RLM can write a loop to do exactly that. If it requires a binary search, the RLM can write logic to jump around the text.
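
For instance, if the context happened to be chronologically sorted log lines, the model could write an ordinary binary search instead of reading everything. The following is illustrative code under that assumption, again reusing the hypothetical prompt_str and llm_query:

```python
# Assumes each line starts with an ISO date, e.g. "2024-06-01 ...", so prefixes sort correctly.
lines = prompt_str.splitlines()
target = "2024-06-01"  # hypothetical date we want to locate

lo, hi = 0, len(lines)
while lo < hi:
    mid = (lo + hi) // 2
    if lines[mid][:10] < target:
        lo = mid + 1
    else:
        hi = mid

# Only a small window around the located position ever reaches an LLM call.
window = "\n".join(lines[lo:lo + 50])
answer = llm_query(f"What happened on {target}?\n{window}")
```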

Performance and Results

The researchers evaluated RLMs using GPT-5 and Qwen3-Coder on tasks designed to break context windows, such as OOLONG (a benchmark where the answer depends on nearly every line of input) and S-NIAH (Single Needle in a Haystack).

Key Findings:

  1. Scaling Beyond Physical Limits: On simple retrieval tasks like S-NIAH, which only require locating a single piece of information (O(1) complexity), frontier models like GPT-5 already perform well within their context windows. RLMs, however, allow this capability to scale orders of magnitude further, maintaining near-perfect performance up to 10 million tokens, a scale where standard models physically cannot accept the input.
  2. Complex Reasoning: On the OOLONG-Pairs task (which requires quadratic processing cost), base models failed almost completely (<0.1% score). RLMs achieved viable scores (up to 58%) by effectively breaking the problem down.
  3. Cost Efficiency: Surprisingly, RLMs were often cheaper or comparable in cost to base models for the same tasks. While the RLM makes multiple calls, it only processes relevant tokens for each call, avoiding the quadratic attention cost of processing the entire massive prompt at once.
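
A back-of-envelope calculation shows why (purely illustrative arithmetic, not numbers from the paper): self-attention cost grows roughly quadratically with sequence length, so splitting one huge call into many small ones changes the economics.

```python
n = 1_000_000   # total prompt tokens
k = 100         # number of chunks the RLM processes separately

monolithic_attention = n ** 2            # one call over the full prompt
chunked_attention = k * (n // k) ** 2    # k calls over n/k tokens each = n**2 / k

print(chunked_attention / monolithic_attention)  # 0.01, i.e. ~100x less attention compute
```

This ignores the Root LLM’s own calls and any overlap between chunks, but it captures why processing only relevant slices can be cheaper than one monolithic pass.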

Emergent Behaviors

One of the most interesting aspects of the paper is that the models were not explicitly trained to be RLMs; they were just prompted to use the REPL. Yet, they exhibited sophisticated strategies:

  • Regex Filtering: The models frequently used Python’s re module to “grep” the prompt for keywords before reading, drastically reducing the token count needed for processing.
  • Self-Verification: In several trajectories, the RLM would generate an answer, then spawn a new sub-LLM call specifically to verify that answer against the raw text before returning it.
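
Put together, such a trajectory might look like the following (a hypothetical reconstruction rather than a transcript from the paper, reusing the same illustrative helpers):

```python
import re

# Regex filtering: "grep" the prompt for a keyword before reading anything in full.
hits = [m.start() for m in re.finditer(r"quarterly revenue", prompt_str, re.IGNORECASE)]
excerpts = "\n---\n".join(prompt_str[max(0, i - 500): i + 500] for i in hits)

# First attempt at an answer, using only the small excerpts.
draft = llm_query("Answer the question using these excerpts:\n" + excerpts)

# Self-verification: a fresh sub-LLM checks the draft against the raw text before returning it.
verdict = llm_query(f"Do the excerpts support this answer? Reply yes or no.\n{draft}\n\n{excerpts}")
answer = draft if verdict.strip().lower().startswith("yes") else llm_query(
    "The previous answer was not supported; answer again, quoting the excerpts:\n" + excerpts
)
```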

Conclusion

Recursive Language Models suggest that the future of long-context AI might not rely solely on larger context windows (which are computationally expensive to train and serve). Instead, inference-time compute—allowing the model to “think” by running code and recursively calling itself—offers a scalable path to processing inputs of essentially infinite length.

For researchers and engineers, this points toward a new design pattern: treating LLMs not just as text processors, but as recursive operators over data environments.

Source: Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601.
