Helping LLMs and Robots Interact Smartly · A Systematic Effort from Xiangyang Ji's Group at Tsinghua

When training large models and embodied agents, the main cost often lies not in raw compute but in interaction: between prompts and the model, and between the agent and its environment. How to organize these interactions more efficiently is drawing growing attention. Starting from the perspective of the Bayesian brain, Xiangyang Ji's group at Tsinghua University proposes an active-cognition framework built on predictive sampling, organized along a single unified thread, and validates it across several settings, including large models and embodied agents.

This article is not a summary of isolated papers; it follows a research thread that runs through several results from Ji's group in the pursuit of AGI. The thread has been developed systematically over the past few years and offers a new scaling perspective through conditional interaction. From our Nature Communications work, which lays the theoretical foundation, to a series of results published at machine-learning venues such as ICML, KDD, and ICLR, our model predictive sampling has gradually grown into a methodological system spanning embodied decision-making and LLM post-training.

When training a large model, or training a robot to perform complex tasks, the main bottleneck often lies not in compute itself but in deciding which tasks to choose and how to steer the interactions between data and model, and between agent and environment.

Here "interaction" is a core concept that runs through the whole article. In a large model, feeding a sequence into the model and having it produce the next token step by step is itself one interaction between the model and the data; in reinforcement learning, repeatedly sampling a prompt to generate multiple rollouts is a denser form of interaction; and in embodied intelligence, an agent taking actions in the environment and observing the feedback is an interaction with the physical world. Each of these carries a cost: a single forward pass or sampling step of a large model consumes memory and compute, while a single environment interaction of an agent costs time, wears down hardware, and sometimes carries safety risk. As training scales up, the total cost of interaction often becomes a more practical constraint than simply stacking compute.

Intuitively, human learning is never "uniform drilling" — we spend more time on our weak spots and less on what we have already mastered. Yet most mainstream model training still proceeds by "uniform sampling": each round picks a random batch of tasks, updates the model, and repeats. This causes easy samples to consume compute over and over, while critical tasks in the tail and out-of-distribution (OOD) tasks remain undersampled for a long time.

Looking further, as training scales up, this problem concentrates into a sharp interaction bottleneck: to judge "which problem to practice," one usually has to first evaluate how hard each task is, and that evaluation is itself an expensive, even risky, interaction. This pain point appears in different forms in two rapidly scaling scenarios, which form the two core motivations of the Model Predictive Task Sampling (MPTS) line of work.

01 LLM Fine-tuning & RL Fine-tuning

In RL fine-tuning (RLFT) of reasoning models, every step requires the model to generate many rollouts for a batch of prompts, then estimate advantages and update the policy. The bottleneck: prompts that are too easy or too hard produce almost no useful gradient signal, yet still occupy inference compute. This means a sizable share of compute is spent on low-information rollouts, an easily overlooked overhead in LLM post-training. And trying to avoid it via "evaluate first, then filter" requires running those rollouts anyway, creating a circular dependency.

Pain point: low-information rollouts consume compute
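
To make the "almost no useful gradient signal" point concrete, here is a minimal sketch. It assumes a group-relative advantage (each rollout's reward normalized against the other rollouts of the same prompt) and binary correctness rewards; the article does not say which advantage estimator is used, so treat the estimator, the function name, and the example rewards as illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own prompt group.

    Assumed estimator for illustration only; not taken from the article.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical prompts, eight rollouts each (reward 1 = correct answer).
too_easy = [1, 1, 1, 1, 1, 1, 1, 1]  # always solved
too_hard = [0, 0, 0, 0, 0, 0, 0, 0]  # never solved
medium = [1, 0, 1, 0, 0, 1, 1, 0]    # solved about half the time

for name, rewards in [("too_easy", too_easy), ("too_hard", too_hard), ("medium", medium)]:
    advantages = group_relative_advantages(rewards)
    signal = sum(abs(a) for a in advantages)
    print(f"{name:9s} total |advantage| = {signal:.2f}")
```

Under this assumption, the all-correct and all-wrong groups yield advantages of exactly zero, so their rollouts cost inference compute without moving the policy, while the mixed group carries the learning signal.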

02 Adaptive RL for Embodied Agents

For an embodied agent, evaluating how hard a task is often means actually interacting with the environment. But interaction in the physical world is subject to hard constraints: it takes time, wears down hardware, and may bring safety risks, for example a robot falling over, a manipulator colliding, or an autonomous vehicle facing extreme road conditions. In such settings, the cost of trial and error is not only time and compute but possibly physical consequences that are hard to recover from. The larger the scale and the more tasks there are, the less realistic it becomes to rank difficulty by exhaustive interaction.

Pain point: environment interaction is expensive and risky

The MPTS line of work targets exactly these two scaling pain points: rather than paying a high price to evaluate every task, it trains a lightweight model to predict the evaluation result, thereby reducing low-information rollouts and high-risk environment trial-and-error. Behind it is a prediction-centered sampling approach.

MPTS: Replacing "Evaluation" with "Prediction"

The answer from Prof. Xiangyang Ji's group in the Department of Automation, Tsinghua University, is Model Predictive Task Sampling (MPTS).

MPTS's core insight comes from an analogy with neuroscience: the human brain is energy-efficient — it does not need to actually experience something to estimate how hard it is, but rather builds an internal model from past experience, "simulates" the outcome in the mind, and then decides whether it is worth committing attention to.

MPTS uses a generative model to model the "episodic optimization process" of learning, and predicts the adaptation risk of each task via posterior inference — that is, how hard the task is for the current model. This lightweight "risk-predictive model (RPM)" is updated online, amortizing the high cost of evaluating tasks one by one, and is shown in theory to approximate the true ranking of task difficulty.
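
As a rough illustration of "predict instead of evaluate," the sketch below keeps a small posterior over adaptation risk and updates it online from risks that training produces anyway. It is a deliberately simplified stand-in, not the generative model described above: it assumes each task is described by one scalar parameter (for example a friction coefficient), bins that parameter, and applies a conjugate Normal update per bin. The class name, the binning, and all numbers are hypothetical.

```python
class BinnedRiskPredictor:
    """Gaussian posterior over mean adaptation risk, one per task-parameter bin.

    Hypothetical stand-in for a risk-predictive model; the binning and the
    known-noise Normal model are assumptions made for this sketch.
    """

    def __init__(self, low, high, n_bins=10, prior_mean=0.5, prior_var=1.0, noise_var=0.05):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.noise_var = noise_var
        self.mean = [prior_mean] * n_bins
        self.var = [prior_var] * n_bins

    def _bin(self, task_param):
        frac = (task_param - self.low) / (self.high - self.low)
        return min(self.n_bins - 1, max(0, int(frac * self.n_bins)))

    def update(self, task_param, observed_risk):
        """Conjugate Normal update with known observation noise."""
        b = self._bin(task_param)
        precision = 1.0 / self.var[b] + 1.0 / self.noise_var
        self.mean[b] = (self.mean[b] / self.var[b] + observed_risk / self.noise_var) / precision
        self.var[b] = 1.0 / precision

    def predict(self, task_param):
        """Return (predicted risk, posterior variance) without evaluating the task."""
        b = self._bin(task_param)
        return self.mean[b], self.var[b]


rpm = BinnedRiskPredictor(low=0.0, high=1.0)
# Risks observed as a by-product of ordinary training steps (made-up values).
for param, risk in [(0.05, 0.10), (0.07, 0.12), (0.93, 0.85), (0.96, 0.90)]:
    rpm.update(param, risk)

for candidate in (0.06, 0.50, 0.95):
    mu, var = rpm.predict(candidate)
    print(f"task param {candidate:.2f}: predicted risk {mu:.2f}, variance {var:.3f}")
```

The candidate in an unvisited bin keeps its prior variance, which is the kind of uncertainty signal a sampling criterion can use when deciding where to spend real interactions.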

It is worth emphasizing that MPTS provides a predictable difficulty signal, not a fixed sampling strategy — "which difficulty level to pick" is left entirely to the higher-level priority sampling criterion, so it can plug in to serve many objectives. In robustness-oriented settings, one can plug in a risk-sensitive robust-optimization criterion (such as CVaR) to bias sampling toward high-risk tail tasks; in acceleration-oriented settings, one can plug in a curriculum-learning-style criterion that organizes prompts by "medium difficulty first" — prompts that are too easy or too hard produce almost no useful gradient, while problems with intermediate success rates carry the most information. The same predictive core, paired with different acquisition criteria, can serve the two distinct goals of "more robust" and "faster."
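
The separation between the predicted difficulty signal and the acquisition criterion can be sketched as two interchangeable selection functions over the same predictions. This is a hedged illustration: keeping the highest-risk candidates is used here as a simple empirical way to focus on the tail that a CVaR-style objective cares about, and the function names, prompt IDs, and predicted values are all hypothetical.

```python
def select_risk_tail(predicted_risk, batch_size):
    """Robustness-oriented criterion: keep the candidates with the highest predicted risk."""
    ranked = sorted(predicted_risk, key=predicted_risk.get, reverse=True)
    return ranked[:batch_size]


def select_medium_difficulty(predicted_success, batch_size, target=0.5):
    """Acceleration-oriented criterion: keep prompts whose predicted success rate is nearest the target."""
    ranked = sorted(predicted_success, key=lambda p: abs(predicted_success[p] - target))
    return ranked[:batch_size]


# Hypothetical predictions for six candidates (no rollouts were run to get them).
predicted_success = {"p0": 0.98, "p1": 0.55, "p2": 0.02, "p3": 0.40, "p4": 0.75, "p5": 0.10}
predicted_risk = {name: 1.0 - s for name, s in predicted_success.items()}

print("tail-focused batch:  ", select_risk_tail(predicted_risk, batch_size=2))
print("medium-difficulty batch:", select_medium_difficulty(predicted_success, batch_size=2))
```

Both functions consume the same predictions, so switching between "more robust" and "faster" only changes which selector is called, not the predictor behind it.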

The whole framework seamlessly supports zero-shot, few-shot, and fast fine-tuning, among other settings, and applies equally to pretraining and meta-learning of foundation models, as well as domain randomization and meta-reinforcement-learning for robot policies.

Experimental results show that MPTS performs better than existing methods on both LLM fine-tuning and embodied / sequential decision-making tasks: adaptation robustness on high-risk tail tasks and OOD tasks improves, and learning efficiency improves as well — achieving comparable or better generalization with fewer samples and interactions.

The research roots of the Bayesian brain run deep, at the intersection of machine learning, experimental psychology, and Bayesian statistics. As early as the 1860s, Hermann Helmholtz's work in experimental psychology modeled, in terms of probabilistic estimation, the brain's ability to extract perceptual information from sensory data. Its basic idea remains the kernel of the whole thread to this day: the nervous system needs to organize messy perceptual data into an approximate internal model of the external world, and use it to guide efficient and robust decisions.

1860s · Helmholtz: Perception as unconscious probabilistic inference. The brain extracting information from sensory data is in essence probabilistic estimation, organizing perception into an approximate internal model of the external world.

1990s · Hinton & Friston: Free energy as a computable measure of discrepancy. Free energy is introduced to measure the discrepancy between the true features of the world and the representations captured by a neural-network model.

Friston's synthesis: Both perception and action minimize free energy. Action and perception are both seen as suppressing free energy, leading respectively to perceptual inference and active inference; variational Bayesian methods characterize the mechanisms of predictive coding and Bayesian filtering.

This work · MPTS: Generative modeling + simulated prediction + active selection. The framework is applied to LLM fine-tuning and adaptive RL for embodied agents, yielding model-predictive task sampling.

In the 1990s, researchers such as Geoffrey Hinton and Karl Friston began to bring free energy into this picture — treating it as a computable measure of the discrepancy between the true features of the world and the feature representations captured by a neural-network model. Friston's later synthesis unified the workings of the Bayesian brain under the general principle of free-energy minimization: formally, action and perception are both viewed as suppressing free energy, thereby leading respectively to perceptual inference and active inference, and pushing the Bayesian brain toward a more embodied, more behavioral understanding. With variational Bayes, one can characterize how the brain continually updates its internal model of the world using sensory information, so as to minimize the discrepancy between sensory input and its prediction — which, from a neurobiological standpoint, can be understood as predictive coding, or more generally, Bayesian filtering.

Inspired precisely by this idea of the Bayesian brain, this work applies the framework of "generative modeling, simulated prediction, active selection" to LLM fine-tuning and adaptive RL for embodied agents, and proposes MPTS. It has two main contributions: first, it shows that the optimization process during learning can be coarsely predicted via generative modeling — task difficulty need not be measured only expensively after the fact, but can also be estimated in advance; second, it provides a plug-and-play, sample-adaptive way to enhance robustness, alleviating efficiency bottlenecks such as memory and compute, and bringing training speedup in some scenarios.

A Paradigm: From "Evaluating the World" to "Predicting the World"

Bringing this intellectual history down into the training loop of machine learning, one finds that today's mainstream practice is precisely "non-Bayesian." Uniform sampling is a purely passive stance: the system holds no belief about "which experiences are more valuable," and just updates on a random batch each round — the result being that mastered easy samples consume compute repeatedly, while critical tail and out-of-distribution tasks are chronically undersampled. Priority sampling, while pointing in the right direction, falls into another trap: to rank difficulty, one must first evaluate every task. This amounts to "examining everywhere carefully just to decide where to look," running counter to the brain's principle of energy efficiency.

MPTS can be understood as bringing the Helmholtz–Friston thread into an engineering implementation: giving the learning loop itself the flavor of active inference. Its four stages roughly correspond to the relevant concepts of the Bayesian brain —

Internal generative model → Generative belief: Holding beliefs about the learning process. A lightweight generative model characterizes the "distribution of adaptation risk over task space," giving not just point estimates but also quantified uncertainty.

Predictive coding → Amortized inference: Prediction replaces expensive evaluation. A risk-predictive model amortizes away the cost of per-task evaluation, predicting difficulty directly via posterior inference, with no need for one-by-one trial and error.

Active inference → Sampling as action: Selection driven by inference. "Which batch of tasks to learn from" becomes an active decision, aimed at minimizing expected adaptation risk and approaching the robust objective CVaR.

Free-energy minimization → Error-driven update: The loop closes here. True risk signals flow back, updating the model's beliefs via streaming variational inference, the machine-learning version of prediction error correcting the generative model.
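
The four stages above can be read as one closed loop. The toy simulation below sketches that loop for the prompt-selection case under strong simplifications that are assumptions of this example, not details from the article: each prompt's belief is a Beta posterior over its success rate, prediction is a posterior draw, selection prefers predicted success near 50%, and the "model" is frozen so true success rates never change. The function name and all numbers are hypothetical.

```python
import random


def run_loop(true_success, steps=200, batch_size=4, rollouts=8, seed=0):
    """Simulate belief -> prediction -> selection -> update with Beta posteriors."""
    rng = random.Random(seed)
    # Stage 1, belief: Beta(alpha, beta) posterior over each prompt's success rate.
    alpha = {p: 1.0 for p in true_success}
    beta = {p: 1.0 for p in true_success}
    picks = {p: 0 for p in true_success}

    for _ in range(steps):
        # Stage 2, prediction: draw a plausible success rate per prompt; no rollouts are run.
        guess = {p: rng.betavariate(alpha[p], beta[p]) for p in true_success}
        # Stage 3, action: select the prompts predicted to be nearest 50% success.
        batch = sorted(guess, key=lambda p: abs(guess[p] - 0.5))[:batch_size]
        for p in batch:
            # Only the selected prompts pay for real rollouts (simulated here).
            wins = sum(rng.random() < true_success[p] for _ in range(rollouts))
            # Stage 4, update: observed outcomes correct the belief.
            alpha[p] += wins
            beta[p] += rollouts - wins
            picks[p] += 1
    return picks, alpha, beta


# Twelve hypothetical prompts: four easy, four medium, four hard.
true_success = {f"easy{i}": 0.97 for i in range(4)}
true_success.update({f"medium{i}": 0.50 for i in range(4)})
true_success.update({f"hard{i}": 0.03 for i in range(4)})

picks, alpha, beta = run_loop(true_success)
for group in ("easy", "medium", "hard"):
    total = sum(n for name, n in picks.items() if name.startswith(group))
    print(f"{group:6s} prompts selected {total} times")
```

In this toy setting, rollouts are spent only on the selected batch at each step rather than on every candidate, and the beliefs that drive selection are refreshed by the same rollouts that training needs anyway.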

A Systematic Body of Work: One Thread, Breakthroughs in Two Domains

The significance of MPTS lies not in a single method, but in the fact that it corresponds to a fairly general problem: in an iterative optimization process, how to efficiently select the samples "worth interacting with, worth learning from"? Because this problem is fairly universal, the work built around it is not isolated, but shares a single theoretical kernel while landing in different domains. Across two domains, Xiangyang Ji's group has published related results, one after another, at top journals and conferences such as Nature Communications, T-PAMI, ICML, KDD, and ICLR.

Domain one: robust fast adaptation of embodied agents in randomized environments. With MPTS, published in Nature Communications, as its theoretical foundation, this domain brings posterior prediction, batch diversity, and adversarially explicit task-distribution generation into sequential decision-making, allowing an agent to adapt robustly and quickly to new environments even when physical interaction is expensive and risky.

Domain two: making RL post-training and supervised fine-tuning of reasoning LLMs more efficient. This domain brings the idea of "prediction instead of evaluation" into the full RLVR and SFT pipeline, using a lightweight predictive model to estimate prompt difficulty online and save a large number of uninformative rollouts. It is represented by MoPPS — one of the earliest predictive sampling algorithms to achieve RLVR training acceleration.

Viewing the two domains together with the journal work that serves as the theoretical foundation reveals a fairly clear structure: at the base is the unified principle of model predictive sampling, with the shared kernel of "replacing expensive evaluation with prediction," and above it two branches — robust adaptation in embodied decision-making and efficient post-training of reasoning LLMs — linked by the same idea, from environment interaction to prompt interaction, from robustness to acceleration. This way of organizing many scenarios under one principle makes the work more of a coherent system than a set of independent attempts.

Attention from Academia and Industry

The influence of this research thread already extends beyond the group's own papers. In academia, the series has received positive feedback from Turing laureates Yann LeCun and Yoshua Bengio, and has been cited by dozens of ACM / IEEE Fellows in their own research — a level of attention worth noting for a method focused on "how the training loop selects data."

In industry, the methods have also seen some adoption. Among them, MoPPS, as one of the earlier predictive sampling algorithms to achieve RLVR training acceleration, has been adopted as one of the baselines by groups at Meta, Apple, Alibaba (Qwen, Roll), and Tencent Hunyuan. That a sampling algorithm from academia has been incorporated into companies' training pipelines suggests this approach is usable in engineering practice.

Significance: Training Efficiency Is Becoming Increasingly Important

As the cost of training large models rises and the compute consumption of RL fine-tuning for reasoning models grows, the value of "which tasks or interaction scenarios to choose" rises accordingly.

What MPTS and its line of work provide is not a trick for a single scenario, but a theoretically grounded active-sampling approach that transfers across scenarios: replacing expensive evaluation with a lightweight predictive model, quantifying uncertainty via posterior inference, and then approaching a preset adaptive optimization objective with a configurable acquisition criterion (biased toward high-risk tails in robust settings, and toward medium difficulty in RLVR-acceleration settings). That the same kernel applies across embodied decision-making and LLM post-training reflects that it targets the shared part of these problems.

This holds for both kinds of applications — whether fine-tuning reasoning models like DeepSeek / Qwen, or training robot manipulation policies in simulated or real environments, the core problem structure is similar: evaluation is costly, yet difficulty information is indispensable.

From a more general standpoint, as training costs draw more attention, having a system hold an updatable judgment of "which experiences are more worth computing on," and actively choose what to interact with accordingly, is a direction worth continuing to explore. This is also consistent with the basic idea of the Bayesian brain: replacing exhaustive enumeration with prediction, and correcting beliefs with observation — one can understand the learning process itself as a form of active inference. This line of thinking traces back to Helmholtz's conception of perception as probabilistic inference.

Related Work from the Group

An overview of the work from Prof. Xiangyang Ji's group covered in this article (grouped into theoretical foundation, embodied decision-making, and LLM post-training).

NC Model Predictive Task Sampling for Efficient and Robust Adaptation (arXiv:2501.11039, accepted at Nature Communications). The theoretical-foundation work.
ICML-25 Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments.
KDD-25 Robust Fast Adaptation from Adversarially Explicit Task Distribution Generation.
KDD-26 Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? (MoPPS, arXiv:2507.04632).
ICLR-26 Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models.
ICML-26 Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models.
ICML-26 Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning.

Acknowledgments

We thank the members of our group, mainly Cheems Wang, Yun Qu, Yixiu Mao, Heming Zou, Lizhou Cai, and Yuhang Jiang, for their contributions over the past two years in pursuing enactive-AI techniques on the path to AGI.
