Arxiv Papers of Today

Generated: 2025-11-14 16:29:55 (UTC+8); arXiv announcement time: 2025-11-14 20:00 EST (2025-11-15 09:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey

在从交互中学习的时代扩展 LLM 代理的环境：一项调查

Authors: Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, Yi R. Fung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09586
Pdf link: https://arxiv.org/pdf/2511.09586
Abstract LLM-based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long-term decision-making, training on static datasets built from human-level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents' actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment-centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze benchmarks, implementation strategies, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.
中文摘要 基于 LLM 的代理可以自主完成跨各个领域的复杂任务。然而，要进一步培养适应性行为和长期决策等能力，在基于人类水平知识构建的静态数据集上进行训练是不够的。这些数据集的构建成本高昂，并且缺乏活力和真实感。越来越多的共识是，代理应该直接与环境交互，并通过强化学习从经验中学习。我们将这个迭代过程形式化为生成-执行-反馈（GEF）循环，其中环境生成任务来挑战代理，在任务执行期间响应代理的操作返回观察结果，并为后续学习提供有关推出的评估反馈。在这种范式下，环境充当体验数据不可或缺的生产者，这凸显了将它们扩展到更高的复杂性、真实性和交互性的必要性。在本次调查中，我们从开创性的以环境为中心的角度系统地回顾了环境扩展的代表性方法，并沿着 GEF 循环的阶段（即任务生成、任务执行和反馈）进行组织。我们进一步分析基准、实施策略和应用，整合零散的进展，并概述智能体智能的未来研究方向。
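
The Generation-Execution-Feedback (GEF) loop named in the abstract above can be illustrated with a toy example. This is only a sketch under stated assumptions: `ToyEnvironment`, the number-guessing task, and the binary-search agent are all hypothetical stand-ins, not anything taken from the survey, and no learning step is included.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    target: int  # hidden number the agent must find (hypothetical task)


class ToyEnvironment:
    """Hypothetical environment exposing the three GEF stages."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def generate_task(self, difficulty):
        # Generation: a larger `difficulty` widens the search range.
        return Task(target=self.rng.randint(0, difficulty))

    def step(self, task, guess):
        # Execution: return an observation in response to the agent's action.
        if guess == task.target:
            return "correct"
        return "higher" if guess < task.target else "lower"

    def feedback(self, trajectory):
        # Feedback: score the whole rollout; shorter successful rollouts score higher.
        solved = trajectory[-1][1] == "correct"
        return 1.0 / len(trajectory) if solved else 0.0


def binary_search_agent(env, task, difficulty, max_steps=32):
    # A fixed (non-learning) policy that acts on the environment's observations.
    lo, hi = 0, difficulty
    trajectory = []
    for _ in range(max_steps):
        guess = (lo + hi) // 2
        obs = env.step(task, guess)
        trajectory.append((guess, obs))
        if obs == "correct":
            break
        if obs == "higher":
            lo = guess + 1
        else:
            hi = guess - 1
    return trajectory


def run_gef_loop(rounds=3, difficulty=15):
    env = ToyEnvironment()
    rewards = []
    for _ in range(rounds):
        task = env.generate_task(difficulty)                   # generation
        rollout = binary_search_agent(env, task, difficulty)   # execution
        rewards.append(env.feedback(rollout))                  # feedback
    return rewards
```

In an actual training setup the rewards returned by `run_gef_loop` would feed a policy update; here they are only collected.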

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

MMaDA-Parallel：用于思维感知编辑和生成的多模态大型扩散语言模型

Authors: Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.09611
Pdf link: https://arxiv.org/pdf/2511.09611
Abstract While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at this https URL
中文摘要 虽然思维感知生成旨在提高复杂任务的性能，但我们确定了一种关键故障模式，即现有的顺序自回归方法可能会由于错误传播而自相矛盾地降低性能。为了系统地分析这个问题，我们提出了 ParaBench，这是一个旨在评估文本和图像输出模式的新基准。我们使用 ParaBench 的分析表明，这种性能下降与生成的推理与最终图像之间的对齐不良密切相关。为了解决这个问题，我们提出了一种并行多模态扩散框架 MMaDA-Parallel，它可以在整个去噪轨迹中实现文本和图像之间的连续双向交互。MMaDA-Parallel 通过监督微调进行训练，然后通过并行强化学习（ParaRL）进一步优化，这是一种沿轨迹应用语义奖励以强制执行跨模态一致性的新颖策略。实验验证了我们的模型显着改善了跨模态对齐和语义一致性，与最先进的模型 Bagel 相比，ParaBench 上的输出对齐提高了 6.9%，为思维感知图像合成建立了更强大的范式。我们的代码是开源的，位于此 https URL

Optimistic Reinforcement Learning with Quantile Objectives

具有分位数目标的乐观强化学习

Authors: Mohammad Alipour-Vaezi, Huaiyang Zhong, Kwok-Leung Tsui, Sajad Khodadadian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09652
Pdf link: https://arxiv.org/pdf/2511.09652
Abstract Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal O\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and $H$ horizons. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
中文摘要 近年来，强化学习（RL）取得了巨大的成功。然而，RL 的经典基础没有考虑目标函数的风险敏感性，而目标函数在包括医疗保健和金融在内的各个领域都至关重要。纳入风险敏感性的一种流行方法是优化累积奖励分布的特定分位数。在本文中，我们开发了 UCB-QRL，这是一种针对有限视界马尔可夫决策过程（MDP）中 $\tau$ 分位数目标的乐观学习算法。UCB-QRL 是一种迭代算法，在每次迭代中，我们首先估计潜在的转换概率，然后围绕该估计值的置信球优化分位数值函数。我们表明，UCB-QRL 在具有 $S$ 状态、$A$ 动作、$T$ 情节和 $H$ 视界的情景设置中产生高概率遗憾界 $\mathcal O\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$。在这里，$\kappa>0$ 是一个与问题相关的常量，它捕获了底层 MDP 分位数值的灵敏度。
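
For readers unfamiliar with the $\tau$-quantile objective mentioned in the abstract above, the standard textbook definition of a quantile is written out below. The symbols $Z^\pi$ (cumulative reward under policy $\pi$) and $r_h$ (reward at step $h$) are notation assumed here for illustration; the paper's own formulation may differ in detail.

```latex
\begin{align}
  q_\tau(Z) &= \inf\{\, z \in \mathbb{R} : \Pr(Z \le z) \ge \tau \,\}, \qquad \tau \in (0,1), \\
  Z^\pi &= \sum_{h=1}^{H} r_h, \qquad
  \pi^\star \in \arg\max_{\pi}\; q_\tau\!\left(Z^\pi\right).
\end{align}
```

Read against the bound quoted in the abstract, the regret grows as $\sqrt{T}$ up to a logarithmic factor in the number of episodes, while the factor $(2/\kappa)^{H+1}$ is exponential in the horizon $H$.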

SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning

SEBA：对视觉强化学习的采样高效黑盒攻击

Authors: Tairan Huang, Yulin Jin, Junxu Liu, Qingqing Ye, Haibo Hu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09681
Pdf link: https://arxiv.org/pdf/2511.09681
Abstract Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black-box attacks focus on vector-based or discrete-action RL, and their effectiveness on image-based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real-world queries. Through a two-stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black-box and white-box methods.
中文摘要 视觉强化学习在视觉控制和机器人技术方面取得了显着进展，但其对对抗性扰动的脆弱性仍然没有得到充分探索。现有的黑盒攻击大多集中在基于向量或离散动作的RL，其在基于图像的连续控制上的有效性受到较大的动作空间和过多的环境查询的限制。我们提出了SEBA，这是一个样本高效的框架，用于对视觉RL代理进行黑盒对抗性攻击。SEBA 集成了在对抗条件下估计累积奖励的影子 Q 模型、产生视觉上难以察觉的扰动的生成对抗网络以及模拟环境动态以减少现实世界查询的世界模型。通过学习影子模型和完善生成器的两阶段迭代训练过程，SEBA 在保持效率的同时实现了强大的攻击性能。MuJoCo 和 Atari 基准测试的实验表明，与之前的黑盒和白盒方法相比，SEBA 显着减少了累积奖励，保留了视觉保真度，并大大减少了环境交互。

ConstrainedSQL: Training LLMs for Text2SQL via Constrained Reinforcement Learning

ConstrainedSQL：通过约束强化学习训练 Text2SQL 的 LLM

Authors: Weiqin Chen, Nhan Huu Pham, Michael Robert Glass, Long Hai Vu, Gaetano Rossiello, Dharmashankar Subramanian, Santiago Paternain
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09693
Pdf link: https://arxiv.org/pdf/2511.09693
Abstract Reinforcement learning (RL) has demonstrated significant promise in enhancing the reasoning capabilities of Text2SQL LLMs, especially with advanced algorithms such as GRPO and DAPO. However, the performance of these methods is highly sensitive to the design of reward functions. Inappropriate rewards can lead to reward hacking, where models exploit loopholes in the reward structure to achieve high scores without genuinely solving the task. This work considers a constrained RL framework for Text2SQL that incorporates natural and interpretable reward and constraint signals, while dynamically balancing trade-offs among them during the training. We establish the theoretical guarantees of our constrained RL framework and our numerical experiments on the well-known Text2SQL datasets substantiate the improvement of our approach over the state-of-the-art RL-trained LLMs.
中文摘要 强化学习（RL）在增强 Text2SQL LLM 的推理能力方面显示出巨大的前景，特别是使用 GRPO 和 DAPO 等先进算法。然而，这些方法的性能对奖励函数的设计高度敏感。不适当的奖励可能会导致奖励黑客攻击，即模型利用奖励结构中的漏洞来获得高分，而没有真正解决任务。这项工作考虑了 Text2SQL 的受约束 RL 框架，该框架包含自然且可解释的奖励和约束信号，同时在训练过程中动态平衡它们之间的权衡。我们建立了约束 RL 框架的理论保证，我们在著名的 Text2SQL 数据集上的数值实验证实了我们的方法相对于最先进的 RL 训练的 LLM 的改进。
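
The abstract above does not spell out its constrained RL formulation. As a generic point of reference only, the usual Lagrangian form of a constrained policy optimization problem is sketched below; $J_r$, $J_{c_i}$, the thresholds $b_i$, the multipliers $\lambda_i$, and the step size $\eta$ are assumed notation, and the projected dual update is the standard textbook scheme rather than the authors' stated method.

```latex
\begin{align}
  &\max_{\theta}\; J_r(\theta)
    \quad \text{s.t.} \quad J_{c_i}(\theta) \ge b_i, \qquad i = 1,\dots,m, \\
  &\mathcal{L}(\theta,\lambda) = J_r(\theta)
    + \sum_{i=1}^{m} \lambda_i \bigl(J_{c_i}(\theta) - b_i\bigr), \qquad \lambda_i \ge 0, \\
  &\lambda_i \leftarrow \bigl[\lambda_i - \eta\,\bigl(J_{c_i}(\theta) - b_i\bigr)\bigr]_{+}.
\end{align}
```

Under this scheme a multiplier grows while its constraint is violated ($J_{c_i}(\theta) < b_i$) and shrinks toward zero once it is satisfied, which is one way a trade-off between reward and constraint signals can be balanced dynamically during training.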

Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard

小索菲亚：通过自我触摸和手部关注进行自我探索的发展方法

Authors: Stelios Zarifis, Ioannis Chalkiadakis, Artemis Chardouveli, Vasiliki Moutzouri, Aggelos Sotirchos, Katerina Papadimitriou, Panagiotis Filntisis, Niki Efthymiou, Petros Maragos, Katerina Pastra
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09727
Pdf link: https://arxiv.org/pdf/2511.09727
Abstract Inspired by infant development, we propose a Reinforcement Learning (RL) framework for autonomous self-exploration in a robotic agent, Baby Sophia, using the BabyBench simulation environment. The agent learns self-touch and hand regard behaviors through intrinsic rewards that mimic an infant's curiosity-driven exploration of its own body. For self-touch, high-dimensional tactile inputs are transformed into compact, meaningful representations, enabling efficient learning. The agent then discovers new tactile contacts through intrinsic rewards and curriculum learning that encourage broad body coverage, balance, and generalization. For hand regard, visual features of the hands, such as skin-color and shape, are learned through motor babbling. Then, intrinsic rewards encourage the agent to perform novel hand motions, and follow its hands with its gaze. A curriculum learning setup from single-hand to dual-hand training allows the agent to reach complex visual-motor coordination. The results of this work demonstrate that purely curiosity-based signals, with no external supervision, can drive coordinated multimodal learning, imitating an infant's progression from random motor babbling to purposeful behaviors.
中文摘要 受婴儿发育的启发，我们提出了一种强化学习（RL）框架，用于使用 BabyBench 模拟环境在机器人代理 Baby Sophia 中进行自主自我探索。智能体通过模仿婴儿好奇心驱动的对自己身体的探索的内在奖励来学习自我触摸和手视行为。对于自触摸，高维触觉输入被转换为紧凑、有意义的表示，从而实现高效学习。然后，智能体通过内在奖励和课程学习发现新的触觉接触，鼓励广泛的身体覆盖、平衡和概括。对于手部观察，手的视觉特征，例如肤色和形状，是通过运动牙牙学语来学习的。然后，内在奖励鼓励智能体执行新颖的手部动作，并用目光跟随它的手。从单手到双手训练的课程学习设置使智能体能够达到复杂的视觉-运动协调能力。这项工作的结果表明，纯粹基于好奇心的信号，在没有外部监督的情况下，可以驱动协调的多模态学习，模仿婴儿从随机运动牙牙学语到有目的的行为的进展。

Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy

使用 SPARC 进行分布外泛化：使用单一策略赛跑 100 辆看不见的车辆

Authors: Bram Grooten, Patrick MacAlpine, Kaushik Subramanian, Peter Stone, Peter R. Wurman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09737
Pdf link: https://arxiv.org/pdf/2511.09737
Abstract Generalization to unseen environments is a significant challenge in the field of robotics and control. In this work, we focus on contextual reinforcement learning, where agents act within environments with varying contexts, such as self-driving cars or quadrupedal robots that need to operate in different terrains or weather conditions than they were trained for. We tackle the critical task of generalizing to out-of-distribution (OOD) settings, without access to explicit context information at test time. Recent work has addressed this problem by training a context encoder and a history adaptation module in separate stages. While promising, this two-phase approach is cumbersome to implement and train. We simplify the methodology and introduce SPARC: single-phase adaptation for robust control. We test SPARC on varying contexts within the high-fidelity racing simulator Gran Turismo 7 and wind-perturbed MuJoCo environments, and find that it achieves reliable and robust OOD generalization.
中文摘要 推广到看不见的环境是机器人和控制领域的一项重大挑战。在这项工作中，我们专注于情境强化学习，其中智能体在具有不同环境的环境中行动，例如自动驾驶汽车或四足机器人，它们需要在与训练不同的地形或天气条件下运行。我们处理了推广到分布外（OOD）设置的关键任务，而无需在测试时访问显式上下文信息。最近的工作通过在不同的阶段训练上下文编码器和历史适应模块来解决这个问题。虽然前景广阔，但这种两阶段方法的实施和培训很麻烦。我们简化了方法并引入了 SPARC：用于稳健控制的单相适应。我们在高保真赛车模拟器《跑车浪漫旅 7》和风扰动的 MuJoCo 环境中的不同环境中测试了 SPARC，发现它实现了可靠且稳健的 OOD 泛化。

Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

向小偷致敬：探索去中心化 GRPO 中的攻击和防御

Authors: Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09780
Pdf link: https://arxiv.org/pdf/2511.09780
Abstract Group Relative Policy Optimization (GRPO) has demonstrated great utilization in post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and, through reinforcement learning, preferred completions are learnt. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and then exchanged in the forms of strings. In this work, we present the first adversarial attack in decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens in benign models in both out-of-context and in-context attacks. Using empirical examples of math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training, achieving attack success rates up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.
中文摘要 组相对策略优化（GRPO）在大型语言模型（LLM）的后训练中表现出了很大的利用率。在 GRPO 中，模型回答提示，并通过强化学习学习首选完成。由于通信量小，GRPO 本质上适合分散训练，因为提示可以由多个节点同时回答，然后以字符串的形式交换。在这项工作中，我们展示了去中心化 GRPO 中的第一个对抗性攻击。我们证明，恶意方可以通过在上下文外和上下文攻击中在良性模型中注入任意恶意令牌来毒害此类系统。通过数学和编码任务的实证示例，我们表明对抗性攻击很容易毒害良性节点，污染其本地 LLM 训练后，在短短 50 次迭代中实现高达 100% 的攻击成功率。我们提出了两种防御这些攻击的方法，具体取决于所有用户是训练相同的模型还是不同的模型。我们表明，这些防御可以实现高达 100% 的阻止率，使攻击变得不可能。
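
As background for the GRPO setting in the abstract above, the group-relative advantage that gives the method its name can be sketched in a few lines. This is a minimal illustration assuming scalar rewards for one prompt's group of completions; the function name, the `eps` constant, and the choice of population standard deviation are assumptions made here, not details from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own group.

    `rewards` holds one scalar reward per completion sampled for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Example: two rewarded and two unrewarded completions in one group.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the signal depends only on completions and their rewards, any completion that scores well within its group receives a positive advantage regardless of which node produced it. That is plausibly why exchanged completions are a natural attack surface in the decentralised setting, though the attacks themselves are specified only in the paper.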

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

超越单调性：重新审视多智能体 Q 学习中的因式分解原理

Authors: Tianmeng Hu, Yongzheng Cui, Rui Tang, Biao Luo, Ke Li
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.09792
Pdf link: https://arxiv.org/pdf/2511.09792
Abstract Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.
中文摘要 值分解是多智能体强化学习（MARL）的核心方法，通过将全局价值函数分解为局部值，实现集中训练和分散执行。为了确保个体-全局-最大值（IGM）的一致性，现有方法要么强制执行单调性约束，从而限制表达能力，要么以算法复杂性为代价采用更软的代理。在这项工作中，我们提出了非单调值分解的动力系统分析，将学习动态建模为连续时间梯度流。我们证明，在近似贪婪的探索下，所有违反IGM一致性的零损失均衡都是不稳定的鞍点，而只有IGM一致的解才是学习动力学的稳定吸引子。对合成矩阵博弈和具有挑战性的 MARL 基准的广泛实验表明，无约束、非单调分解可以可靠地恢复 IGM 最优解，并且始终优于单调基线。此外，我们还研究了时间差异目标和探索策略的影响，为设计基于未来价值的 MARL 算法提供了可操作的见解。
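
The individual-global-max (IGM) condition referred to in the abstract above says that each agent's greedy local action, taken together, must equal the greedy joint action. A small check for a two-agent matrix game is sketched below; the payoff matrix and the local value vectors are made-up illustrative numbers, and ties are resolved by taking the first maximum.

```python
def greedy_joint_action(q_tot):
    """Return the (row, column) index of the largest entry of the joint value table."""
    cells = ((i, j) for i in range(len(q_tot)) for j in range(len(q_tot[0])))
    return max(cells, key=lambda ij: q_tot[ij[0]][ij[1]])


def igm_consistent(q_tot, q1, q2):
    """Check whether the per-agent greedy actions match the greedy joint action."""
    local = (
        max(range(len(q1)), key=q1.__getitem__),
        max(range(len(q2)), key=q2.__getitem__),
    )
    return local == greedy_joint_action(q_tot)


# Hypothetical non-monotonic payoff: the best joint action (0, 0) is surrounded
# by heavy penalties for miscoordination.
payoff = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]

consistent = igm_consistent(payoff, [1.0, 0.5, 0.5], [1.0, 0.5, 0.5])
inconsistent = igm_consistent(payoff, [-1.0, 0.2, 0.1], [-1.0, 0.2, 0.1])
```

The second pair of local values prefers the safe action 1 for both agents, so the decentralized greedy choice (1, 1) misses the optimal joint action and IGM is violated.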

Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language Models

面向大型语言模型强化精细化的不确定性引导检查点选择

Authors: Manh Nguyen, Dung Nguyen, Dai Do, Svetha Venkatesh, Hung Le
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09864
Pdf link: https://arxiv.org/pdf/2511.09864
Abstract Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable and exhibits high variance across model checkpoints. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set, while relying on the final checkpoint provides no guarantee of good performance. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls. Our method identifies hard question-answer pairs using per-sample uncertainty and ranks checkpoints by how well they handle these challenging cases. By averaging the rewards of the top-uncertain samples over a short training window, our method produces a stable and discriminative signal without additional forward passes or significant computation overhead. Experiments across three datasets and three LLMs demonstrate that it consistently identifies checkpoints with stronger generalization, outperforming traditional strategies such as relying on training or validation performance. These results highlight that models solving their hardest tasks with low uncertainty are the most reliable overall.
中文摘要 强化学习（RL）微调对于调整大型语言模型（LLM）至关重要，但该过程非常不稳定，并且在模型检查点之间表现出很高的方差。在实践中，选择最佳检查点具有挑战性：在训练期间评估验证集上的检查点计算成本高昂，并且需要良好的验证集，而依赖最终检查点并不能保证良好的性能。我们引入了一种不确定性引导的检查点选择方法（UGCS），以避免这些陷阱。我们的方法使用每个样本的不确定性来识别困难的问答对，并根据检查点处理这些具有挑战性的情况的能力对检查点进行排名。通过在较短的训练窗口内对最高不确定样本的奖励进行平均，我们的方法产生稳定且有区别的信号，而无需额外的前向传递或大量的计算开销。跨三个数据集和三个 LLM 的实验表明，它始终如一地识别具有更强泛化性的检查点，优于依赖训练或验证性能等传统策略。这些结果强调，以低不确定性解决最困难任务的模型总体上是最可靠的。
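
The selection rule summarized in the abstract above (average the rewards of the most uncertain samples over a short training window, then rank checkpoints by that score) can be sketched as follows. The data layout, the `top_fraction` parameter, and the function names are all hypothetical; how uncertainty is measured and how the window is chosen are left to the paper.

```python
def ugcs_score(window_logs, top_fraction=0.25):
    """Average reward on the most uncertain samples across a training window.

    `window_logs` is a list of training steps; each step is a list of
    (uncertainty, reward) pairs already recorded during training.
    """
    step_scores = []
    for step in window_logs:
        ranked = sorted(step, key=lambda pair: pair[0], reverse=True)
        k = max(1, int(len(ranked) * top_fraction))
        hardest = ranked[:k]  # the top-uncertain samples of this step
        step_scores.append(sum(reward for _, reward in hardest) / k)
    return sum(step_scores) / len(step_scores)


def select_checkpoint(logs_by_checkpoint, top_fraction=0.25):
    """Pick the checkpoint whose window scores highest on hard samples."""
    return max(
        logs_by_checkpoint,
        key=lambda name: ugcs_score(logs_by_checkpoint[name], top_fraction),
    )


# Hypothetical logs: ckpt_b solves its most uncertain sample, ckpt_a does not.
logs = {
    "ckpt_a": [[(0.9, 0.0), (0.1, 1.0), (0.2, 1.0), (0.3, 1.0)]],
    "ckpt_b": [[(0.9, 1.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]],
}
best = select_checkpoint(logs)
```

Note that `ckpt_a` has the higher mean reward overall, yet `ckpt_b` is selected because the score looks only at the hardest samples. The score reuses quantities logged during training, which is consistent with the abstract's claim of no additional forward passes.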

In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

代币内理性优化：通过自我反馈实现准确简洁的 LLM 推理

Authors: Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.09865
Pdf link: https://arxiv.org/pdf/2511.09865
Abstract Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.
中文摘要 训练大型语言模型（LLM）进行思维链推理提出了重大挑战：对单个“黄金”基本原理进行监督微调会损害泛化，因为它会惩罚同样有效的替代方案，而具有可验证奖励的强化学习则在学分分配和令人望而却步的计算成本方面遇到困难。为了解决这些限制，我们引入了 InTRO（In-Token Rationality Optimization），这是一个新框架，可以实现代币级探索和自我反馈，以实现准确简洁的推理。InTRO 不是直接优化所有有效推理路径上的棘手目标，而是利用校正因素——根据生成策略与其答案条件对应物之间的信息差异估计的标记重要性权重，以进行信息丰富的下一个标记选择。这种方法允许模型执行代币级探索并在单次前向传递中接收自生成的反馈，最终鼓励准确和简洁的基本原理。在六个数学推理基准中，InTRO 的性能始终优于其他基线，相对于基本模型，解决方案的准确性提高了 20%。它的思想链也明显更加简洁，表现出更少的冗长。除此之外，InTRO 还支持跨域转移，成功适应超出数学领域的域外推理任务，表现出强大的泛化能力。

HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

HierRouter：通过强化学习协调路由专业大型语言模型

Authors: Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09873
Pdf link: https://arxiv.org/pdf/2511.09873
Abstract Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here this https URL Nikunj-Gupta/hierouter.
中文摘要 大型语言模型（LLM）在许多任务中提供最先进的性能，但会带来高昂的计算和内存成本，限制了它们在资源受限或实时环境中的部署。为了解决这个问题，我们提出了 HierRouter，这是一种分层路由方法，可以从专门的轻量级语言模型池中动态组装推理管道。我们的方法被表述为有限视界马尔可夫决策过程（MDP），训练基于近端策略优化（PPO）的强化学习代理，以迭代选择在多跳推理的每个阶段调用的模型。代理根据不断变化的上下文和累积成本进行条件，以做出上下文感知的路由决策。在六个基准测试（包括 QA、代码生成和数学推理）中对三个开源候选 LLM 进行的实验表明，与独立使用单个模型相比，HierRouter 将响应质量提高了 2.4 倍，同时平均仅产生最小的额外推理成本。这些结果凸显了分层路由对经济高效、高性能 LLM 推理的前景。所有代码都可以在这里找到 https URL Nikunj-Gupta/hierouter。

DemoTuner: Efficient DBMS Knobs Tuning via LLM-Assisted Demonstration Reinforcement Learning

DemoTuner：通过 LLM 辅助演示强化学习进行高效的 DBMS 旋钮调整

Authors: Hui Dou, Lei Jin, Yuxuan Zhou, Jiang He, Yiwen Zhang
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2511.09998
Pdf link: https://arxiv.org/pdf/2511.09998
Abstract The performance of modern DBMSs such as MySQL and PostgreSQL heavily depends on the configuration of performance-critical knobs. Manual tuning these knobs is laborious and inefficient due to the complex and high-dimensional nature of the configuration space. Among the automated tuning methods, reinforcement learning (RL)-based methods have recently sought to improve the DBMS knobs tuning process from several different perspectives. However, they still encounter challenges with slow convergence speed during offline training. In this paper, we mainly focus on how to leverage the valuable tuning hints contained in various textual documents such as DBMS manuals and web forums to improve the offline training of RL-based methods. To this end, we propose an efficient DBMS knobs tuning framework named DemoTuner via a novel LLM-assisted demonstration reinforcement learning method. Specifically, to comprehensively and accurately mine tuning hints from documents, we design a structured chain of thought prompt to employ LLMs to conduct a condition-aware tuning hints extraction task. To effectively integrate the mined tuning hints into RL agent training, we propose a hint-aware demonstration reinforcement learning algorithm HA-DDPGfD in DemoTuner. As far as we know, DemoTuner is the first work to introduce the demonstration reinforcement learning algorithm for DBMS knobs tuning. Experimental evaluations conducted on MySQL and PostgreSQL across various workloads demonstrate the significant advantages of DemoTuner in both performance improvement and online tuning cost reduction over three representative baselines including DB-BERT, GPTuner and CDBTune. Additionally, DemoTuner also exhibits superior adaptability to application scenarios with unknown workloads.
中文摘要 MySQL 和 PostgreSQL 等现代 DBMS 的性能在很大程度上取决于性能关键旋钮的配置。由于配置空间的复杂性和高维性质，手动调整这些旋钮既费力又低效。在自动调谐方法中，基于强化学习（RL）的方法最近试图从几个不同的角度改进DBMS旋钮调谐过程。然而，他们在离线训练中仍然遇到收敛速度慢的挑战。在本文中，我们主要关注如何利用DBMS手册和Web论坛等各种文本文档中包含的有价值的调优提示来改进基于RL的方法的离线训练。为此，我们通过一种新颖的LLM辅助演示强化学习方法，提出了一种名为DemoTuner的高效DBMS旋钮调优框架。具体来说，为了全面准确地挖掘文档中的调优提示，我们设计了一个结构化的思维链提示，利用 LLM 来执行条件感知调优提示提取任务。为了有效地将挖掘的调优提示集成到RL代理训练中，我们在DemoTuner中提出了一种提示感知的演示强化学习算法HA-DDPGfD。据我们所知，DemoTuner是第一个引入DBMS旋钮调优的演示强化学习算法的工作。在 MySQL 和 PostgreSQL 上对各种工作负载进行的实验评估表明，与 DB-BERT、GPTuner 和 CDBTune 等三个代表性基线相比，DemoTuner 在性能提升和在线调优成本降低方面具有显着优势。此外，DemoTuner还表现出对未知工作负载的应用场景的卓越适应性。

Reinforcing Trustworthiness in Multimodal Emotional Support Systems

加强多模式情感支持系统的可信度

Authors: Huy M. Le, Dat Tien Nguyen, Ngan T. T. Vo, Tuan D. Q. Nguyen, Nguyen Binh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Lizi Liao, Binh T. Nguyen
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2511.10011
Pdf link: https://arxiv.org/pdf/2511.10011
Abstract In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art on MESC and DFEW datasets while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
中文摘要 在当今世界，情感支持变得越来越重要，但对于寻求帮助的人和提供帮助的人来说，它仍然具有挑战性。多模式情感支持方法通过整合不同的数据源来提供同理心、与上下文相关的响应，从而促进更有效的互动，显示出巨大的前景。然而，目前的方法存在显着的局限性，通常仅依赖文本或将其他数据类型转换为文本，或者仅提供情感识别，从而忽视了多模态输入的全部潜力。此外，许多研究优先考虑反应生成，而没有准确识别关键的情感支持要素或确保输出的可靠性。为了克服这些问题，我们引入了 MultiMood，这是一个新框架，它（i）利用视频、音频和文本的多模态嵌入来预测情绪成分并产生符合专业治疗标准的回应。为了提高可信度，我们（ii）结合新的心理学标准并应用强化学习（RL）来优化大型语言模型（LLM），以始终遵守这些标准。我们还（iii）分析了几个先进的 LLM，以评估它们的多模式情感支持能力。实验结果表明，MultiMood在MESC和DFEW数据集上实现了最先进的水平，而RL驱动的可信度改进通过人类和LLM评估得到了验证，证明了其在该领域应用多模态框架的卓越能力。

Multi-agent In-context Coordination via Decentralized Memory Retrieval

通过分散式内存检索实现多代理上下文协调

Authors: Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan, Lei Yuan, Zongzhang Zhang, Yang Yu, Deheng Ye
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10030
Pdf link: https://arxiv.org/pdf/2511.10030
Abstract Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at this https URL.
中文摘要 在不同数据集上训练的大型 Transformer 模型在以前从未见过的任务中表现出令人印象深刻的 Few Shot 性能，而无需更新参数。这种能力也在强化学习（RL）中得到了探索，智能体与环境交互以检索上下文并最大化累积奖励，在复杂环境中表现出强大的适应性。然而，在协作多智能体强化学习（MARL）中，智能体必须朝着共同目标进行协调，分散的策略部署可能导致任务对齐和奖励分配的不匹配，从而限制策略调整的效率。为了应对这一挑战，我们引入了通过去中心化记忆检索（MAICC）进行多智能体上下文协调，这是一种旨在通过快速适应来增强协调的新方法。我们的方法包括训练一个集中式嵌入模型来捕获细粒度的轨迹表示，然后进行近似中心化模型的去中心化模型，以获取团队级别的任务信息。基于学习到的嵌入，检索相关轨迹作为上下文，结合智能体当前的子轨迹，为决策提供信息。在去中心化执行过程中，我们引入了一种新颖的内存机制，可以有效地平衡测试时的在线数据和离线内存。基于构建的记忆，我们提出了一个混合效用分数，该分数结合了个人和团队层面的回报，确保跨代理的信用分配。对合作 MARL 基准的广泛实验，包括基于水平的觅食（LBF）和 SMAC （v1/v2），表明与现有方法相比，MAICC 能够更快地适应看不见的任务。代码可在此 https URL 中找到。

When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

当眼睛和耳朵不一致时：MLLM 能否辨别视听混乱？

Authors: Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.10059
Pdf link: https://arxiv.org/pdf/2511.10059
Abstract Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound?". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10-30% over the baseline model with limited training data. Follow: this https URL.
中文摘要 多模态大型语言模型（MLLM）能否识别视觉上存在但没有音频的混淆对象？为了研究这一点，我们引入了一个新的基准测试 AV-ConfuseBench，它通过修改视频中物体的相应声音来模拟“视听混乱”场景，例如，将发声对象静音并询问 MLLM 是否有静音对象声音“。实验结果表明，由于视觉主导的推理，MLLM，如Qwen2.5-Omni和Gemini 2.5，很难区分不存在的音频。受此观察的启发，我们引入了 RL-CoMM，这是一种基于强化学习的协作多 MLLM，建立在 Qwen2.5-Omni 基础之上。RL-CoMM 包括两个阶段：1）为了缓解视觉主导的模糊性，我们引入了一个外部模型，即大型音频语言模型（LALM）作为参考模型，以生成纯音频推理。然后，我们设计了一个逐步推理奖励函数，使MLLM能够通过纯音频参考进行自我改进的视听推理。2）为了确保准确的答案预测，我们引入了以答案为中心的置信度优化，以减少潜在异构推理差异的不确定性。广泛的视听问答和视听幻觉实验表明，RL-CoMM在有限的训练数据下，比基线模型提高了10~30%的准确率。关注：这个 https URL。

Tree-Based Stochastic Optimization for Solving Large-Scale Urban Network Security Games

基于树的随机优化求解大规模城市网络安全博弈

Authors: Shuxin Zhuang, Linjian Meng, Shuxin Li, Minming Li, Youzhi Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10072
Pdf link: https://arxiv.org/pdf/2511.10072
Abstract Urban Network Security Games (UNSGs), which model the strategic allocation of limited security resources on city road networks, are critical for urban safety. However, finding a Nash Equilibrium (NE) in large-scale UNSGs is challenging due to their massive and combinatorial action spaces. One common approach to addressing these games is the Policy-Space Response Oracle (PSRO) framework, which requires computing best responses (BR) at each iteration. However, precisely computing exact BRs is impractical in large-scale games, and employing reinforcement learning to approximate BRs inevitably introduces errors, which limits the overall effectiveness of the PSRO methods. Recent advancements in leveraging non-convex stochastic optimization to approximate an NE offer a promising alternative to the burdensome BR computation. However, utilizing existing stochastic optimization techniques with an unbiased loss function for UNSGs remains challenging because the action spaces are too vast to be effectively represented by neural networks. To address these issues, we introduce Tree-based Stochastic Optimization (TSO), a framework that bridges the gap between the stochastic optimization paradigm for NE-finding and the demands of UNSGs. Specifically, we employ the tree-based action representation that maps the whole action space onto a tree structure, addressing the challenge faced by neural networks in representing actions when the action space cannot be enumerated. We then incorporate this representation into the loss function and theoretically demonstrate its equivalence to the unbiased loss function. To further enhance the quality of the converged solution, we introduce a sample-and-prune mechanism that reduces the risk of being trapped in suboptimal local optima. Extensive experimental results indicate the superiority of TSO over other baseline algorithms in addressing the UNSGs.
中文摘要 城市网络安全博弈（UNSG）模拟了城市道路网络上有限安全资源的战略分配，对城市安全至关重要。然而，由于其庞大且组合的动作空间，在大规模 UNSG 中找到纳什均衡（NE）具有挑战性。求解这类博弈的一种常见方法是策略空间响应预言机（PSRO）框架，它要求在每次迭代时计算最佳响应（BR）。然而，在大型博弈中，精确计算 BR 是不切实际的，而采用强化学习来近似 BR 不可避免地会引入误差，这限制了 PSRO 方法的整体有效性。利用非凸随机优化近似 NE 的最新进展为繁琐的 BR 计算提供了一种有前途的替代方案。然而，将现有的带无偏损失函数的随机优化技术用于 UNSG 仍然具有挑战性，因为动作空间太广阔，神经网络无法有效表示。为了解决这些问题，我们引入了基于树的随机优化（TSO），这是一个弥合 NE 求解的随机优化范式与 UNSG 需求之间差距的框架。具体来说，我们采用基于树的动作表示，将整个动作空间映射到树结构上，解决了神经网络在无法枚举动作空间时表示动作所面临的挑战。然后，我们将这种表示纳入损失函数中，并在理论上证明其与无偏损失函数的等效性。为了进一步提高收敛解的质量，我们引入了一种采样和剪枝机制，以降低陷入次优局部最优的风险。大量实验结果表明，TSO 在求解 UNSG 方面优于其他基线算法。

Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

观点：迈向鲁棒机器人学习的统一表达策略优化

Authors: Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10087
Pdf link: https://arxiv.org/pdf/2511.10087
Abstract Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
中文摘要 离线到在线强化学习（O2O-RL）已成为安全高效的机器人策略部署的一种有前途的范式，但面临着两个基本挑战：多模态行为的覆盖范围有限和在线适应过程中的分布偏移。我们提出了 UEPO，这是一个统一的生成框架，其灵感来自大型语言模型的预训练和微调策略。我们的贡献有三方面：（1）多种子动力学感知扩散策略，无需训练多个模型即可有效捕获多种模态；（2）强制实现具有物理意义的策略多样性的动态散度正则化机制；（3）基于扩散的数据增强模块，增强动力学模型的泛化能力。在 D4RL 基准测试中，UEPO 在运动任务上比 Uni-O4 实现了 +5.9% 的绝对改进，在灵巧操作上实现了 +12.4%，表现出很强的泛化性和可扩展性。

Learning-Based Channel Access in Wi-Fi: A Multi-Armed Bandit Approach

Wi-Fi 中基于学习的信道接入：一种多臂老虎机方法

Authors: Miguel Casasnovas, Francesc Wilhelmi, Richard Combes, Maksymilian Wojnar, Katarzyna Kosek-Szott, Szymon Szott, Anders Jonsson, Luis Esteve, Boris Bellalta
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.10143
Pdf link: https://arxiv.org/pdf/2511.10143
Abstract Due to its static protocol design, IEEE 802.11 (aka Wi-Fi) channel access lacks adaptability to address dynamic network conditions, resulting in inefficient spectrum utilization, unnecessary contention, and packet collisions. This paper investigates reinforcement learning (RL) solutions to optimize Wi-Fi's medium access control (MAC). In particular, a multi-armed bandit (MAB) framework is proposed for dynamic channel access (including both the primary channel and channel width) and contention window (CW) adjustment. In this setting, we study relevant learning design principles such as adopting joint or factorial action spaces (handled by a single agent (SA) and multiple agents (MA), respectively) and the importance of incorporating contextual information. Our simulation results show that cooperative MA architectures converge faster than their SA counterparts, as agents operate over smaller action spaces. Another key insight is that contextual MAB algorithms consistently outperform non-contextual ones, highlighting the value of leveraging side information in action selection. Moreover, in multi-player settings, results demonstrate that decentralized learners can achieve implicit coordination, although their greediness may degrade coexisting networks' performance and induce policy-chasing dynamics. Overall, these findings demonstrate that (contextual) MAB-based learning offers a practical and adaptive alternative to static IEEE 802.11 protocols, enabling more efficient and intelligent spectrum utilization.
中文摘要 由于其静态协议设计，IEEE 802.11（又名 Wi-Fi）信道接入缺乏应对动态网络条件的适应性，导致频谱利用效率低下、不必要的争用和数据包冲突。本文研究了用于优化 Wi-Fi 介质访问控制（MAC）的强化学习（RL）解决方案。特别是，提出了一种多臂老虎机（MAB）框架，用于动态信道接入（包括主信道和信道宽度）和争用窗口（CW）调整。在这种设定下，我们研究了相关的学习设计原则，例如采用联合或因子化动作空间（分别由单个智能体（SA）和多个智能体（MA）处理）以及纳入上下文信息的重要性。我们的仿真结果表明，协作 MA 架构比 SA 架构收敛得更快，因为智能体在更小的动作空间上运行。另一个关键见解是，上下文 MAB 算法的性能始终优于非上下文算法，这凸显了在动作选择中利用辅助信息的价值。此外，在多参与者场景中，结果表明，去中心化学习者可以实现隐式协调，尽管它们的贪婪可能会降低共存网络的性能并引发策略追逐动态。总体而言，这些发现表明，基于（上下文）MAB 的学习为静态 IEEE 802.11 协议提供了一种实用且自适应的替代方案，从而实现了更高效、更智能的频谱利用。
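
Illustrative sketch (not from the paper): the bandit formulation in the abstract above can be pictured as a learner choosing among joint (primary channel, channel width, contention window) actions. The snippet assumes a UCB1 learner, a standard algorithm swapped in for illustration since the abstract does not name the algorithms the authors evaluate. The action values and per-action success probabilities are hypothetical.

```python
import math
import random


class UCB1:
    """Standard UCB1 bandit learner (illustrative choice, not the paper's)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Pull every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + math.sqrt(2.0 * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical joint action space: (primary channel, width in MHz, CW).
ACTIONS = [(ch, w, cw) for ch in (36, 40) for w in (20, 40) for cw in (15, 31)]

# Hypothetical probability that a transmission succeeds under each action.
SUCCESS_PROB = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.90]


def run_bandit(success_prob, steps=5000, seed=0):
    """Simulate Bernoulli rewards and return how often each arm was pulled."""
    rng = random.Random(seed)
    agent = UCB1(len(success_prob))
    for _ in range(steps):
        arm = agent.select()
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        agent.update(arm, reward)
    return agent.counts


if __name__ == "__main__":
    counts = run_bandit(SUCCESS_PROB)
    best = max(range(len(counts)), key=counts.__getitem__)
    print("most-pulled action:", ACTIONS[best], "pulls:", counts[best])
```

A factorial (multi-agent) variant would run one such learner per dimension over 2 arms each instead of one learner over 8 joint arms, which is the smaller-action-space effect the abstract attributes to MA architectures.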

Improved Offline Reinforcement Learning via Quantum Metric Encoding

通过量子度量编码改进离线强化学习

Authors: Outongyi Lv, Yewei Yuan, Nana Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2511.10187
Pdf link: https://arxiv.org/pdf/2511.10187
Abstract Reinforcement learning (RL) with limited samples is common in real-world applications. However, offline RL performance under this constraint is often suboptimal. We consider an alternative approach to dealing with limited samples by introducing the Quantum Metric Encoder (QME). In this methodology, instead of applying the RL framework directly on the original states and rewards, we embed the states into a more compact and meaningful representation, where the structure of the encoding is inspired by quantum circuits. For classical data, QME is a classically simulable, trainable unitary embedding and thus serves as a quantum-inspired module, on a classical device. For quantum data in the form of quantum states, QME can be implemented directly on quantum hardware, allowing for training without measurement or re-encoding. We evaluated QME on three datasets, each limited to 100 samples. We use Soft-Actor-Critic (SAC) and Implicit-Q-Learning (IQL), two well-known RL algorithms, to demonstrate the effectiveness of our approach. From the experimental results, we find that training offline RL agents on QME-embedded states with decoded rewards yields significantly better performance than training on the original states and rewards. On average across the three datasets, for maximum reward performance, we achieve a 116.2% improvement for SAC and 117.6% for IQL. We further investigate the $\Delta$-hyperbolicity of our framework, a geometric property of the state space known to be important for the RL training efficacy. The QME-embedded states exhibit low $\Delta$-hyperbolicity, suggesting that the improvement after embedding arises from the modified geometry of the state space induced by QME. Thus, the low $\Delta$-hyperbolicity and the corresponding effectiveness of QME could provide valuable information for developing efficient offline RL methods under limited-sample conditions.
中文摘要 样本有限的强化学习（RL）在实际应用中很常见。然而，在此约束下，离线 RL 性能通常不是最优的。我们考虑通过引入量子计量编码器（QME）来处理有限样本的替代方法。在这种方法中，我们不是将 RL 框架直接应用于原始状态和奖励，而是将状态嵌入到更紧凑和有意义的表示中，其中编码的结构受到量子电路的启发。对于经典数据，QME 是一种经典的可模拟、可训练的酉嵌入，因此在经典设备上充当量子启发模块。对于量子态形式的量子数据，QME 可以直接在量子硬件上实现，无需测量或重新编码即可进行训练。我们在三个数据集上评估了 QME，每个数据集仅限于 100 个样本。我们使用软行为者批评者（SAC）和隐式 Q-学习（IQL）这两种著名的 RL 算法来证明我们方法的有效性。从实验结果中，我们发现，在具有解码奖励的QME嵌入状态上训练离线RL代理的性能明显优于在原始状态和奖励上进行训练。平均而言，在三个数据集中，为了获得最大的奖励性能，我们实现了 SAC 的 116.2% 和 IQL 的 117.6% 的改进。我们进一步研究了框架的 $\Delta$ 双曲性，这是已知对 RL 训练功效很重要的状态空间的几何属性。QME嵌入状态表现出较低的$\Delta$双曲性，表明嵌入后的改善是由QME引起的状态空间的修改几何形状引起的。因此，QME的低$\Delta$双曲性和相应的有效性可以为在有限样本条件下开发高效的离线强化学习方法提供有价值的信息。
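
Illustrative sketch (not from the paper): the hyperbolicity the abstract above refers to can be estimated from pairwise distances. The snippet assumes Gromov's four-point definition, which may differ from the exact (possibly diameter-normalized) quantity the authors report. The function name and the toy distance matrices are hypothetical.

```python
from itertools import combinations


def gromov_delta(dist):
    """Four-point hyperbolicity of a finite metric given as a distance matrix.

    For each quadruple, form the three pairwise distance sums; the
    quadruple's delta is half the gap between the two largest sums.
    """
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        sums = sorted((
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ))
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Four points on a line (a tree metric): delta should be 0.
LINE = [[abs(i - j) for j in range(4)] for i in range(4)]

# Four points on a unit-edge cycle: shortest-path distances around the ring.
CYCLE = [[min(abs(i - j), 4 - abs(i - j)) for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    print(gromov_delta(LINE), gromov_delta(CYCLE))
```

Tree-like metrics give values near zero, so a low value on embedded states indicates a more tree-like geometry.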

Heuristic Transformer: Belief Augmented In-Context Reinforcement Learning

启发式转换器：信念增强上下文强化学习

Authors: Oliver Dippel, Alexei Lisitsa, Bei Peng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10251
Pdf link: https://arxiv.org/pdf/2511.10251
Abstract Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
中文摘要 Transformers 展示了卓越的上下文学习（ICL）功能，支持自然语言处理、计算机视觉和顺序决策的应用。在强化学习中，ICL 将学习重新定义为一个监督问题，无需参数更新即可促进任务适应。在之前利用 Transformer 进行顺序决策的工作的基础上，我们提出了启发式 Transformer （HT），这是一种上下文强化学习（ICRL）方法，它通过信念分布对奖励来增强上下文数据集，以实现更好的决策。使用变分自动编码器（VAE），学习低维随机变量来表示奖励的后验分布，该后验分布与上下文数据集和查询状态一起作为 Transformer 策略的提示。我们评估了 HT 在暗室、迷你世界和 MuJoCo 环境中的性能，表明它在有效性和泛化性方面始终超过可比基线。我们的方法提出了一个有希望的方向，以弥合基于信念的增强和基于转换器的决策之间的差距。

Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search

超越单步更新：使用有限视野搜索对启发式方法进行强化学习

Authors: Gal Hadar, Forest Agostinelli, Shahaf S. Shperberg
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10264
Pdf link: https://arxiv.org/pdf/2511.10264
Abstract Many sequential decision-making problems can be formulated as shortest-path problems, where the objective is to reach a goal state from a given starting state. Heuristic search is a standard approach for solving such problems, relying on a heuristic function to estimate the cost to the goal from any given state. Recent approaches leverage reinforcement learning to learn heuristics by applying deep approximate value iteration. These methods typically rely on single-step Bellman updates, where the heuristic of a state is updated based on its best neighbor and the corresponding edge cost. This work proposes a generalized approach that enhances both state sampling and heuristic updates by performing limited-horizon searches and updating each state's heuristic based on the shortest path to the search frontier, incorporating both edge costs and the heuristic values of frontier states.
中文摘要 许多顺序决策问题可以表述为最短路径问题，其目标是从给定的起始状态达到目标状态。启发式搜索是解决此类问题的标准方法，它依靠启发式函数来估计任何给定状态到目标的成本。最近的方法利用强化学习通过应用深度近似值迭代来学习启发式学习。这些方法通常依赖于单步贝尔曼更新，其中状态的启发式方法根据其最佳邻居和相应的边缘成本进行更新。这项工作提出了一种通用方法，通过执行有限范围搜索并根据搜索边界的最短路径更新每个状态的启发式方法，结合边缘成本和前沿状态的启发式值，增强状态采样和启发式更新。
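
Illustrative sketch (not from the paper): the difference between a single-step Bellman update and a limited-horizon update can be seen on a tiny tabular graph. The snippet assumes a depth-limited backup in place of the authors' search procedure, which the abstract does not specify. The graph, costs, and function name are hypothetical.

```python
def k_step_target(graph, h, state, goal, horizon):
    """Cheapest (path cost + heuristic at the frontier) within `horizon` edges.

    With horizon=1 this is the usual single-step Bellman target:
    min over neighbors of edge cost + h[neighbor].
    """
    if state == goal:
        return 0.0       # reaching the goal costs nothing further
    if horizon == 0:
        return h[state]  # frontier state: fall back to the current heuristic
    return min(
        (cost + k_step_target(graph, h, nxt, goal, horizon - 1)
         for nxt, cost in graph[state]),
        default=float("inf"),  # dead end: no successors
    )


# Hypothetical graph: state -> list of (successor, edge cost).
GRAPH = {
    "A": [("B", 1.0), ("G", 5.0)],
    "B": [("C", 1.0)],
    "C": [("G", 1.0)],
    "G": [],
}
H0 = {s: 0.0 for s in GRAPH}  # untrained heuristic: zero everywhere

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, k_step_target(GRAPH, H0, "A", "G", k))
```

With an untrained heuristic, the one-step target for A is 1.0, while a horizon of 3 already recovers the true cost-to-go of 3.0, which shows why longer horizons can give better-informed targets per update.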

PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

PROPA：通过强化学习实现视觉推理的过程级优化

Authors: Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.10279
Pdf link: https://arxiv.org/pdf/2511.10279
Abstract Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: this https URL.
中文摘要 尽管取得了重大进展，视觉语言模型（VLM）仍然在复杂的视觉推理方面苦苦挣扎，其中多步骤依赖关系会导致早期错误在推理链中级联。现有的训练后范式是有限的：监督微调（SFT）依赖于昂贵的阶梯级注释，而GRPO等具有可验证奖励的强化学习（RLVR）方法仅提供稀疏的、结果级的反馈，阻碍了稳定的优化。我们引入了 PROPA（具有交错策略对齐的流程级推理优化），这是一个新颖的框架，它将蒙特卡洛树搜索（MCTS）与 GRPO 集成在一起，以生成密集的流程级奖励，并在每个中间步骤优化推理，而无需人工注释。为了克服冷启动问题，PROPA将GRPO更新与SFT交错，使模型能够从成功和失败的推理轨迹中学习。进一步训练过程奖励模型（PRM）以指导推理时间搜索，使测试时间搜索与训练信号保持一致。在七个基准测试和四个 VLM 主干网中，PROPA 始终优于基于 SFT 和 RLVR 的基线。与现有最先进的技术相比，它在域内任务上实现了高达 17.0% 的增益，在域外任务上实现了高达 21.0% 的增益，为视觉推理任务建立了强大的推理和泛化能力。代码可在以下位置获得：此 https URL。

Causal Model-Based Reinforcement Learning for Sample-Efficient IoT Channel Access

基于因果模型的强化学习，实现样本效率高的物联网信道接入

Authors: Aswin Arun, Christo Kurisummoottil Thomas, Rimalpudi Sarvendranath, Walid Saad
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.10291
Pdf link: https://arxiv.org/pdf/2511.10291
Abstract Despite the advantages of multi-agent reinforcement learning (MARL) for wireless use cases such as medium access control (MAC), its real-world deployment in the Internet of Things (IoT) is hindered by its sample inefficiency. To alleviate this challenge, one can leverage model-based reinforcement learning (MBRL) solutions; however, conventional MBRL approaches rely on black-box models that are not interpretable and cannot reason. In contrast, in this paper, a novel causal model-based MARL framework is developed by leveraging tools from causal learning. In particular, the proposed model can explicitly represent causal dependencies between network variables using structural causal models (SCMs) and attention-based inference networks. Interpretable causal models are then developed to capture how MAC control messages influence observations, how transmission actions determine outcomes, and how channel observations affect rewards. Data augmentation techniques are then used to generate synthetic rollouts using the learned causal model for policy optimization via proximal policy optimization (PPO). Analytical results demonstrate exponential sample complexity gains of causal MBRL over black-box approaches. Extensive simulations demonstrate that, on average, the proposed approach can reduce environment interactions by 58%, and yield faster convergence compared to model-free baselines. The proposed approach is also shown to inherently provide interpretable scheduling decisions via attention-based causal attribution, revealing which network conditions drive the policy. The resulting combination of sample efficiency and interpretability establishes causal MBRL as a practical approach for resource-constrained wireless systems.
中文摘要 尽管多智能体强化学习（MARL）在介质访问控制（MAC）等无线用例中具有优势，但其在物联网（IoT）中的实际部署因样本效率低下而受到阻碍。为了缓解这一挑战，可以利用基于模型的强化学习（MBRL）解决方案；然而，传统的 MBRL 方法依赖于不可解释且无法推理的黑盒模型。相比之下，本文利用因果学习工具开发了一种基于因果模型的新型 MARL 框架。特别是，所提出的模型可以使用结构因果模型（SCM）和基于注意力的推理网络来显式表示网络变量之间的因果依赖关系。然后开发可解释的因果模型来刻画 MAC 控制消息如何影响观察、传输操作如何决定结果以及信道观察如何影响奖励。随后使用数据增强技术，借助学习到的因果模型生成合成轨迹，并通过近端策略优化（PPO）进行策略优化。分析结果表明，与黑盒方法相比，因果 MBRL 在样本复杂度上取得了指数级的改进。大量仿真表明，与无模型基线相比，所提出的方法平均可以减少 58% 的环境交互，并获得更快的收敛速度。所提出的方法还被证明可以通过基于注意力的因果归因提供可解释的调度决策，揭示哪些网络条件驱动策略。由此产生的样本效率和可解释性的结合将因果 MBRL 确立为资源受限无线系统的实用方法。

Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

纠正评估偏好：通过困惑度感知强化学习提升 LLM 对数学推理的批判能力

Authors: Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, Zhi Guo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.10303
Pdf link: https://arxiv.org/pdf/2511.10303
Abstract To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by themselves and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon: "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
中文摘要 为了改进大型语言模型（LLM）的多步数学推理（Multi-step Mathematical Reasoning，MsMR），通过自动批判 MsMR 推理过程中的错误并对问题解答做出最终判断，从语料库获得可扩展的监督至关重要。现有方法大多依赖于精心制作高质量的监督微调示范来增强批判能力，而很少深入研究 LLM 批判性能不佳的根本原因。本文正交地量化并调查了潜在原因，即评估偏好不平衡，并进行了统计偏好分析。基于对原因的分析，提出了一种新的困惑度感知强化学习算法来纠正评估偏好，提升批判能力。具体来说，为了探究 LLM 的批判特征，精心构建了一对多问题解答（OPS）基准，以量化 LLM 在评估自身和其他模型生成的问题解答时的行为差异。然后，为了深入调查行为差异，我们进行了以困惑度为导向的统计偏好分析，发现了一个有趣的现象：“LLM 倾向于将困惑度较低的解答判断为正确”，这被称为不平衡的评估偏好。为了纠正这种偏好，我们将困惑度视为群体相对策略优化（GRPO）算法中的指挥棒，支持 LLM 探索将较低困惑度判断为错误、将较高困惑度判断为正确的轨迹。在我们构建的 OPS 和现有可用的批判基准上的大量实验结果证明了我们方法的有效性。
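
Illustrative sketch (not from the paper): the perplexity signal that the abstract above builds on can be computed from per-token log-probabilities. The snippet uses the standard definition (exponential of the negative mean token log-likelihood). How the authors feed it into Group Relative Policy Optimization is not specified in the abstract and is not modeled here. The function name and the example log-probabilities are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for two candidate solutions.
CONFIDENT = [math.log(0.9)] * 5  # tokens the model finds likely
HESITANT = [math.log(0.2)] * 5   # tokens the model finds unlikely

if __name__ == "__main__":
    print(perplexity(CONFIDENT), perplexity(HESITANT))
```

Under the preference described in the abstract, a critic would tend to accept the first (low-perplexity) solution regardless of whether it is actually correct.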

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

MonkeyOCR v1.5 技术报告：为复杂模式解锁强大的文档解析

Authors: Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10390
Pdf link: https://arxiv.org/pdf/2511.10390
Abstract Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
中文摘要 文档解析是文档智能中的一项核心任务，支持信息提取、检索增强生成和自动文档分析等应用程序。然而，现实世界的文档通常具有复杂的布局，包括多级表格、嵌入式图像或公式以及跨页结构，这对于现有的 OCR 系统来说仍然具有挑战性。我们介绍了 MonkeyOCR v1.5，这是一个统一的视觉语言框架，通过两阶段解析管道增强了布局理解和内容识别。第一阶段采用大型多模态模型共同预测文档布局和阅读顺序，利用视觉信息确保结构和顺序的一致性。第二阶段对检测区域内的文本、公式和表格进行局部识别，保持高视觉保真度，同时减少错误传播。为了解决复杂的表格结构，我们提出了一种基于视觉一致性的强化学习方案，该方案通过渲染和比较对齐来评估识别质量，无需手动注释即可提高结构准确性。此外，还引入了两个专门的模块，即图像解耦表解析和类型引导表合并，以实现对包含嵌入图像的表的可靠解析以及跨页或跨列表的重建。OmniDocBench v1.5 上的综合实验表明，MonkeyOCR v1.5 实现了最先进的性能，优于 PPOCR-VL 和 MinerU 2.5，同时在视觉复杂的文档场景中表现出卓越的鲁棒性。

AgentEvolver: Towards Efficient Self-Evolving Agent System

AgentEvolver：迈向高效自我进化的代理系统

Authors: Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, Zhaoyang Liu, Bolin Ding, Jingren Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.10395
Pdf link: https://arxiv.org/pdf/2511.10395
Abstract Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
中文摘要 由大型语言模型（LLM）提供支持的自主智能体有可能通过推理、使用工具和在不同环境中执行复杂任务来显著提高人类生产力。然而，目前开发此类智能体的方法仍然成本高昂且效率低下，因为它们通常需要手动构建的任务数据集和依赖大量随机探索的强化学习（RL）流程。这些限制导致数据构建成本高得令人望而却步，探索效率低，样本利用率低。为了应对这些挑战，我们提出了 AgentEvolver，这是一个自我进化的智能体系统，它利用 LLM 的语义理解和推理能力来驱动自主智能体学习。AgentEvolver 引入了三种协同机制：（i）自我提问，它可以在新环境中生成好奇心驱动的任务，减少对手工数据集的依赖；（ii）自我导航，通过经验重用和混合策略引导提高探索效率；（iii）自我归因，通过根据轨迹中状态和动作的贡献为其分配差异化奖励来提高样本效率。通过将这些机制集成到一个统一的框架中，AgentEvolver 可以实现可扩展、经济高效且持续的智能体能力改进。初步实验表明，与传统的基于 RL 的基线相比，AgentEvolver 实现了更高效的探索、更好的样本利用率和更快的适应。

Explaining Decentralized Multi-Agent Reinforcement Learning Policies

解释去中心化多智能体强化学习策略

Authors: Kayla Boggess, Sarit Kraus, Lu Feng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10409
Pdf link: https://arxiv.org/pdf/2511.10409
Abstract Multi-Agent Reinforcement Learning (MARL) has gained significant interest in recent years, enabling sequential decision-making across multiple agents in various domains. However, most existing explanation methods focus on centralized MARL, failing to address the uncertainty and nondeterminism inherent in decentralized settings. We propose methods to generate policy summarizations that capture task ordering and agent cooperation in decentralized MARL policies, along with query-based explanations for When, Why Not, and What types of user queries about specific agent behaviors. We evaluate our approach across four MARL domains and two decentralized MARL algorithms, demonstrating its generalizability and computational efficiency. User studies show that our summarizations and explanations significantly improve user question-answering performance and enhance subjective ratings on metrics such as understanding and satisfaction.
中文摘要 近年来，多智能体强化学习（MARL）引起了人们的极大兴趣，它支持跨各个领域的多个智能体进行顺序决策。然而，大多数现有的解释方法都集中在中心化 MARL，未能解决去中心化环境中固有的不确定性和非确定性。我们提出了生成策略摘要的方法，以捕获去中心化 MARL 策略中的任务排序和代理协作，以及对何时、为什么不和什么类型的用户查询特定代理行为的基于查询的解释。我们跨四个 MARL 域和两个去中心化 MARL 算法评估了我们的方法，展示了其通用性和计算效率。用户研究表明，我们的总结和解释显着提高了用户的问答性能，并增强了对理解和满意度等指标的主观评分。

Reasoning About Intent for Ambiguous Requests

关于歧义请求的意图的推理

Authors: Irina Saparina, Mirella Lapata
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10453
Pdf link: https://arxiv.org/pdf/2511.10453
Abstract Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
中文摘要 大型语言模型通常通过隐式承诺一种解释来响应模棱两可的请求。意图误解可能会让用户感到沮丧并造成安全风险。为了解决这个问题，我们建议在对模棱两可的请求的单个结构化响应中生成多个解释-答案对。我们的模型通过强化学习和自定义奖励函数进行训练，使用多个有效答案作为监督。对话式问答和语义解析的实验表明，我们的方法比基线方法实现了更高的有效答案覆盖率。人工评估证实，预测的解释与他们的答案高度一致。我们的方法通过明确的解释提高透明度，只需一个生成步骤即可实现效率，并通过其结构化输出格式支持下游应用程序。

Strategic Opponent Modeling with Graph Neural Networks, Deep Reinforcement Learning and Probabilistic Topic Modeling

使用图神经网络、深度强化学习和概率主题建模进行战略对手建模

Authors: Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniots, Leonidas Bakopoulos
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10501
Pdf link: https://arxiv.org/pdf/2511.10501
Abstract This paper provides a comprehensive review of mainly Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) Machine Learning methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of Graph Neural Networks (GNN). Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of Reinforcement Learning (RL), and in particular that of Multiagent Deep Reinforcement Learning (MADRL). We then describe existing relevant game theoretic solution concepts and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes Probabilistic Topic Modeling (PTM) in domains other than that of document analysis and classification. The capability of PTM to estimate unknown underlying distributions can help with tackling heterogeneity and unknown agent beliefs. Finally, we identify certain open challenges, specifically the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, and (iv) guarantee scalability and solution tractability.
中文摘要 本文主要对图神经网络、深度强化学习和概率主题建模方法进行了全面综述，重点关注它们在战略多智能体环境中的潜在应用。我们关注（i）目前用于发现未知模型结构、可适用于战略对手建模任务的机器学习方法，以及（ii）这些方法与博弈论概念的结合，这些概念避免依赖在现实场景中通常无效的假设，例如共同先验假设（CPA）和自利假说（SIH）。我们分析了处理不确定性和异质性的能力（这两个特征在实际应用案例中非常常见）以及可扩展性。作为在多智能体环境中有效建模关系和交互的潜在答案，我们倡导使用图神经网络（GNN）。这种方法旨在对图结构数据进行操作，并且已被证明是执行节点分类和链路预测等任务的非常强大的工具。接下来，我们回顾了强化学习（RL）的领域，特别是多智能体深度强化学习（MADRL）的领域。随后，我们描述了现有的相关博弈论解概念，并考虑了公平性和稳定性等属性。我们的综述还附有关于在文档分析和分类以外的领域使用概率主题建模（PTM）的文献的说明。PTM 估计未知潜在分布的能力有助于解决异质性和未知的智能体信念。最后，我们指出了若干开放挑战，具体而言需要（i）适应非平稳环境，（ii）平衡稳定性和适应程度，（iii）解决不确定性和异质性，（iv）保证可扩展性和解的可处理性。

Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

基于评分标准的基准测试和强化学习，以推进 LLM 指令遵循

Authors: Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Selina Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.10507
Pdf link: https://arxiv.org/pdf/2511.10507
Abstract Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
中文摘要 大型语言模型（LLM）的最新进展在一系列任务上取得了令人印象深刻的性能，但高级指令遵循（IF），尤其是对于复杂、多轮和系统提示的指令，仍然是一个重大挑战。由于缺乏高质量的、人工注释的基准和可靠、可解释的奖励信号，对此类能力的严格评估和有效训练受到阻碍。在这项工作中，我们引入了 AdvancedIF（我们将很快发布此基准测试），这是一个综合基准测试，包含 1,600 多个提示和专家策划的评分标准，用于评估 LLM 遵循复杂、多轮和系统级指令的能力。我们进一步提出了 RIFL（基于评分标准的指令遵循学习），这是一种新颖的训练后管道，它利用评分标准生成、微调的评分标准验证器和奖励塑造来实现面向指令遵循的有效强化学习。大量实验表明，RIFL 显着提高了 LLM 的指令遵循能力，在 AdvancedIF 上实现了 6.7% 的绝对增益，在公共基准测试中取得了强劲的成绩。我们的消融研究证实了 RIFL 中每个组件的有效性。这项工作将评分标准确立为训练和评估 LLM 高级 IF 的强大工具，为更强大、更可靠的人工智能系统铺平了道路。
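
The abstract above describes turning rubric verdicts into a reward signal but does not say how the verdicts are aggregated. The following is a minimal sketch, assuming the reward is either the fraction of rubric criteria a response satisfies or an all-or-nothing score. The names `rubric_reward` and `keyword_verifier` are hypothetical, and the keyword check merely stands in for the finetuned rubric verifier; none of this is the authors' implementation.

```python
from typing import Callable, Sequence


def rubric_reward(
    response: str,
    rubric: Sequence[str],
    verifier: Callable[[str, str], bool],
    all_or_nothing: bool = False,
) -> float:
    """Aggregate per-criterion verdicts into a scalar reward in [0, 1].

    Assumption: the aggregation rule (mean vs. all-or-nothing) is
    illustrative; the abstract does not specify it.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    # One boolean verdict per rubric criterion.
    verdicts = [verifier(response, criterion) for criterion in rubric]
    if all_or_nothing:
        return 1.0 if all(verdicts) else 0.0
    return sum(verdicts) / len(verdicts)


def keyword_verifier(response: str, criterion: str) -> bool:
    """Toy stand-in for a learned verifier: case-insensitive substring match."""
    return criterion.lower() in response.lower()


if __name__ == "__main__":
    rubric = ["hello", "thanks", "bye"]
    print(rubric_reward("Hello World, thanks", rubric, keyword_verifier))
```

A fractional score gives a denser learning signal, while the all-or-nothing variant is stricter about partially compliant responses; which one suits a given setup is a design choice this sketch leaves open.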

Towards Emotionally Intelligent and Responsible Reinforcement Learning

迈向具有情感智能且负责任的强化学习

Authors: Garapati Keerthana, Manik Gupta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.10573
Pdf link: https://arxiv.org/pdf/2511.10573
Abstract Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
中文摘要 医疗保健和行为支持中的个性化决策系统通常依赖于静态的、基于规则的或参与度最大化的启发式方法，而忽略了用户的情感背景和道德约束。这种方法可能会推荐不敏感或不安全的干预措施，特别是在涉及严重精神疾病、物质使用障碍或抑郁症的领域。为了解决这一限制，我们提出了一个负责任的强化学习（RRL）框架，该框架将情感和背景理解与道德考虑整合到顺序决策过程中。RRL 将个性化表述为约束马尔可夫决策过程（CMDP），智能体在其中优化参与度和依从性，同时确保情感一致性和道德安全。我们引入了一种多目标奖励函数，该函数明确平衡了短期行为参与与长期用户福祉，并定义了一种情绪知情状态表示，以捕捉情绪准备、情感和风险的波动。所提出的架构可以使用任何 RL 算法（例如 DQN、PPO）进行实例化，并通过安全约束或拉格朗日正则化进行增强。从概念上讲，该框架在机器学习策略优化中实施同理心和责任，将安全 RL、情感计算和负责任的 AI 联系起来。我们讨论了这种方法对行为健康、教育和数字治疗等以人为本的领域的影响，并为未来的实证工作概述了基于模拟的验证路径。本文旨在发起一场关于道德一致的强化学习的方法论对话，以实现情感意识和值得信赖的个性化系统。
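
For readers unfamiliar with the CMDP formulation named in the abstract above, the standard textbook form and its Lagrangian relaxation can be written as follows. This is the generic formulation, not necessarily the exact objective used in the paper; the symbols $r$, $c_i$, $d_i$, and $\lambda_i$ are assumptions introduced here, with $c_i$ standing for a cost such as an emotional-alignment or safety violation and $d_i$ for its budget.

```latex
\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.}\quad & J_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i, \qquad i = 1, \dots, m \\
\mathcal{L}(\pi, \lambda) \;=\; & J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c_i}(\pi) - d_i\big), \qquad \lambda_i \ge 0
\end{aligned}
```

In a Lagrangian approach, the policy is updated to increase $\mathcal{L}$ while each multiplier $\lambda_i$ is raised whenever its constraint is violated, which is one common way to attach safety constraints to algorithms such as PPO.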

Instella: Fully Open Language Models with Stellar Performance

Instella：具有出色性能的完全开放的语言模型

Authors: Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10628
Pdf link: https://arxiv.org/pdf/2511.10628
Abstract Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
中文摘要 大型语言模型（LLM）在广泛的任务中表现出了卓越的性能，但大多数高性能模型仍然是闭源或部分开放的，限制了透明度和可重复性。在这项工作中，我们介绍了 Instella，这是一个完全开放的 30 亿参数语言模型系列，完全在开放可用的数据和代码库上进行训练。Instella 由 AMD Instinct MI300X GPU 提供支持，通过大规模预训练、通用指令调整和与人类偏好保持一致而开发。尽管与许多同时代的模型相比，使用的预训练 token 要少得多，但 Instella 在完全开放的模型中取得了最先进的结果，并且与同等规模的领先开放权重模型具有竞争力。我们进一步发布了两个专门的变体：Instella-Long，能够处理高达 128K token 的上下文长度，以及 Instella-Math，这是一种以推理为中心的模型，通过数学任务的监督微调和强化学习得到增强。这些贡献共同使 Instella 成为社区透明、高性能且多功能的替代方案，推进了开放和可重复的语言建模研究的目标。

Robot Crash Course: Learning Soft and Stylized Falling

机器人速成班：学习软坠落和风格化坠落

Authors: Pascal Strauch, David Müller, Sammy Christen, Agon Serifi, Ruben Grandia, Espen Knoop, Moritz Bächer
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10635
Pdf link: https://arxiv.org/pdf/2511.10635
Abstract Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot's end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
中文摘要 尽管最近在强劲运动方面取得了进展，但在现实世界中运行的双足机器人仍然面临跌倒的风险。虽然大多数研究都集中在预防此类事件上，但我们却专注于跌倒本身的现象。具体来说，我们的目标是减少对机器人的物理伤害，同时为用户提供对机器人最终姿势的控制。为此，我们提出了一种与机器人无关的奖励函数，该函数在强化学习期间平衡了实现所需最终姿势与冲击最小化和关键机器人部件的保护。为了使策略对广泛的初始坠落条件具有鲁棒性，并能够在推理时指定任意和看不见的最终位姿，我们引入了基于仿真的初始和结束位姿采样策略。通过模拟和真实世界的实验，我们的工作表明，即使是双足机器人也可以进行受控的软坠落。
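
The abstract above describes a reward that trades off reaching a desired end pose against impact minimization and protection of critical parts, without giving its functional form. The sketch below is one hypothetical way to combine these three terms. The function name, the exponential pose term, and all weights are assumptions made for illustration and do not come from the paper.

```python
import math


def falling_reward(
    pose_error: float,
    impact_force: float,
    critical_contact_force: float,
    w_pose: float = 1.0,
    w_impact: float = 0.5,
    w_critical: float = 2.0,
) -> float:
    """Hypothetical per-step reward for a controlled fall.

    pose_error: distance between current and desired end pose (>= 0).
    impact_force: normalized contact force over the whole body (>= 0).
    critical_contact_force: normalized contact force on fragile parts (>= 0).
    """
    # Pose term is 1.0 when the end pose is matched and decays with error.
    pose_term = w_pose * math.exp(-pose_error)
    # Impacts are penalized; contacts on critical parts are penalized more.
    impact_penalty = w_impact * impact_force
    critical_penalty = w_critical * critical_contact_force
    return pose_term - impact_penalty - critical_penalty


if __name__ == "__main__":
    print(falling_reward(pose_error=0.2, impact_force=0.3, critical_contact_force=0.0))
```

Weighting critical-part contact more heavily than general impact reflects the stated goal of protecting critical robot parts; the actual balance would need tuning per robot.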

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

通过自洽抽样增强MLLM基于结果奖励的RL训练

Authors: Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.10648
Pdf link: https://arxiv.org/pdf/2511.10648
Abstract Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
中文摘要 结果奖励强化学习（RL）是完善多模态大型语言模型（MLLM）的逐步推理的一种常见且越来越重要的方法。在多项选择题设置（多模态推理基准的主导格式）中，范式面临着一个重大但经常被忽视的障碍：在错误的思维链后猜测正确选项的不忠实轨迹会获得与真正推理相同的奖励，这是一个不容忽视的缺陷。我们建议使用自洽抽样（SCS）来纠正这个问题。对于每个问题，SCS（i）引入小的视觉扰动，以及（ii）对初始轨迹进行重复截断和重采样；生成的轨迹之间的一致性会产生可微分的一致性分数，从而在策略更新期间降低不可靠轨迹的权重。基于 Qwen2.5-VL-7B-Instruct，将 SCS 插入 RLOO、GRPO 和 REINFORCE++ 系列，在六个多模态基准测试中将准确率最多提高 7.7 个百分点，额外计算可以忽略不计。SCS 在 Qwen2.5-VL-3B-Instruct 和 InternVL3-8B 上也产生了显着的收益，为 MLLM 中的结果奖励 RL 提供了一种简单、通用的补救措施。
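
To make the idea of down-weighting unreliable traces concrete, here is a minimal sketch, assuming the consistency score is simply the fraction of resampled trajectories whose final answer agrees with the original one. The paper describes a differentiable score and visual perturbations; this plain agreement ratio is a simplified stand-in, and the function names are hypothetical.

```python
from typing import Sequence


def consistency_score(final_answer: str, resampled_answers: Sequence[str]) -> float:
    """Fraction of resampled continuations that reach the same final answer.

    Assumption: each resampled answer comes from truncating the original
    trajectory and regenerating the remainder (not shown here).
    """
    if not resampled_answers:
        raise ValueError("need at least one resampled answer")
    agreeing = sum(answer == final_answer for answer in resampled_answers)
    return agreeing / len(resampled_answers)


def weighted_reward(
    outcome_reward: float, final_answer: str, resampled_answers: Sequence[str]
) -> float:
    """Scale the outcome reward by the trajectory's consistency score."""
    return outcome_reward * consistency_score(final_answer, resampled_answers)


if __name__ == "__main__":
    # A correct answer that most resamples reproduce keeps most of its reward.
    print(weighted_reward(1.0, "B", ["B", "B", "A", "B"]))
    # A lucky guess that resamples rarely reproduce is down-weighted.
    print(weighted_reward(1.0, "B", ["A", "C", "B", "D"]))
```

Under this simplification, a trajectory that guessed the right option receives a smaller share of the reward than one whose reasoning consistently leads to the same answer.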

Keyword: diffusion policy

Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

观点：迈向鲁棒机器人学习的统一表达策略优化

Authors: Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.10087
Pdf link: https://arxiv.org/pdf/2511.10087
Abstract Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves a +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
中文摘要 离线到在线强化学习（O2O-RL）已成为安全高效的机器人策略部署的一种有前途的范式，但面临着两个基本挑战：多模态行为的覆盖范围有限和在线适应过程中的分布变化。我们提出了 UEPO，这是一个统一的生成框架，其灵感来自大型语言模型预训练和微调策略。我们的贡献是三重的：（1）多种子动态感知扩散策略，无需训练多个模型即可有效捕获多种模态；（2）强制执行具有物理意义的策略多样性的动态散度正则化机制；（3）基于扩散的数据增强模块，增强动力学模型泛化。在 D4RL 基准测试中，UEPO 在运动任务上比 Uni-O4 实现了 +5.9% 的绝对改进，在灵巧操作上实现了 +12.4%，表现出很强的泛化性和可扩展性。