Arxiv Papers of Today

Generated: 2026-05-21 19:30:44 (UTC+8); arXiv announcement time: 2026-05-21 20:00 EDT (2026-05-22 08:00 UTC+8)

63 relevant papers today

Keyword: reinforcement learning

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

SOLAR：一种自我优化的开放式自主智能体，实现终身学习和持续适应

Authors: Nitin Vetcha, Dianbo Liu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20189
Pdf link: https://arxiv.org/pdf/2605.20189
Abstract Despite the remarkable success of large language models (LLMs), they still face bottlenecks when deployed in dynamic, real-world settings, the primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR), an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge, making it effective for transfer learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.
中文摘要 尽管大型语言模型（LLM）取得了显著成功，但在动态的现实环境中部署时仍面临瓶颈，主要挑战是概念漂移和基于梯度的适应成本较高。传统的微调（FT）难以适应非平稳数据流，往往会导致灾难性遗忘，或需要大量人工数据整理。为了解决流式和持续学习范式中的这些局限，我们提出了自我优化终身自主推理器（SOLAR），这是一种开放式自主智能体，利用参数级元学习进行自我改进，将模型权重视为探索的环境。它首先巩固关于常识知识的强先验，使其在迁移学习中非常有效。通过采用多层次强化学习方法，SOLAR 自主发现适应策略，实现对未见领域的高效测试时适应。关键是，SOLAR 维护着一个不断演化的有效修改策略知识库，隐式地充当情节记忆缓冲区，以平衡可塑性（适应新任务）与稳定性（元知识的保留）。实验表明，SOLAR在常识、数学、医学、编码、社会和逻辑推理任务上优于强基线，标志着自主智能体在不断变化的环境中实现终身适应方面迈出了重要一步。

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

闭环优化、仿真与建模编排的工具增强代理

Authors: Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2605.20190
Pdf link: https://arxiv.org/pdf/2605.20190
Abstract Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.
中文摘要 迭代工业设计-仿真优化被CAD-CAE语义差距所限制：将仿真反馈转化为在多样耦合约束下的有效几何编辑。为填补这一空白，我们提出了COSMO-Agent（闭环优化、仿真与建模编排）框架，这是一种工具增强强化学习（RL）框架，旨在教授LLM完成闭环CAD-CAE过程。具体来说，我们将CAD生成、CAE求解、结果解析和几何修订构建为一个互动式强化学习环境，LLM学习协调外部工具并修正参数几何，直到满足约束条件。为了使这种学习稳定且具工业实用性，我们设计了一种多约束奖励，共同鼓励可行性、工具链的稳健性和结构化的输出有效性。此外，我们还提供了一个行业对齐的数据集，涵盖25个组件类别，并可执行CAD-CAE任务，以支持真实的训练和评估。实验显示，COSMO-Agent训练显著提升了小型开源LLMs的约束驱动设计能力，在可行性、效率和稳定性方面超过了大型开源和强闭源模型。

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

通过基于代理的思维链调优进行长上下文推理

Authors: Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20201
Pdf link: https://arxiv.org/pdf/2605.20201
Abstract Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.
中文摘要 近年来的大型语言模型支持多达1000万个代币的输入，但在需要复杂推理的长上下文任务中表现较差。这类任务只需用输入的子集——代理上下文——来解决，而不必使用完整序列。尽管共享相同的底层推理过程，模型在代理模型与完整上下文之间表现出显著的性能差异。为了提升长上下文推理能力，我们提出了ProxyCoT，一种新颖的训练框架，将推理能力从短代理上下文转移到完整的长上下文中。具体来说，我们首先通过强化学习或更大型教师模型的提炼，获得代理上下文上的高质量思维链推理痕迹，然后通过监督微调将生成的痕迹置于完整的长上下文中。跨不同数据集的实验表明，ProxyCoT在计算开销更低的情况下，始终优于强基线。此外，使用 ProxyCoT 训练的模型将其长上下文推理能力推广到域外任务。
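
The two-stage recipe described in the abstract above can be pictured as a data-construction step. The sketch below is illustrative only: `Example`, `build_sft_example`, and the prompt template are hypothetical names that do not come from the paper, and the reasoning trace is assumed to have been produced already (by RL or a teacher model) on the proxy context.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    full_context: str    # the entire long input
    proxy_context: str   # the subset of the input that suffices to answer
    proxy_trace: str     # chain-of-thought obtained on the proxy context


def build_sft_example(ex: Example) -> dict:
    """Pair the trace obtained on the short proxy context with the full context.

    Hypothetical template: the prompt carries the full long context, while the
    target is the reasoning trace that was generated from the proxy context.
    """
    prompt = f"Context:\n{ex.full_context}\n\nQuestion: {ex.question}\n"
    return {"prompt": prompt, "target": ex.proxy_trace}


ex = Example(
    question="Who wrote the report?",
    full_context="(many unrelated pages) ... The report was written by Ada. ...",
    proxy_context="The report was written by Ada.",
    proxy_trace="The context states the report was written by Ada. Answer: Ada",
)
sft = build_sft_example(ex)
```

Under this assumption the model never sees the proxy context at fine-tuning time; it only sees the long input together with a trace that was cheap to obtain on the short one.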

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

NaP-Control：导航扩散先验，实现多功能且快速的角色控制

Authors: Chia-Wen Chen, Yan Wu, Korrawe Karunratanakul, Siyu Tang
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.20209
Pdf link: https://arxiv.org/pdf/2605.20209
Abstract Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.
中文摘要 在基于物理的动画中实现精准且多功能的全身角色控制依然充满挑战。近期基于扩散的策略产生丰富且富有表现力的运动，但通常依赖基于梯度的测试时间指导来满足任务目标，这速度较慢且可能降低鲁棒性。我们介绍NaP控制（NaP-Control，导航扩散前置以实现灵活快速的字符控制），简称NaP。我们的方法利用强化学习来操控任务无关扩散策略的潜在噪声，引导其朝向任务特定行为，实现快速、稳健且高运动保真度的控制。与仅依赖线下训练的方法不同，NaP在训练过程中与环境互动，纠正动作并优化任务奖励，提高成功率并促进对挑战情境的适应。通过直接预测任务优化的扩散噪声，NaP消除了去噪过程中的迭代引导，实现了高效的推断。实验表明，NaP在保持自然运动的同时，能够实现更高的成功率和更快的推断。

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

GROW：将GRPO与开放世界VLM代理的状态动作建模对齐

Authors: Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20246
Pdf link: https://arxiv.org/pdf/2605.20246
Abstract Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long context and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.
中文摘要 近年来，视觉语言模型（VLM）代理在开放世界任务中取得了有希望的进展，在这些任务中，成功完成任务往往需要多次视觉感知和行动执行。然而，现有方法仍主要依赖监督微调（SFT）和专家演示，而高级强化学习（RL）算法，特别是群相对策略优化（GRPO），在这些任务中尚未有效应用，因为标准GRPO要求训练样本为完整轨迹，导致上下文过长且噪声过长。为解决这一问题，我们提出了GROW框架，这是一种面向开放世界VLM代理的强化学习框架，它将收集的轨迹分解为状态-动作样本，并计算这些样本之间的优势，而非将完整轨迹视为单一实体。我们还提供了替代分析，表明即使分组样本基于不同的局部状态而非相同的提示上下文，目标在简化假设下仍能保持GRPO的核心相对政策优化信号。对800多个Minecraft任务的实验表明，我们的方法实现了最先进的（SOTA）性能，展示了我们提出的强化学习框架在开放世界VLM代理中的有效性。
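
To make the idea of computing advantages over state-action samples concrete, here is a minimal sketch. It uses the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). How GROW actually forms groups and assigns rewards to individual steps is not stated in the abstract, so copying the trajectory reward to every step and treating all samples as one group are assumptions, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def state_action_samples(trajectories):
    """Flatten trajectories into (state, action, reward) samples.

    Each trajectory is a list of (state, action) steps plus one scalar reward;
    assigning the trajectory reward to every step is an assumption made here.
    """
    samples = []
    for steps, reward in trajectories:
        for state, action in steps:
            samples.append((state, action, reward))
    return samples


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


trajectories = [
    ([("s0", "chop"), ("s1", "craft")], 1.0),  # successful episode
    ([("s0", "walk"), ("s2", "idle")], 0.0),   # failed episode
]
samples = state_action_samples(trajectories)
advantages = group_relative_advantages([r for _, _, r in samples])
```

Each training sample is then a short (state, action) pair with its own advantage, so the context length per sample no longer grows with the number of turns.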

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

多智能体强化学习，在行人行为不确定性下实现安全自动驾驶

Authors: Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.20255
Pdf link: https://arxiv.org/pdf/2605.20255
Abstract Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. This paper describes a MARL environment in which an SDC and 12 pedestrians are co-trained using Multi-Agent Proximal Policy Optimization (MAPPO). Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high-level go/wait decisions. Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule-based baseline. A speed differential metric shows that the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating that jaywalking encounters were not anticipated. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions. Co-training with MARL pedestrians reduced collisions by 30% relative to single-agent RL, as pedestrians learned to wait when the SDC approached at speed.
中文摘要 基于仿真的自动驾驶汽车（SDC）测试通常依赖脚本化或简化的行人模型，这些模型无法捕捉真实人类过街行为的异质性和不确定性。这限制了安全评估的真实性，尤其是在涉及乱穿马路的场景中，因为乱穿马路由车辆无法观测的潜在个性特征所决定。我们假设，与针对固定行人策略训练SDC相比，使用多智能体强化学习（MARL）联合训练行人与SDC能产生更真实的交互场景，且可预测与不可预测过街之间的行为差距可直接从轨迹中测量。本文描述了一个MARL环境，其中一辆SDC和12名行人使用多智能体近端策略优化（MAPPO）共同训练。行人移动遵循脚本化的Dijkstra寻路，而强化学习策略则控制高层的通行/等待决策。乱穿马路的概率取决于每名行人的个性特征，该特征在每个回合开始时采样，且对SDC不可见。在500个回合的评估中，联合训练的SDC到达了78%的目标，碰撞率为14%，而最佳基于规则的基线到达35%的目标，碰撞率为33%。速度差指标显示，在近距离（0-3米）内，SDC在乱穿马路者附近的行驶速度比在人行横道使用者附近快2.65米/秒，表明其未能预判乱穿马路的情况。乱穿马路占过街事件的13%，却与62%的碰撞有关。与单智能体强化学习相比，与MARL行人联合训练使碰撞减少了30%，因为行人学会了在SDC高速接近时等待。
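
The two jaywalking percentages in the abstract above can be combined into a rough per-crossing risk ratio. This is a back-of-the-envelope reading, not a figure from the paper: it assumes every collision is attributed to exactly one crossing event and that non-jaywalking crossings account for the remaining shares.

```python
jaywalk_share_of_crossings = 0.13
jaywalk_share_of_collisions = 0.62

# Collision share divided by crossing share, for each crossing type.
jaywalk_ratio = jaywalk_share_of_collisions / jaywalk_share_of_crossings
crosswalk_ratio = (1 - jaywalk_share_of_collisions) / (1 - jaywalk_share_of_crossings)

# How much more collision-prone a jaywalking crossing is than a crosswalk one.
relative_risk = jaywalk_ratio / crosswalk_ratio
print(f"{relative_risk:.1f}")
```

Under those assumptions a jaywalking crossing is roughly eleven times as likely to be associated with a collision as a crosswalk crossing.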

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

FBOS-RL：反馈驱动双目标协同强化学习

Authors: Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20256
Pdf link: https://arxiv.org/pdf/2605.20256
Abstract Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopts a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment (EPA) and Exploration-oriented Capability Cultivation (ECC). Extensive experiments demonstrate that EPA and ECC can mutually reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning. Specifically, under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training.
中文摘要 强化学习已成为对齐和释放大规模模型推理能力的基石。GRPO及其变体的核心训练循环在推广抽样和政策更新之间交替进行。与监督学习不同，后者每个梯度步骤都锚定在一个明确的地面真实目标，在这种情况下更新模型参数的最佳梯度方向尚无预先已知;因此，抽样阶段绘制的高质量推广作为隐含的“老师”，指导每一次参数更新。然而，GRPO采用了一种简单的抽样方案，将所有推出的流程都以同一个原始提示为条件。当任务超出策略模型当前能力时，这种抽样方案很少能带来高质量的推广，导致策略模型在更新参数时缺乏有意义的梯度方向，导致训练停滞。为解决这一问题，我们提出了FBOS-RL，一种反馈驱动的双目标协同强化学习框架。具体来说，我们让模型基于环境反馈执行反馈引导探索增强，并在此基础上设计了两个相互强化的训练目标：面向开发的策略对齐（EPA）和探索导向的能力培养（ECC）。大量实验表明，EPA和ECC可以相互强化，形成正向飞轮效应，显著提升训练效率和强化学习的最终性能上限。具体来说，在相同数量的推广下，FBOS-RL的学习速度远快于GRPO和基于反馈的基线，最终达到更高的性能上限，同时在整个训练过程中表现出更高的策略熵和更低的梯度规范。

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

这需要两者：在大型语言模型中实现上下文完整性的互补自我蒸馏

Authors: Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, Kangsan Kim, Jinheon Baek, Seong Joon Oh, Sung Ju Hwang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2605.20258
Pdf link: https://arxiv.org/pdf/2605.20258
Abstract Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.
中文摘要 情境完整性（CI）将隐私定义为不仅仅是隐藏信息，而是根据特定情境的规范来管理信息流动。随着大型语言模型越来越多地作为个人代理部署，处理敏感工作流程，遵守CI变得至关重要。然而，即使是前沿模型在信息披露决策上仍不可靠，现有的缓解策略常常降低了基础任务的表现。为了克服隐私与效用的权衡，我们提出了SELFCI这一互补的自我提炼框架，将信息抑制与任务解决解耦。SELFCI联合优化了两种独立的反向知识谱分歧，分别针对基于反馈的教师分布：一种鼓励保留任务相关信息以供参考，另一种则强制最小且适当的披露。这一互补表述引入了专家产品（PoE）目标，使政策与能力与隐私要求的交叉点保持一致。实证评估表明，SELFCI在不依赖昂贵外部监督的情况下，持续优于在线强化学习算法（如GRPO）等竞争基线。这些趋势进一步延伸到涉及代理工作流和积累私有上下文的域外环境，表明SELFCI为CI对齐提供了一条切实可行的路径。
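
The link between two reverse KL terms and a Product-of-Experts target can be checked numerically. For discrete distributions, the sum KL(q || p1) + KL(q || p2) is minimized by the normalized geometric mean of p1 and p2. The sketch below assumes equal weights on the two terms and uses made-up teacher distributions; neither detail is taken from the paper, and the function names are hypothetical.

```python
import math


def kl(q, p):
    """KL(q || p) for discrete distributions with strictly positive p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def product_of_experts(p1, p2):
    """Normalized geometric mean of two distributions (equal weights assumed)."""
    g = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(g)
    return [gi / z for gi in g]


def objective(q, p1, p2):
    """Sum of the two reverse KL terms that the policy q would minimize."""
    return kl(q, p1) + kl(q, p2)


utility_teacher = [0.7, 0.2, 0.1]   # made-up numbers
privacy_teacher = [0.1, 0.3, 0.6]   # made-up numbers
q_star = product_of_experts(utility_teacher, privacy_teacher)

# Any other distribution should score no better than the PoE target.
candidates = [utility_teacher, privacy_teacher, [1 / 3, 1 / 3, 1 / 3]]
best = objective(q_star, utility_teacher, privacy_teacher)
```

The PoE target puts mass only where both teachers do, which matches the abstract's description of aligning the policy with the intersection of capability and privacy requirements.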

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

共形选择性行动：面向RLVR训练LLM的随时有效风险控制

Authors: Hamed Khosravi, Xiaoming Huo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.20270
Pdf link: https://arxiv.org/pdf/2605.20270
Abstract A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
中文摘要 一个本地专业LLM在操作方本地数据上通过可验证奖励强化学习（RLVR）进行微调，并部署在受监管的组织中，每次部署的错误预算为$\alpha$。运营方在每一轮都需要针对该部署数据流的安全证书：不跨部署汇总，也不等待长期平均值。现有包装器无法在自适应、在线更新的流上做到这一点：离线共形风险方法需要可交换性；在线共形方法仅约束长期平均值；不可交换的扩展仅具有边际有效性；而最接近的随时有效包装器A-RCPS控制的是边际风险而非选择性风险。利用（检验统计量、有效性保证、部署规则）框架，我们识别出一个由部署要求决定的空白单元格：每个阈值一个e过程、选择性风险、随时逐路径有效性、最大认证阈值规则。共形选择性行动（CSA）填补了这一空白：它是一种逐轮包装器，在Bonferroni网格上为每个阈值维护一个Ville型e过程，并相对于RLVR滤子（filtration）进行评估。在可预测更新和保序校准的单调风险下，我们证明了（i）随时逐路径的选择性风险界$R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$，（ii）与$\Theta(\bar\eta^{-2}\log(1/\delta))$匹配的速率最优认证，以及（iii）与时间范围无关的放行率差距。在八个专业基准（$480$条流）、十六个对抗性分布偏移单元（$160$条流）以及五个带在线LoRA的实时专家迭代RLVR单元（覆盖三个架构家族的四个基础模型，$10{,}300$轮）上，CSA是十种对比方法中唯一在每个单元上都满足逐路径有效性和非拒绝部署的方法。我们不提出新的LLM、训练算法或策略类；CSA是部署侧的补充，与模型正交，面向无法使用前沿API的运营方。
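
For readers unfamiliar with e-processes, the following is a generic betting-style construction with a Bonferroni split across thresholds. It is not the CSA algorithm: the fixed bet size `lam`, the function name, and the toy loss streams are all assumptions, and the selective (act-only) aspect of the risk is ignored. It only illustrates how a per-threshold wealth process combined with Ville's inequality yields a certificate that is valid at every round.

```python
def certify_thresholds(losses_by_threshold, alpha, delta, lam=1.0):
    """Return, per threshold, the first round at which risk <= alpha is certified.

    losses_by_threshold maps a threshold to its stream of losses in [0, 1].
    For each threshold the wealth process E_t = prod(1 + lam * (alpha - loss))
    is a nonnegative supermartingale under H0: E[loss | past] >= alpha, provided
    0 <= lam <= 1 / (1 - alpha). By Ville's inequality it exceeds 1/delta' with
    probability at most delta'; delta' = delta / K is the Bonferroni split.
    """
    assert 0.0 <= lam <= 1.0 / (1.0 - alpha)
    k = len(losses_by_threshold)
    bar = k / delta  # equals 1 / (delta / K)
    certified = {}
    for threshold, losses in losses_by_threshold.items():
        wealth = 1.0
        certified[threshold] = None  # None means never certified on this stream
        for t, loss in enumerate(losses, start=1):
            wealth *= 1.0 + lam * (alpha - loss)
            if wealth >= bar:
                certified[threshold] = t
                break
    return certified


streams = {
    0.9: [0.0] * 100,        # no errors when acting at this threshold
    0.5: [1.0, 0.0] * 50,    # errs every other round
}
result = certify_thresholds(streams, alpha=0.1, delta=0.1)
```

In this toy run the error-free stream is certified once its wealth reaches 20, while the stream that errs half the time never accumulates enough evidence.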

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

更小的抽象状态空间使强化学习中的跨尺度泛化成为可能

Authors: Nasehatul Mustakim, Lucas Lehnert
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20272
Pdf link: https://arxiv.org/pdf/2605.20272
Abstract While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.
中文摘要 虽然人类很容易将抽象概念推广到更复杂或更大的任务，但构建具备这种能力的强化学习（RL）系统仍然遥不可及。在这里，我们提出了第一个理论模型，说明如何在强化学习代理中实现这种非分布（OOD）泛化。我们的方法考虑部分可观测马尔可夫决策过程（POMDP），并假设智能代理使用抽象函数来判断哪些经验可以视为等价，哪些必须区分。首先，我们将现有的状态抽象框架和证明技术扩展到POMDPs。然后，我们定义了一个后继加权模型约简，这是一种模型约简变体，使其能够压缩到比以往定义更小的抽象空间。我们推导出代理的OOD测试表现的界限，从而定义了OOD推广可实现的条件。该界限将代理的性能损失分解为近似误差和估计误差，揭示了缩小代理抽象状态空间大小如何提升测试性能和OOD泛化。我们的分析表明，限制智能体只能在有限的抽象状态上操作，是实现向更复杂任务推广的必要条件。我们的研究结果激励了进一步研究能够跨越不同复杂度任务的强化学习架构。

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

通过轨迹积分反馈调控解剖感知奖励，用于体积计算机断层扫描分析

Authors: Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20277
Pdf link: https://arxiv.org/pdf/2605.20277
Abstract Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available on GitHub (this https URL).
中文摘要 医疗视觉语言模型（VLM）作为通用多模态助手迅速发展，但其在3D计算机断层扫描（CT）分析中的应用仍受优化目标与临床严谨性之间持续不匹配的限制。当前强化学习（RL）范式仍依赖词汇代理信号，诱发“评估幻觉”，即模型优化的是语言流畅度而非临床事实正确性，从而导致诊断上的关键错误。为弥合这一差距，我们引入了临床异常基准基底（CABS），这是一个结构化系统，将放射学报告分解为可验证的临床语义单元。利用CABS，我们在标准强化学习中识别出“机制发散”，其中表面相似性奖励驱使策略梯度绕过医学事实。因此，我们提出了轨迹积分反馈GRPO（TIF-GRPO），这是一个将控制理论原则整合进策略优化的新框架。通过将临床推理表述为用于异常发现的伪时间轨迹，TIF-GRPO通过积分反馈回路调节解剖感知奖励，将持续遗漏作为累积状态误差加以惩罚，并将幻觉作为过度的控制量加以抑制。在3D CT基准上的实验表明，我们的方法显著提升了异常检测和临床忠实度，为医学VLM的细粒度调控建立了新范式。我们的项目可在GitHub获取（this https URL）。

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

JUDO：面向工业异常问答的并置式领域导向多模态推理器

Authors: Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20284
Pdf link: https://arxiv.org/pdf/2605.20284
Abstract Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.
中文摘要 大型多模态模型（LMM）显著推动了工业异常检测，使得人类指令能够实现超越检测的多样化，特别是通过基于视觉的推理以更好地理解图像。然而，LMM缺乏领域特定的知识，这限制了其在复杂工业场景中生成准确响应的能力。在本研究中，我们介绍了JUDO（并置领域导向多模态推理器）框架，该框架高效地将领域知识和上下文融入视觉和文本推理。通过视觉推理，我们的模型通过将查询图像与正常图像并置作为视觉域上下文，对缺陷区域进行细致度的视觉比较检查。此外，我们通过监督微调（SFT）注入领域知识，以增强上下文理解，随后通过强化学习（GRPO）引导领域推理，采用定制奖励，采用面向领域的推理过程。实验结果显示，JUDO在MMAD基准测试中表现优异，超过了Qwen2.5-VL-7B和GPT-4o等模型。这些结果凸显了提升领域知识和上下文对于异常理解中有效推理的重要性。

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

内省式X训练：反馈条件提升LLM所有训练阶段的扩展性

Authors: Brandon Cui, Ximing Lu, Jaehun Jung, Syeda Nahida Akter, Hyunwoo Kim, Yuxiao Qu, David Acuna, Shrimai Prabhumoye, Yejin Choi, Prithviraj Ammanabrolu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20285
Pdf link: https://arxiv.org/pdf/2605.20285
Abstract We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural language critique based feedback, enabling quality aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback -- ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch all the way up to 18 Trillion tokens seen show that our method: bends scaling curves resulting in up to 2.8x more compute efficiency generally; and reaches performance levels unachievable for models trained otherwise in domains such as math and code.
中文摘要 我们探讨如何在当前LLM训练流程中数量众多且不断增加的各个阶段更高效地扩展。我们的指导直觉源于这样一个事实：流程后期阶段（如后训练）的动态可以用来指导预训练等早期阶段。为此，我们提出了内省训练（Introspective Training，简称IXT），其灵感来源于离线奖励条件强化学习，适用于训练的任何阶段。IXT使用思考型奖励模型，以基于自然语言批评的反馈来标注数据，从流程最早阶段起就实现质量感知训练。随后，通过用生成的反馈对数据进行前缀条件化来训练模型，确保从比通常早得多的训练阶段起，并非所有词元都被同等对待。在从零开始训练、最多见过18万亿个词元的7.5-12B基于Transformer的稠密LLM上进行的综合实验表明，我们的方法：使扩展曲线发生弯折，总体上带来最高2.8倍的计算效率提升；并在数学和代码等领域达到以其他方式训练的模型无法达到的性能水平。
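
Prefix-conditioning, as described in the abstract above, amounts to placing the generated feedback in front of each training document. The sketch below is a minimal illustration: the `<feedback>` tags, the function name, and the example critiques are placeholders, since the abstract does not specify a template or show any annotated data.

```python
def prefix_condition(document: str, feedback: str) -> str:
    """Prepend critique-style feedback so the model is trained on feedback + text.

    The tag names are placeholders; the abstract does not specify a template.
    """
    return f"<feedback>{feedback}</feedback>\n{document}"


corpus = [
    ("def add(a, b): return a + b", "Correct and concise; lacks a docstring."),
    ("2 + 2 = 5", "The arithmetic is wrong; the sum should be 4."),
]
training_texts = [prefix_condition(doc, fb) for doc, fb in corpus]
```

Because the critique precedes the document, the model can learn to associate the text that follows with the stated quality, which is one way to read the abstract's remark that not all tokens are treated equally.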

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

ParaVT：驯服工具先验悖论，实现智能体视频强化学习中的并行工具使用

Authors: Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.20342
Pdf link: https://arxiv.org/pdf/2605.20342
Abstract Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
中文摘要 通过强化学习（RL）训练大型多模态模型（LMM），以原生调用视频处理工具（如裁剪），已成为实现长视频理解的有前景路径。然而，现有的原生RL方法按顺序调度工具调用（即每回合一次）：一次错误的采集会传播错误且无同类纠正，多轮工具调用会破坏上下文，推理成本随回合数线性增长。我们介绍ParaVT，这是首个多智能体端到端强化学习训练的并行视频工具调用框架，能够在一次回合内调度多个时间窗口裁剪，以实现更清晰的上下文和更好的容错能力。然而，将标准强化学习应用于ParaVT揭示了一个我们称之为“工具先验悖论”的障碍：实现工具探索的预训练工具先验，同时也会破坏冷启动结构格式，并暴露出温度采样下跳过工具奖励捷径。在较弱的先行LMM上进行跨模型对比支持这一说法：格式保持稳定，但强化学习不引发工具调用，表明先验强度是格式崩溃和工具探索的共同驱动力。我们提出了PARA-GRPO（解析锚定与比例门控GRPO），它通过两种互补机制补充了标准强化学习：（i）仅在结构性代币最易崩溃的位置施加有针对性的奖励;（ii）按提示词帧预算随机化，生成训练提示，调用工具时能获得可测量的奖励信号，而非跳过。在六个长视频理解基准中，ParaVT相较Qwen3-VL基线平均提升+7.9%，PARA-GRPO将训练时间格式合规性从0.13提升至0.64。随着工具能力在现代LMM中日益内化，强化学习必须与生成的先验配合，ParaVT提供了代理强化学习的通用配方。代码、数据和模型权重均公开。
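
The contrast between one tool call per turn and several crops dispatched in a single turn can be sketched with a thread pool. This is only an illustration of parallel dispatch: `crop_clip` is a stand-in that returns a string rather than video frames, and the function names and window values are hypothetical rather than taken from ParaVT.

```python
from concurrent.futures import ThreadPoolExecutor


def crop_clip(window):
    """Stand-in for a video-cropping tool; returns a description only."""
    start, end = window
    return f"frames[{start:.0f}s-{end:.0f}s]"


def dispatch_parallel(windows, max_workers=4):
    """Issue all crop calls in one turn and collect results in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(crop_clip, windows))


windows = [(0, 30), (120, 150), (300, 330)]  # time windows chosen by the model
observations = dispatch_parallel(windows)
```

All three observations come back in the same turn, so a single poorly chosen window sits alongside two others instead of being the only evidence the model sees.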

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

ConceptSeg-R1：通过元强化学习分割任意概念

Authors: Yuan Zhao, Youwei Pang, Jiaming Zuo, Wei Ji, Kailai Zhou, Bin Fan, Yunkang Cao, Lihe Zhang, Xiaofeng Liu, Huchuan Lu, Weisi Lin, Dacheng Tao, Xiaoqi Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20385
Pdf link: https://arxiv.org/pdf/2605.20385
Abstract Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.
中文摘要 近期可提示分割的进展使视觉感知从对象层面定位转向概念层面理解。然而，概念这一概念仍未明确明确，因此尚不清楚当前方法是否真正超越范畴识别范围。本研究通过三层分类法形式化了广义概念分割，该分类法由上下文无关（CI）、上下文依赖（CD）和上下文推理（CR）概念组成，揭示了认知复杂度不断提升中能力差距的明显差距。为应对这一挑战，我们提出了ConceptSeg-R1，一个统一框架，将概念分割重新表述为规则驱动的概念基础。我们方法的核心是Meta-GRPO，这是一种元强化学习机制，通过视觉演示学习可转移的任务规则，并通过代理推理进行验证。推断出的推理状态随后通过轻量级概念翻译模块转换为适合分割的概念提示，实现对目标图像的演绎应用。捷径路由策略进一步保留了分割模型在简单情况下的原生效率。为了系统评估广义概念分割，我们在涵盖自然、工业、医学及推理密集等多领域CI、CD和CR概念分割基准中进行了广泛实验。在没有花哨功能的情况下，ConceptSeg-R1 在整个概念层级中实现了强大的性能，同时保持了可提示分割骨干的原生能力。作为任何概念分割的第一步，我们希望ConceptSeg-R1能作为推动从对象层级预测向概念层理解的切分的实用基线。

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

用于大型语言模型强化学习的MXFP4量化误差分解：可约偏差、可恢复死区和不可约底

Authors: Xiaocan Li, Shiliang Wu, Zheng Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20402
Pdf link: https://arxiv.org/pdf/2605.20402
Abstract MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet its quantization error causes severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, which obscures the distinct mechanisms by which it damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that target RL failure modes but are not component-exclusive: Macro-block scaling reduces scale bias; Outlier Fallback recovers deadzone entries and also partially reduces scale-bias-induced error; and Adaptive Quantization Noise (AQN) controls the policy entropy. On the dense Qwen2.5-3B and the mixture-of-experts Qwen3-30B-A3B-Base models, the targeted corrections recover BF16 accuracy to within 0.7% and 3.0%, respectively.
中文摘要 MXFP4 算术可以显著加速大型语言模型（LLM）训练后的强化学习（RL），但量化误差会严重降低准确率。现有研究将量化误差视为单一噪声项，忽略了量化误差如何损害训练的独特机制。我们证明了量化误差的精确三因子分解，并展示了每个组成部分如何主导不同的强化学习训练路径。我们的理论和实证分析将MXFP4量化误差分解为三个加法成分：“尺度偏差”来自二的幂四舍五入，“死区截断”来自于归零小值，以及“栅格噪声”，来自于将四舍五入到最近的4位网格。每个组件都主导着不同的强化学习失效模式：尺度偏置通过后向传递乘数累积，影响梯度精度;死区截断会降低推广质量;而网格噪声则提高了政策的熵。我们结合了针对强化学习失败模式但非组件专属的修正方法：宏块缩放以减少尺度偏置，异常值回溯恢复死区条目，同时部分降低尺度偏差诱导的误差，以及自适应量化噪声（AQN）用于控制策略熵。在Qwen2.5-3B密度和Qwen3-30B-A3B-Base专家混合模型中，针对性修正分别将BF16的准确率恢复到0.7%和3.0%以内。
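
The three-way split described in this abstract can be illustrated with a small sketch. The paper's exact definitions are not given in the abstract, so the decomposition below is an assumption: it compares quantization under an MX-style power-of-two block scale against a hypothetical real-valued scale (`amax / 6`), and labels the difference "scale bias", values flushed to zero "deadzone", and the remaining rounding error "grid noise". The function names are illustrative only.

```python
import math

# Non-negative magnitudes representable by FP4 E2M1.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_grid(v):
    """Round one scaled value to the nearest FP4 magnitude, saturating at 6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)

def quantize(block, scale):
    """Quantize a block with one shared scale and map it back to real values."""
    return [round_to_grid(x / scale) * scale for x in block]

def decompose(block):
    """Split the quantization error of one block into three additive parts.

    Assumption: this split is illustrative, not the paper's definition.
    The block must contain at least one non-zero value.
    """
    amax = max(abs(x) for x in block)
    ideal_scale = amax / 6.0  # hypothetical real-valued reference scale
    pow2_scale = 2.0 ** (math.floor(math.log2(amax)) - 2)  # power-of-two scale
    q_pow2 = quantize(block, pow2_scale)
    q_ideal = quantize(block, ideal_scale)
    scale_bias, deadzone, grid_noise = [], [], []
    for x, qp, qi in zip(block, q_pow2, q_ideal):
        scale_bias.append(qp - qi)  # effect of forcing a power-of-two scale
        is_dead = (qi == 0.0 and x != 0.0)
        deadzone.append(-x if is_dead else 0.0)  # small value flushed to zero
        grid_noise.append(0.0 if is_dead else qi - x)  # ordinary rounding error
    total = [qp - x for qp, x in zip(q_pow2, block)]
    return total, scale_bias, deadzone, grid_noise

total, scale_bias, deadzone, grid_noise = decompose([0.02, -0.7, 1.3, 5.0, -0.001, 2.4])
```

By construction the three lists sum element-wise to `total`, which is the property that makes the decomposition "exact" in this toy setting.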

Spectral Souping: A Unified Framework for Online Preference Alignment

Spectral Souping：在线偏好对齐的统一框架

Authors: Yinlam Chow, Guy Tennenholtz, Ted Yun, James Harrison, Arthur Gretton, Andre Barreto, Bo Dai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20408
Pdf link: https://arxiv.org/pdf/2605.20408
Abstract Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce Spectral Souping, a unified framework for efficient, online preference alignment. Our contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently "soups" these policies at inference time, either by merging their outputs or parameters, enabling rapid model adaptation without the need for costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.
中文摘要 来自人类反馈的强化学习（RLHF）有效地将大型语言模型（LLMs）与人类的整体偏好对齐，但往往未能满足个体用户多样化且相互冲突的需求。为解决这一问题，我们引入了Spectral Souping，一个高效的在线偏好对齐统一框架。我们的贡献是发现了LLMs中的通用谱表示，该表示已被证明高度适合模型合并。这一理论洞见使得两阶段方法论成为可能：我们首先学习离线的专业政策基础，每个政策都聚焦于一个独特且细致的偏好维度。在线适应算法随后在推理时高效地“搅拌”这些策略，无论是合并输出还是参数，从而实现快速模型适应，无需昂贵的在线再训练，提供定制化的偏好奖励。在线偏好对齐基准测试的实验表明，我们的方法相比现有最先进的方法实现了显著的性能提升，提供了一个可扩展且计算高效的解决方案，能够动态适应用户偏好。

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

OSCToM：高阶心智理论中的强化学习引导对抗生成

Authors: Sharmin Sultana Srishty, Kazi Mahathir Rahman, Malaika Parizat Sakkhi, Samia Shahid Prianna, Shaikhul Islam Sinat
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20423
Pdf link: https://arxiv.org/pdf/2605.20423
Abstract Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at this https URL.
中文摘要 大型语言模型（LLMs）在许多语言任务中表现良好，但它们的心智理论（ToM）推理在复杂的社会环境中仍然不均衡。现有基准测试，包括ExploreToM，并不总是测试使这些设置变得困难的递归信念和信息不对称性。本文提出了OSCToM（观察者-自我冲突心智理论），这是一种用于建模基于LLM的ToM任务中嵌套信念冲突的方法。关键情况是观察者对另一主体的看法与其自身信念状态发生冲突。此类情况超越了简单的视角取向，需要递归的多层次推理。OSCToM结合了强化学习（RL）、一种扩展的领域特定语言和组合代理模型，以生成观察者与自我冲突。在我们的实验中，OSCToM-8B在测试系统中整体表现最佳。它在FANToM上报告的ExploreToM结果有所提升，并且在Hi-ToM和BigToM上依然具有竞争力。在信息非对称FANToM基准测试中，OSCToM的准确率达到76%，而ExploreToM报告的准确率为0.2%。数据综合过程的效率也提升了6倍，表明有针对性训练数据可以帮助小型模型处理高级认知推理。项目代码可在该 https URL 访问。

Reinforcing Human Behavior Simulation via Verbal Feedback

通过口头反馈强化人类行为模拟

Authors: Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.20506
Pdf link: https://arxiv.org/pdf/2605.20506
Abstract Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
中文摘要 人类通过言语反馈学习社会规范和行为（例如，父母说“那很无礼”，或者朋友解释“这就是为什么那会让人受伤”）。然而，LLMs的反馈学习主要集中在代码和数学等领域，因为强化学习的奖励可以直接验证并浓缩成标量值。随着LLM越来越多地被用来模拟人类行为，例如代替用户、患者、学生及其他角色，迫切需要让它们更具人性化，这需要拥抱一种根本不同的信号类型：口头、主观且多方面的反馈。我们介绍DITTO，这是一个通过将口头反馈视为强化学习中一类信号来训练的模型。每次推广后，DITTO都会收到口头反馈，并生成反馈条件的改进版;两个输出均与GRPO共同优化，将口头指导提炼进基础策略，无需测试时反馈。我们还介绍了SOUL（模拟人体行为健身房），这是一个统一的基准和训练数据集，涵盖六大类别的10个任务：心智理论、角色扮演、社交技能、学习者模拟、用户模拟和人格模拟。DITTO相较基础模型平均提升36%，并在10个SOUL基准测试中有6个超过GPT-5.4，表明基于语言反馈的强化学习是训练大型语言模型模拟人类行为的有前景方向。

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

在LLM后训练中通过logit平均以SFT补充强化学习

Authors: Xingwei Gan, Ying Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20555
Pdf link: https://arxiv.org/pdf/2605.20555
Abstract We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve Kullback-Leibler (KL) regularization or a critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show accuracy higher than, or at least comparable to, that of the canonical KL-regularized GRPO.
中文摘要 我们引入了一种新颖方法，将冻结参考策略（例如SFT）和可训练策略的logits平均，并将其纳入Group Relative Policy Optimization（GRPO）。与可验证奖励强化学习（RLVR）方法不同，我们的提案不涉及Kullback Leibler（KL）正则化或批判;可训练策略和参考锚点通过logit平均结构耦合，利用可训练策略的推理专长，同时保持SFT的格式优势。我们的方法在MATH、cn-k12和MMLU上进行了评估，结果显示相对于规范的KL正则化GRPO具有更高的准确性或至少相当的准确性。
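
A minimal sketch of the logit-averaging idea, assuming a single next-token distribution and a mixing weight of 0.5 (the `weight` parameter and the function names are hypothetical; the abstract does not state how the average is weighted). In a real setup the reference logits would come from the frozen SFT model and gradients would flow only through the trainable logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def averaged_policy(ref_logits, train_logits, weight=0.5):
    """Token distribution from a weighted average of reference and trainable logits."""
    mixed = [(1.0 - weight) * r + weight * t
             for r, t in zip(ref_logits, train_logits)]
    return softmax(mixed)

ref_logits = [2.0, 0.5, -1.0]    # frozen reference (e.g., SFT) logits, toy values
train_logits = [0.0, 1.5, 0.5]   # trainable policy logits, toy values
probs = averaged_policy(ref_logits, train_logits)
```

With equal weights, the resulting distribution is proportional to the geometric mean of the two models' token probabilities, which is one way to see why the reference acts as an anchor without an explicit KL penalty.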

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax：一款基于 GPU 的加速麻将模拟器，用于 JAX 强化学习

Authors: Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20577
Pdf link: https://arxiv.org/pdf/2605.20577
Abstract Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.
中文摘要 立直麻将是一款多人、非完美信息游戏，特点是随机性和高维状态空间。这些属性带来了独特的挑战组合，反映了强化学习中复杂的现实决策问题。虽然以往研究高度依赖于从人类游戏日志中进行监督学习来预训练策略，但能够从零开始（tabula rasa）学习的算法在通用应用方面具有更大的潜力，这一点从AlphaZero谱系中可见一斑。为促进此类研究，我们引入了 Mahjax，一个用 JAX 实现的全向量化立直麻将环境，以实现图形处理单元（GPU）上的大规模并行展开。我们还提供高质量的可视化工具，简化调试以及与训练好的智能体的交互。实验结果表明，Mahjax 在八块NVIDIA A100 GPU上，在无红和有红规则下分别实现高达每秒200万步和每秒100万步的吞吐量。此外，我们验证了环境在强化学习中的效用，证明智能体可以被有效训练以提升其相对于基线策略的排名。

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

带有潜在类比的组合转导，用于离线目标条件强化学习

Authors: Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20609
Pdf link: https://arxiv.org/pdf/2605.20609
Abstract Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: this https URL
中文摘要 在离线目标条件强化学习（GCRL）中，在新颖的上下文变体下，组合泛化对于实现看不见的目标至关重要，因为在这些变化中，必须从有限的数据中学习通用的目标达成代理。大多数以往方法通过轨迹缝合在时间相连的片段上进行，这限制了在不同上下文中的作曲行为。为克服这一限制，我们将类比转导形式化为通过与给定上下文合成任务内生类比来综合新计划，并提出了专门针对该类比的新类比表示。基于我们的理论，这种类比表示捕捉了在最优任务执行下的变化，且不受上下文变化影响，足以实现最佳目标达成。我们进一步认为，推广到看不见的类比-上下文对是类比传导中的实际障碍，并引入了一种离线GCRL的新方法，使类比转导能够超越可见对到看不见的组合。我们通过实证证明了该方法在OGBench操作环境中的有效性，显著优于以往不进行类比转导的方法。项目页面：此 https URL

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

面向制造的设计：用于航空发动机自由形态管路布线的可制造性知识集成强化学习框架

Authors: Caicheng Wang, Zili Wang, Shuyou Zhang, Yongzhe Xiang, Zheyi Li, Liangyou Li, Jianrong Tan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.20644
Pdf link: https://arxiv.org/pdf/2605.20644
Abstract Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.
中文摘要 制造设计在先进航空发动机开发中起着关键作用，复杂零部件需要对制造性进行仔细考虑。然而，当前管道布线实践仍与下游制造脱节，导致为了实现可制造设计，需要大量劳动和反复试验。为解决这一问题，本研究提出了基于Frenet的管道布线优化（FPRO）框架，这是一种用于航空发动机自由形态管道设计的制造性知识集成强化学习方法。FPRO 将路由问题表述为 Frenet 框架中的边界值问题。在该框架中，管道路径由曲率和扭转轮廓表示，这些曲线通过立方赫米特插值生成。为了整合设计与制造，领域特定的制造知识被嵌入为对允许曲率和挠转范围的约束。路径优化采用近端策略优化算法，结合随机探索和阶段引导奖励机制。统一的映射公式随后将优化路径转换为弯曲模具的运动轨迹，从而实现在六轴自由弯曲机上的直接制造。实验结果表明，与基于笛卡尔的方法相比，FPRO始终能够生成无碰撞、可制造且几何轮廓更为光滑的路径。与最先进的强化学习基线相比，它在终端对齐、路径长度、障碍物避让和制造性方面实现了更快的收敛和优越性能。实际验证确认了制造管道与其数字设计之间的密切几何对应关系，验证了FPRO的实际可行性。
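
The curvature profile mentioned in this abstract can be sketched with the standard cubic Hermite basis. The node positions, node values, slopes, and the symmetric bound `k_max` below are hypothetical, and clamping is only one possible way to enforce a permissible curvature range; the abstract does not say how the constraint is applied.

```python
def hermite(p0, m0, p1, m1, s0, s1, s):
    """Cubic Hermite interpolation on [s0, s1] with end values p0, p1 and slopes m0, m1."""
    h = s1 - s0
    t = (s - s0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * h * m0 + h01 * p1 + h11 * h * m1

def curvature_profile(nodes, values, slopes, s, k_max):
    """Curvature at arc length s, clamped to [-k_max, k_max] (assumed constraint form)."""
    for i in range(len(nodes) - 1):
        if nodes[i] <= s <= nodes[i + 1]:
            k = hermite(values[i], slopes[i], values[i + 1], slopes[i + 1],
                        nodes[i], nodes[i + 1], s)
            return max(-k_max, min(k_max, k))
    raise ValueError("s lies outside the profile")

nodes = [0.0, 1.0, 2.0]     # arc-length positions of the control nodes
values = [0.0, 0.5, 0.2]    # curvature at each node
slopes = [0.0, 0.0, 0.0]    # curvature derivative at each node
samples = [curvature_profile(nodes, values, slopes, 0.1 * i, k_max=0.4)
           for i in range(21)]
```

In a reinforcement learning setting of the kind described, the node values and slopes would be natural candidates for the action space, since a handful of numbers defines a smooth profile along the whole pipe.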

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

REFLECTOR：内化逐步反思以抵御间接越狱

Authors: Jiachen Ma, Jiawen Zhang, Xiangtian Li, Bo Zou, Chaochao Lu, Chao Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20654
Pdf link: https://arxiv.org/pdf/2605.20654
Abstract While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
中文摘要 尽管大型语言模型（LLMs）展现了卓越的能力，但它们仍易受到复杂、多步越狱攻击的影响，这些攻击通过利用内部生成过程绕过了传统的表面安全对齐。为解决这些漏洞，我们提出了Reflector，这是一个原则性的两阶段框架，将自我反思融入世代发展轨迹中。Reflector 首先利用教师引导生成技术，生成高质量的反射数据，用于监督微调（SFT），建立结构化的反射模式。随后，它采用强化学习（RL）结合结果驱动和奖励有效性监督，赋予强大自主的自我反思能力。实证结果显示，Reflector 在应对复杂间接攻击时的防御成功率（DSR）超过90%，同时在多种威胁场景中具有强有力的泛化能力。值得注意的是，该框架提升了任务特定性和通用实用性，GSM8K性能提升了5.85%，并在知识密集基准测试中表现有所提升。通过内部化轨迹级安全性，Reflector 克服了表面对准的根本局限，且不增加计算开销，为开发安全且具备能力的大型语言模型提供了高效且可扩展的解决方案。

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

IndusAgent：用代理工具强化开放词汇工业异常检测

Authors: Rongbin Tan, Fangfang Lin, Zhenlong Yuan, Min Qiu, Kejin Cui, Mengmeng Wang, Yi Wang, Zijian Song, Zhiyuan Wang, Jiyuan Wang, Yue Wang, Shuhan Song, Huawei Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.20682
Pdf link: https://arxiv.org/pdf/2605.20682
Abstract Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.
中文摘要 多模态大型语言模型（MLLMs）在连接视觉感知与文本推理方面展现出卓越能力，实现了在多样工业场景中的零样本理解。然而，它们在开放词汇工业异常检测（IAD）中的表现常受限于领域错位推理和幻觉式结构推断。为应对这些挑战，我们提出了 IndusAgent，一个面向开放词汇IAD的工具增强智能体框架。具体来说，我们首先构建了 Indus-CoT，这是一个结构化数据集，集成了全局视觉观测、高分辨率局部图像块和专家常态性先验，为模型在严格工业检测轨迹上的微调提供监督。基于此，IndusAgent动态协调一套外部工具，包括动态区域裁剪、高频特征增强和先验检索，从而使智能体能够主动解决视觉歧义并分辨细微异常。此外，我们引入了门控强化学习目标，共同优化异常分类、定位准确性、异常类型推理和高效工具使用，确保工具调用仅在有利时进行。对包括MVTec-AD、VisA、MPDD、DTD和SDD在内的五个工业异常基准测试进行了广泛评估，表明IndusAgent在所有现有方法中实现了最先进的零样本性能，验证了我们的鲁棒性和泛化能力。

Distributed Direct Preference Optimization

分布式直接偏好优化

Authors: Zhanhong Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20696
Pdf link: https://arxiv.org/pdf/2605.20696
Abstract Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.
中文摘要 基于偏好的强化学习（RL）是政策与人类判断对齐的关键范式，但在偏好数据分散于异质用户中的分布式环境中，其理论行为仍难以理解。直接偏好优化（DPO）避免显式奖励建模，但在联邦和去中心化训练中缺乏收敛保证，因为通信约束和非IID偏好从根本上改变优化动态。我们首次在分布式环境中进行了DPO的收敛和时间复杂度分析。通过建模个性化离线强化学习，并结合用户特定偏好分布，我们描述了诱导的全局优化景观。对于联邦DPO，我们推导收敛率，以量化客户漂移、沟通频率和偏好异质性的影响;对于去中心化DPO，我们建立了一般通信图的收敛性，展示了频谱连通性如何决定优化速度和共识。通过实证，我们验证了标准对齐基准的理论见解，证明我们提出的方法不仅具有强有力的理论保证，还在实践中提供了稳健且可扩展的性能。代码库可以在这里获取。
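
The claim that spectral connectivity governs consensus speed can be illustrated with plain gossip averaging. This is a toy sketch under stated assumptions: each node holds one scalar parameter, the mixing matrices are hypothetical choices (uniform weights on a ring and on a complete graph), and the local DPO gradient step that a real decentralized method would interleave is omitted.

```python
def mix(values, weights):
    """One communication round: every node averages its neighbours' values."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]

def ring_weights(n):
    """Doubly stochastic mixing matrix for a ring (requires n >= 3)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] += 1.0 / 3.0
    return w

def complete_weights(n):
    """Doubly stochastic mixing matrix for a fully connected graph."""
    return [[1.0 / n] * n for _ in range(n)]

def consensus_error(values):
    """Largest deviation of any node from the network average."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)

def run(values, weights, rounds):
    """Apply the mixing step for a fixed number of rounds."""
    for _ in range(rounds):
        values = mix(values, weights)
    return values

start = [float(i) for i in range(8)]   # one toy parameter per node
ring_err = consensus_error(run(start, ring_weights(8), 5))
full_err = consensus_error(run(start, complete_weights(8), 5))
```

The sparsely connected ring keeps a noticeable disagreement after the same number of rounds in which the complete graph has already agreed, which is the qualitative effect the abstract attributes to spectral connectivity.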

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

AGPO：带有双重统计反馈的自适应组策略优化

Authors: Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20722
Pdf link: https://arxiv.org/pdf/2605.20722
Abstract Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at this https URL.
中文摘要 强化学习提升了LLM推理能力，但PPO/GRPO通常使用固定的裁剪范围和解码温度，这使得训练变得脆弱且调参负担较重。我们提出了自适应组策略优化（AGPO），这是一种无需critic的GRPO改进，利用组级统计量同时控制更新幅度和探索。AGPO使用共享的、由探针导出的统计状态驱动两个控制器：（i）自适应裁剪，根据奖励离散度和偏度、探针投票熵、策略熵以及逐步KL漂移设定信任域大小;以及（ii）双向自适应温度采样，根据相对于运行基线的中心化不确定性，围绕基础温度对解码进行升温或降温。在九项英语和中文数学/STEM基准测试中，使用AGPO训练的Qwen2.5-14B在相同生成token预算下优于PPO/GRPO，在GSM8K上达到67.3%，在MATH上达到40.5%。增益可迁移至Llama-3-8B和Gemma-2-9B，消融实验证实两个模块互补。我们的实现可在此 https URL 公开获取。
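
For context, the fixed-clipping baseline that this abstract contrasts against can be sketched as follows. This shows the commonly used GRPO-style group-normalized advantage and the PPO-style clipped term with a constant `clip_eps`; it is not AGPO itself, whose adaptive controllers are not specified in enough detail here to reproduce. The function names and the default values are illustrative.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token, with a fixed clipping range."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # toy verifier rewards
advantages = group_advantages(rewards)
objective = [clipped_term(1.3, a) for a in advantages]
```

An adaptive variant of the kind the abstract describes would replace the constant `clip_eps` with a value computed from group statistics such as the reward dispersion.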

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

分布感知奖励：面向LLM回归的基于预测分布的强化学习

Authors: Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.20740
Pdf link: https://arxiv.org/pdf/2605.20740
Abstract Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
中文摘要 大型语言模型可以从文本、代码和分子字符串等异构输入预测实值量，但大多数训练目标会独立对每个解码浮点数进行评分，从而在不确保预测分布校准的情况下提升点估计。这限制了需要候选人排名或不确定性估计的申请。我们引入了分布感知奖励，这是一种基于策略的强化学习目标，其主要贡献是训练语言模型，使其为回归任务产生更好的预测分布，而不仅仅是针对标量目标优化单个解码输出。我们的方法将多个解码样本视为实证预测分布，采用连续排序概率评分，并根据每次推广对分布质量的边际贡献分配“省略一”的功劳，奖励既准确又适当分散的预测。我们在受控高斯混合任务、代码性能预测和SMILES字符串分子性质预测上评估了我们的方法。在各任务中，我们的方法优于监督微调和点状强化学习基线，排名相关性显著提升，包括KBSS上的Spearman提升了6分。在 MoleculeNet 上，它仅使用 SMILES 字符串，但仍能与强大的基于图表和三维分子模型竞争。进一步分析显示，我们的方法减轻了推广多样性崩溃，并提升了不确定性诊断，表明直接优化预测分布使语言模型回归更稳健、校准更佳。
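
The scoring step described in this abstract can be sketched with the standard sample-based estimator of the Continuous Ranked Probability Score. The leave-one-out credit below (score without a sample minus score with it) is an assumed form; the abstract does not give the exact credit formula, and the function names are illustrative.

```python
def crps(samples, target):
    """Empirical CRPS of a set of decoded values against one target (lower is better)."""
    n = len(samples)
    accuracy = sum(abs(x - target) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (2.0 * n * n)
    return accuracy - spread

def leave_one_out_credit(samples, target):
    """Credit per rollout: how much the CRPS worsens when that rollout is removed.

    Assumed form of the credit; requires at least two samples.
    """
    full = crps(samples, target)
    credits = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        credits.append(crps(rest, target) - full)
    return credits

rollouts = [1.0, 2.0, 3.0, 10.0]   # toy decoded predictions for one prompt
credits = leave_one_out_credit(rollouts, target=2.0)
```

In this toy group the outlier at 10.0 receives negative credit, while the samples near the target receive positive credit, so the signal rewards rollouts that are both accurate and reasonably spread rather than a single collapsed value.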

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Q-SpiRL：自适应机器人导航的量子尖峰强化学习

Authors: Mohamed Khair Altrabulsi, Nouhaila Innan, Alberto Marchisio, Muhammad Kashif, Muhammad Shafique
Subjects: Robotics (cs.RO); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2605.20801
Pdf link: https://arxiv.org/pdf/2605.20801
Abstract Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
中文摘要 动态环境中的自适应机器人导航需要能够可靠到达目标，同时实现高效且稳定轨迹的策略。本文介绍了Q-SpiRL，一种用于障碍感知机器人导航的量子尖峰强化学习框架。该框架开发并评估了五个代理家族：表格Q学习、经典MLP、经典SNN、量子增强MLP（QMLP）和量子增强尖峰神经网络（QSNN）。虽然所有模型均在统一的训练和评估流水线下实现，但QSNN是核心架构，因为它结合了基于尖峰的时间处理与变分量子特征变换。实验在三个大小逐渐增加的网格世界环境中进行，分别是20x20、30x30和40x40，包含静态和动态障碍。性能通过成功率、成功加权路径长度、路径长度和确定性推断下的转弯率进行评估。结果显示，QSNN在任务完成率、轨迹效率和运动平滑性之间实现了最强的整体权衡，在最具挑战性的环境中，成功率高达99%，同时保持高路径效率。在IBM量子硬件上的执行进一步展示了在真实设备条件下部署该混合策略的可行性。
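
The abstract lists success-weighted path length among its metrics without defining it. A minimal sketch, assuming the definition commonly used in embodied navigation (success indicator times shortest path length divided by the larger of the taken and shortest lengths, averaged over episodes); the episode tuples below are made-up numbers, not results from the paper:

```python
def success_weighted_path_length(episodes):
    """Average of success * shortest / max(taken, shortest) over episodes.

    episodes: list of (success, shortest_len, taken_len) tuples (hypothetical layout).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode scores 1.0 only if it followed a shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Illustrative episodes: optimal success, detoured success, failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
spl = success_weighted_path_length(episodes)
print(spl)
```

Under this definition a failed episode contributes zero regardless of how short its path was, which is why the metric is reported alongside the plain success rate.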

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

DPO和RLHF的条件等价性：隐性假设、失效模式与可证明对齐

Authors: Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20834
Pdf link: https://arxiv.org/pdf/2605.20834
Abstract Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: this https URL.
中文摘要 直接偏好优化（DPO）已成为人类反馈强化学习（RLHF）的流行替代方案，在理论上等价且实现更简单。我们证明这种等价性是有条件的，而非普遍的，它依赖于一个在实践中经常被违反的隐性假设：RLHF最优策略必须偏好人类偏好的回答。当这一假设失效时，DPO优化的是相对于参考策略的相对优势，而非与人类偏好的绝对对齐，从而导致病态收敛，即策略在降低DPO损失的同时却偏好不被偏好的回答。我们刻画了该假设何时被违反，证明存在一个不理想的解空间，并证明DPO和RLHF在此类情况下优化的是根本不同的目标。为此，我们引入了约束偏好优化（CPO），在RLHF基础上加入约束以实现可证明的对齐。我们进一步通过软间隔排序给出了几何解释，揭示DPO实现的是目标可能为负的间隔排序。我们的理论分析确定了DPO的保证在何时成立，并给出了兼具简洁性与可证明对齐的解决方案。在标准基准上的全面实验表明，CPO实现了最先进的性能。代码可在以下 https URL 获取。
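
The failure mode described in the abstract can be illustrated numerically. The sketch below uses the standard DPO loss from the DPO literature (it is not quoted from this abstract), and all probabilities are hypothetical values chosen so that the reference policy strongly favours the dispreferred response:

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred w, dispreferred l) pair."""
    # Margin is measured relative to the reference policy, not in absolute terms.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical reference policy: 10% on the preferred, 90% on the dispreferred response.
ref_w, ref_l = math.log(0.1), math.log(0.9)
loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Hypothetical trained policy: improved relative to the reference, still prefers l.
pol_w, pol_l = math.log(0.3), math.log(0.7)
loss_after = dpo_loss(pol_w, pol_l, ref_w, ref_l)

print(loss_at_ref, loss_after, pol_l > pol_w)
```

In this toy case the loss falls below its starting value of log 2 even though the policy still assigns more probability to the dispreferred response, which is the kind of gap between relative and absolute preference that the abstract points to.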

Finite-Time Regret Analysis of Retry-Aware Bandits

重试感知多臂老虎机的有限时间遗憾分析

Authors: Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20854
Pdf link: https://arxiv.org/pdf/2605.20854
Abstract We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.
中文摘要 我们研究了一种随机多臂老虎机算法，其动机来自重试感知目标，这类目标看重多次尝试中的最佳结果，如pass@$k$和max@$k$。给定臂值的后验分布，ReMax选择一个采样分布，使$M$次虚拟抽取的后验期望最大奖励最大化。尽管该目标作为不确定性下的探索机制被引入强化学习，但其在老虎机问题中的遗憾性质仍不明确。对于高斯奖励和第一个非平凡情况$M=2$，我们通过期望改进平衡条件刻画最优ReMax分布，并证明了ReMax的首个次线性遗憾界。我们的分析将次优臂的常见饱和行为与ReMax特有的低估效应区分开来，后者指在出现不利估计后，最优臂可能被采样过少。这解释了为什么ReMax可能比Thompson采样（TS）更偏向利用，以及为什么其遗憾分析在技术上较为精细。实验支持这一观点：ReMax在轻度低估下常常优于KL-UCB和Thompson采样，而后验方差缩放在实证上缓解了严重低估。

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

PlexRL：RLVR 服务化 LLM 执行的集群级编排

Authors: Yiqi Zhang, Fangzheng Jiao, Tian Tang, Boyu Tian, Hangyu Wang, Qiaoling Chen, Guoteng Wang, Zhen Jiang, Peng Sun, Ping Zhang, Xiaohe Hu, Ziming Liu, Menghao Zhang, Yanmin Jia, Yang You, Siyuan Feng
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20863
Pdf link: https://arxiv.org/pdf/2605.20863
Abstract Reinforcement learning with verifiable rewards (RLVR) has recently unlocked strong reasoning capabilities in large language models (LLMs), triggering rapid exploration of new algorithms and data. However, RLVR training is notoriously inefficient: long-tailed rollouts, tool-induced stalls, and asymmetric resource requirements between rollout and training introduce substantial idle time that cannot be eliminated by job-local optimizations such as synchronous pipelining, asynchronous rollout, or colocated execution. We argue that this inefficiency is structural. While idle gaps are unavoidable within individual RLVR jobs, they are largely anti-correlated across jobs and therefore exploitable at the cluster level. Leveraging this observation, we present PlexRL, a cluster-level runtime for multiplexing unified LLM services across RLVR jobs. By centrally managing model placement, state transitions, and function-level scheduling under strict affinity constraints, PlexRL time-slices LLM execution across jobs to fill otherwise idle periods without expensive model migration. Our implementation and evaluations demonstrate that PlexRL significantly improves effective cluster capacity and reduces user GPU-hour cost by up to 37.58% while preserving algorithmic flexibility and introducing minimal per-job overhead.
中文摘要 带有可验证奖励的强化学习（RLVR）最近在大型语言模型（LLMs）中解锁了强大的推理能力，促使人们快速探索新算法和数据。然而，RLVR训练以效率低著称：长尾部署、工具驱动的停滞以及部署与培训之间的资源需求不对称，会带来大量空闲时间，这些空闲时间无法通过同步流水线、异步部署或共址执行等作业本地优化来消除。我们认为这种低效是结构性的。虽然在单个RLVR岗位内空闲缺口是不可避免的，但这些空隙在各个岗位间大多是反相关的，因此在集群层面容易被利用。基于这一观察，我们介绍了PlexRL，一个集群级运行时，用于跨RLVR作业复用统一LLM服务。通过集中管理模型布局、状态转换和函数级调度，在严格亲和力约束下，PlexRL 对 LLM 执行进行时间切片，填补闲置期，避免昂贵的模型迁移。我们的实现和评估表明，PlexRL显著提升了有效集群容量，并将用户GPU小时数降低了最多37.58%，同时保持了算法灵活性，并引入了最小的每作业开销。

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

多步似然比修正用于可验证奖励的强化学习

Authors: Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20865
Pdf link: https://arxiv.org/pdf/2605.20865
Abstract Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.
中文摘要 带有可验证奖励的强化学习（RLVR）在提升大型语言模型的推理能力方面起着关键作用。然而，广泛使用的PPO替代目标本质上是局部的，因为它们依赖于对精确策略梯度目标的局部近似。虽然这种近似通过降低重要性采样引起的方差来提升稳定性，但也为替代目标引入了结构性偏差，而这必须通过信任域机制进行控制。在本研究中，我们引入了$N$步前向追踪，利用接下来$N-1$个词元的累计似然比来增强PPO替代目标。基于这一理念，我们提出了$N$步前向追踪策略优化（NFPO），这是一种实用的RLVR算法，将$N$步前向追踪整合进掩蔽策略梯度框架。NFPO在PPO替代目标与精确策略梯度目标之间架起了连续的桥梁，为控制偏差-方差权衡提供了有原则的机制。我们的理论分析表明，在适当选择$N$的情况下，所提出的目标比标准PPO替代目标给出更紧的策略改进界。在综合推理基准上的实验表明，NFPO能够持续提升性能，支持了我们的理论发现。
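
The abstract describes the $N$-step forward trace only as the cumulative likelihood ratio of the next $N-1$ tokens. The sketch below illustrates that one quantity under an assumed reading (a product of per-token ratios over the following $N-1$ positions, truncated at the end of the sequence). How NFPO combines it with masking or clipping is not stated in the abstract and is not modelled here; the function name and log-probabilities are hypothetical:

```python
import math


def forward_trace_weights(logp_new, logp_old, n):
    """Per-token ratios and the product of the next n-1 ratios for each position."""
    ratios = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
    length = len(ratios)
    weights = []
    for t in range(length):
        w = 1.0
        # Multiply the ratios of tokens t+1 .. t+n-1 (fewer near the sequence end).
        for k in range(t + 1, min(t + n, length)):
            w *= ratios[k]
        weights.append(w)
    return ratios, weights


# Hypothetical token log-probabilities under the new and old policies.
logp_new = [math.log(0.5), math.log(0.4), math.log(0.2)]
logp_old = [math.log(0.5), math.log(0.2), math.log(0.4)]

ratios, w1 = forward_trace_weights(logp_new, logp_old, n=1)
_, w2 = forward_trace_weights(logp_new, logp_old, n=2)
_, w3 = forward_trace_weights(logp_new, logp_old, n=3)
print(ratios, w1, w2, w3)
```

With `n=1` every weight is 1, which leaves only the per-token ratio that the usual PPO surrogate uses; larger `n` pulls in more of the downstream ratio product, consistent with the bias-variance bridge the abstract describes.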

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

ProCrit：多模态讽刺检测的自引多视角推理与批评指导修订

Authors: Yingjia Xu, Jiulong Wu, Bowen Zhang, Baokui Guo, Siyuan Chai, Min Cao
Subjects: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.20867
Pdf link: https://arxiv.org/pdf/2605.20867
Abstract Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.
中文摘要 多模态讽刺检测需要对字面表达与意图意义之间的跨模态不一致进行推理，但由于讽刺机制的多样性，不同样本所需的具体分析视角也各不相同。虽然近期方法明确了这一分析过程，但它们仍依赖于固定的预定义视角，这些视角在手工设计的路由规则下独立运作。我们认为，多模态讽刺检测则需要自我诱发多视角推理，即模型自主生成每个样本所需的视角，并逐步整合进连贯分析。为实现这一目标，我们提出了ProCrit，一个提案-批评者双代理框架，其中一个提案代理负责多视角推理，一个批评代理负责外部评估和有针对性的修订指导。首先，为了弥补现有讽刺数据集中缺乏过程级监督的问题，ProCrit 通过动态角色代理展开综合过程层级推理注释：一个强有力的视觉语言模型在共享上下文中顺序生成分析角色，产生的多角色轨迹被平整化成保持交叉视角依赖关系且实现高效自回归生成的序列。其次，为了提高推理可靠性，ProCrit采用了一种草稿-批评-修订范式，由独立批评者发现推理不足，并提供有针对性的自然语言反馈以进行指导性修订。最后，我们开发了一个互惠改进培训框架，通过双阶段强化学习共同优化提案起草和反馈引导的修订，同时根据反馈的实际效果对批评代理进行优化。对三个广泛使用的基准测试的实验证明了ProCrit的有效性。

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench：生成可扩展且可验证的规划数据，用于评估和训练大型语言模型

Authors: Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20873
Pdf link: https://arxiv.org/pdf/2605.20873
Abstract Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.
中文摘要 规划是大型语言模型（LLM）的基本能力，因为此类复杂任务需要模型将目标、约束、资源和长期后果协调成可执行且可验证的解决方案。然而，现有的规划基准通常将规划数据视为固定的实例集合，而非可控的生成目标。这限制了场景覆盖范围，将难度与表层代理而非结构性来源联系起来，并且对可扩展生成、自动验证或规划导向培训的支持有限。我们介绍了PlanningBench，一个用于生成可扩展、多样化且可验证的规划数据的框架，用于评估和培训。PlanningBench 从真实的规划场景出发，将实际工作流抽象为包含 30 多种任务类型、子任务、约束族和难度因素的结构化分类法。在该分类法的指导下，约束驱动的综合流程通过自适应难度控制、质量过滤和实例级验证清单实现自包含的规划问题。这使规划数据构建从固定基准收集转向可控生成，同时保持了任务的现实基础。我们使用 PlanningBench 评估开源和闭源的前沿大型语言模型，发现现有模型在耦合约束下仍难以产出完整解决方案。除了评估之外，基于经过验证的PlanningBench数据进行强化学习还能提升在看不见的规划基准和更广泛的指令跟踪任务中的表现。进一步分析表明，确定性或明确指定的最优解能提供更清晰的奖励信号和更稳定的训练动态。总体而言，PlanningBench 提供了一个可控的规划数据来源，用于诊断和提升大型语言模型中可推广的规划能力。

CIG: Exploration via Conditional Information Gain

CIG：基于条件信息增益的探索

Authors: Tim Joseph, Marcus Fechner, Philipp Stegmaier, Karam Daaboul, J. Marius Zöllner
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20878
Pdf link: https://arxiv.org/pdf/2605.20878
Abstract Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.
中文摘要 强化学习中用于探索的内在奖励以不同的上下文为条件：终身奖励依据累积经验为每次转移评分，但忽略了单次展开（rollout）内部的冗余；回合式奖励惩罚轨迹内的重复，但丢弃了终身进度。混合方法要么通过启发式权重结合两种信号，要么需要高斯过程动力学模型，而后者无法扩展到低维状态空间之外。轨迹级信息增益可分解为逐步项，这些项同时以回放缓冲区和展开前缀为条件，但对于深度模型仍然难以计算。我们推导出条件信息增益（CIG）奖励作为一种可计算的替代：它是定义在集成分歧核上的对数行列式目标，其Cholesky分解给出因果的逐步奖励，既保留两个条件集，又能扩展到高维状态空间。我们在基于模型的设定中实例化CIG，该设定下展开较短，且展开内的修正在很大程度上尚未被探索。在涵盖离散（MiniGrid）和连续控制（OGBench）的十二项任务中，无论是干净设定还是带随机干扰物的设定，CIG都优于或持平于以往的探索方法，同时对随机干扰物保持鲁棒。
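
The link between a log-determinant objective and causal per-step rewards rests on a general linear-algebra fact: for a positive-definite matrix with Cholesky factor $L$, the log-determinant is the sum of $2\log L_{tt}$, and each term depends only on rows $0..t$. A minimal sketch of that decomposition follows. The kernel is a hand-written toy rather than an ensemble disagreement kernel, and the `I + K / noise` form is an assumption for illustration, not the paper's stated objective:

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def per_step_logdet_terms(kernel, noise=1.0):
    """Split log det(I + K / noise) into one term per rollout step."""
    n = len(kernel)
    m = [[kernel[i][j] / noise + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    low = cholesky(m)
    # Term t uses only the leading (t+1) x (t+1) block, so it is causal.
    return [2.0 * math.log(low[t][t]) for t in range(n)]


# Toy kernel: steps 0 and 1 are fully redundant, step 2 is unrelated to both.
kernel = [[1.0, 1.0, 0.0],
          [1.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
terms = per_step_logdet_terms(kernel)
print(terms, sum(terms))
```

In this toy rollout the second step earns a smaller term than the first because it repeats what step 0 already covered, while the unrelated third step earns the full amount again.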

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

我们应该打多久？学习格斗游戏中的动作持续时间

Authors: Hoang Hai Nguyen, Kurt Driessens, Dennis J.N.J. Soemers
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.20911
Pdf link: https://arxiv.org/pdf/2605.20911
Abstract Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.
中文摘要 像《街头霸王II》这样的格斗游戏因其快速节奏、实时的特性，给强化学习（RL）代理带来了独特的挑战。在大多数强化学习框架中，代理被硬编码为在固定间隔内做出决策，通常是每帧或每N帧。虽然这种设计确保了及时的反应，但限制了代理调整反应时间的能力。每帧行动能获得帧级反射，这与真人玩家相比不真实;而更长的固定间隔降低计算成本，但会降低响应速度。我们考虑一种替代决策框架，在该框架中，智能体不仅学习采取何种行动，还能学习执行该行动的时间长度。通过共同预测动作和持续时间，智能体可以动态调整响应速度以适应游戏中的不同情境。我们利用开源的FightLadder环境实现该方法，代理针对脚本内置机器人训练，系统性测试不同的跳帧配置，分析其对性能、响应性和学习行为的影响。实验表明，学习到的时序可以匹配精心选择的固定帧跳跃性能，并鼓励可重复的动作模式，但仅靠时序并不能保证鲁棒性。在大多数情况下，我们看到代理在持续高跳帧值（即响应性低）时表现最佳。这种策略使得学习重复相同动作的剥削策略变得更容易，而脚本化机器人似乎更容易受到这些攻击。
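
Jointly choosing an action and its duration can be pictured as an environment step that repeats one action for a chosen number of frames. The sketch below uses a hypothetical `ToyEnv` and a hypothetical `step_with_duration` helper; neither is the FightLadder API, and the reward rule is invented for illustration:

```python
class ToyEnv:
    """Hypothetical stand-in environment: reward 1 per frame while action == 1."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.frame = 0

    def step(self, action):
        self.frame += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.frame >= self.horizon
        return self.frame, reward, done


def step_with_duration(env, action, duration):
    """Repeat `action` for up to `duration` frames, stopping early if the episode ends."""
    total, frames = 0.0, 0
    obs, done = None, False
    for _ in range(duration):
        obs, reward, done = env.step(action)
        total += reward
        frames += 1
        if done:
            break
    return obs, total, done, frames


env = ToyEnv(horizon=10)
# The agent's output is a pair (action, duration) instead of an action alone.
first = step_with_duration(env, action=1, duration=4)
second = step_with_duration(env, action=1, duration=8)  # only 6 frames remain
print(first, second)
```

A fixed frame skip is the special case where `duration` is a constant, so the two settings compared in the abstract differ only in whether the policy controls that argument.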

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

边说边思考：一种可控、交错的实时语音生成推理方法

Authors: Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.20946
Pdf link: https://arxiv.org/pdf/2605.20946
Abstract The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and the length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning that uses two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while responding instantly, like a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.
中文摘要 边思考边说的范式旨在让人工智能交流更具人性化。一个关键挑战是在进行深度推理的同时保持流畅的口语。我们的方法InterRS仅在自然语音生成过程中插入推理步骤来解决这个问题。这需要高质量的数据，使推理和语音精确对齐，且长度比例受到控制。我们引入了一种新颖的流水线，用于生成这种无缝交错的音频数据。为了训练模型，我们将基于精细数据的交错SFT与强化学习结合，并引入了两个新奖励：用于管理时间和思考-回答比例的TA-平衡奖励，以及用于精炼表达的语言质量奖励。实验显示，我们的方法在数学和逻辑基准测试上表现提升了13%，同时能够即时响应，类似于输出快速CoT响应的口语指令模型。此外，我们的方法生成的答案比以往方法更自然、更流畅。

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

超越贝尔曼递归：非指数折现的庞特里亚金引导框架

Authors: Hojin Ko, Jeonggyu Huh
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.20996
Pdf link: https://arxiv.org/pdf/2605.20996
Abstract Most value-based and actor-critic reinforcement learning methods rely on Bellman-style recursions, yet these recursions collapse under non-exponential discounting common in human preferences and survival processes. We show the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity, and violating either property breaks standard dynamic programming. To overcome this, we propose Pontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts via an Adjoint-MC projection enforcing pointwise Hamiltonian maximization. Across multi-dimensional hyperbolic and survival-discount benchmarks, PG-DPO improves accuracy and stability where equation-driven solvers and critic-based baselines diverge.
中文摘要 大多数基于价值的方法和演员-评论家（actor-critic）强化学习方法依赖于贝尔曼式递归，但这些递归在人类偏好和生存过程中常见的非指数折现下会失效。我们表明这种失效是结构性的：指数折现处于可乘性与时间齐次性的脆弱交汇处，违反任一性质都会破坏标准动态规划。为克服这一问题，我们提出了庞特里亚金引导直接策略优化（PG-DPO），这是一种变分框架，它放弃递归，通过伴随MC（Adjoint-MC）投影将庞特里亚金极大值原理与蒙特卡洛展开耦合，强制执行逐点哈密顿量最大化。在多维双曲折现和生存折现基准上，PG-DPO在方程驱动求解器和基于评论家的基线发散的情形下提升了准确性和稳定性。
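
The multiplicativity property mentioned in the abstract can be checked directly. A Bellman-style recursion relies on the discount weight satisfying d(s + t) = d(s) d(t), which holds for exponential discounting but not for hyperbolic discounting. The sketch below uses the common hyperbolic form 1 / (1 + k t); the parameter values are arbitrary illustrations:

```python
def exponential(t, gamma=0.9):
    """Exponential discount weight gamma ** t."""
    return gamma ** t


def hyperbolic(t, k=1.0):
    """Hyperbolic discount weight 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def multiplicative_gap(discount, s, t):
    """Zero exactly when discount(s + t) == discount(s) * discount(t)."""
    return discount(s + t) - discount(s) * discount(t)


gap_exp = multiplicative_gap(exponential, 2, 3)
gap_hyp = multiplicative_gap(hyperbolic, 2, 3)
print(gap_exp, gap_hyp)
```

The exponential gap vanishes up to floating-point rounding, while the hyperbolic gap is 1/6 - 1/12 = 1/12, so the value of a reward five steps ahead cannot be rebuilt by discounting twice.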

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

与策略的通信解耦：带宽约束下的鲁棒MARL

Authors: Alexi Canesse, Benoît Goupil, Jesse Read, Sonia Vanier
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21085
Pdf link: https://arxiv.org/pdf/2605.21085
Abstract Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce $\beta$, a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.
中文摘要 通信使多智能体强化学习（MARL）中的协调成为可能，但许多现实应用，如无人机群的搜救，在严重的带宽限制下运行。许多通信架构仍然存在耦合瓶颈，即共享潜在表示同时用于策略执行和代理间通信。因此，减少消息大小直接限制了策略的潜在空间，常常导致显著的性能下降。我们对此有两方面的贡献。首先，我们引入$\beta$，一种归一化的每个代理带宽预算，将稀疏度、回合数和消息维度统一为一个可比的约束。其次，我们提供了SLIM架构，这是一种最小化架构，将通信路径与策略的潜在表示解耦，使我们能够将带宽效应与政策容量的影响隔离开来，同时享受步入式通信。我们在多个部分可观测的MARL基准测试中评估我们的方法，其中沟通至关重要。我们的方法实现了最先进的性能，并在有限通信下展现出可扩展性和鲁棒性，带宽减少时仅有边际性能下降。

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

群体相对策略优化中的优势崩溃：诊断与缓解

Authors: Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21125
Pdf link: https://arxiv.org/pdf/2605.21125
Abstract Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at this https URL.
中文摘要 群体相对策略优化（GRPO）是可验证奖励强化学习（RLVR）框架中的一个重要算法，在提升大型语言模型（LLMs）的推理能力方面取得了显著成效。然而，GRPO容易出现优势崩溃，即群体内同质奖励（例如全部正确或全部错误答案）导致几乎为零优势且梯度消失的失败模式。为此，我们引入了优势崩溃率（ACR），这是首个量化训练批次中梯度无效比例的诊断指标。在0.5B到14B的数学推理基准模型中，我们表明ACR强烈预测训练停滞和最终表现。随后，我们提出了自适应虚拟样本策略优化（AVSPO），这是GRPO的一种轻量级扩展，通过实时ACR监测注入虚拟奖励样本，实现从同质组学习而无需额外模型推广。AVSPO相较GRPO减少了58-63%的优势崩溃，并在所有模型尺度上持续提升4-6个百分点的准确性，同时保持对评估的域外任务的泛化性。代码和数据集可在该 https URL 访问。
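
Advantage collapse follows from how group-relative advantages are normalised: when every reward in a group is identical, each reward equals the group mean and every advantage is zero. The sketch below shows this with the usual mean and standard deviation normalisation. The `advantage_collapse_rate` helper is an assumed per-group reading of ACR for illustration only; the abstract defines ACR over training batches and does not give a formula:

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def advantage_collapse_rate(groups, tol=1e-8):
    """Fraction of groups whose advantages are all (numerically) zero. Assumed definition."""
    collapsed = sum(
        1 for g in groups if all(abs(a) <= tol for a in group_advantages(g))
    )
    return collapsed / len(groups)


# Hypothetical binary verifier rewards for four prompts, four rollouts each.
groups = [
    [1, 1, 1, 1],  # all correct: no learning signal
    [0, 0, 0, 0],  # all incorrect: no learning signal
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
acr = advantage_collapse_rate(groups)
print(acr, group_advantages(groups[0]))
```

Two of the four toy groups are homogeneous, so half of the sampled rollouts in this example would contribute nothing to the policy gradient.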

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏以思考，预见以行动：自动驾驶的认知-物理强化学习

Authors: Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21139
Pdf link: https://arxiv.org/pdf/2605.21139
Abstract Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.
中文摘要 当前端到端自动驾驶模型根本受制于模仿学习的行为克隆天花板。虽然强化学习为实现更智能的自主提供了道路，但它需要两个缺失的基础设施：（1）理解交通语义和驾驶意图的认知基础，以及（2）能够预见候选行为后果的具前瞻性的物理环境。为此，我们提出了CoPhy，一种用于自动驾驶的认知物理强化学习框架。为了简化思考，我们将VLM知识提炼到BEV编码器中，然后完全舍弃VLM，保持认知能力且不需任何推理成本，同时释放认知通道作为可选人类语言命令的可插拔接口。为了预见行动，我们构建了一个自回归的BEV世界模型，明确预测基于候选动作的未来语义图，作为一个可解释的物理沙盒，直接推导安全指标。基于这一双重基础设施，我们通过GRPO优化驱动策略，采用一种新颖的双重奖励机制：基于BEV推广的物理奖励强制执行硬安全约束，而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明，CoPhy不仅在NAVSIM v1和v2基准测试上取得了最先进的成绩，还通过认知知情的场景合规性和用户自定义语言指令灵活的意图控制，实现了更安全的驾驶。

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

通过逆向生成数据和引导强化学习学习第一积分

Authors: Jingfeng Zhong, Zhengxiang Liu, Zhijie Wang, Shuai Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21160
Pdf link: https://arxiv.org/pdf/2605.21160
Abstract The discovery of first integrals is of fundamental scientific importance for understanding conservation laws in dynamical systems. However, existing symbolic computation tools and Large Language Models (LLMs) remain limited on this task because high-quality training data are scarce and successful solutions often depend on mathematical intuition. This paper presents FISolver, an LLM-based solver developed to address this challenge. First, we introduce a "Backward Generation" algorithm that systematically builds large-scale datasets of (differential equation, first integral) pairs by deriving differential equations from sampled integrals, thereby alleviating the data scarcity bottleneck. Second, we apply supervised fine-tuning to a compact mathematical model and further improve its performance through reinforcement learning with a Levenshtein Distance-based shaped reward. In addition, we design data synthesis and blending strategies that support effective adaptation to difficult problem families from sparse examples. Experiments show that FISolver, while requiring substantially lower computational cost, significantly outperforms larger mathematical LLMs and commercial solvers such as Mathematica on challenging benchmarks, indicating a new data-driven route for automated discovery of first integrals.
中文摘要 第一积分的发现对于理解动力系统中的守恒定律具有根本性的科学意义。然而，现有的符号计算工具和大型语言模型（LLM）在这方面仍然有限，因为高质量的训练数据稀缺，成功解决方案往往依赖数学直觉。本文介绍了FISolver，一款基于LLM的求解器，旨在解决这一挑战。首先，我们引入了一种“逆向生成”算法，通过从抽样积分推导微分方程，系统地构建大规模的（微分方程，第一积分）对数据集，从而缓解了数据稀缺的瓶颈。其次，我们将监督微调应用于紧凑的数学模型，并通过基于Levenshtein距离的形状奖励进行强化学习进一步提升其性能。此外，我们设计数据综合和混合策略，支持从稀疏样本中有效适应棘手问题家族。实验显示，尽管FISolver的计算成本显著降低，但在具有挑战性的基准测试中，它显著优于大型数学大型语言模型和商业求解器如Mathematica，这表明了第一积分自动发现数据驱动的新路径。
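
The abstract mentions a Levenshtein Distance-based shaped reward without giving its form. A minimal sketch, assuming the reward is one minus the edit distance normalised by the longer string; the normalisation, the function names, and the example expressions are all assumptions for illustration:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def shaped_reward(predicted, target):
    """1.0 for an exact string match, decreasing with normalised edit distance."""
    if not predicted and not target:
        return 1.0
    return 1.0 - levenshtein(predicted, target) / max(len(predicted), len(target))


# Hypothetical predicted and reference first integrals as plain strings.
exact = shaped_reward("x**2+y**2", "x**2+y**2")
near = shaped_reward("x**2-y**2", "x**2+y**2")
far = shaped_reward("x", "x**2+y**2")
print(exact, near, far)
```

A string-level reward of this kind gives partial credit to near misses, but it cannot recognise equivalent answers written differently, since any function of a first integral is again a first integral.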

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot：可控边界驱动的自动驾驶临界场景生成

Authors: Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21168
Pdf link: https://arxiv.org/pdf/2605.21168
Abstract Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $\sigma$ with an online-learned AV-risk predictor $\Phi$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at this https URL.
中文摘要 安全关键场景是评估自动驾驶系统的核心，但它们在自然驾驶日志中十分罕见，因此基于仿真的压力测试不可或缺。大多数场景生成方法将周围智能体视为对手，但它们要么（i）在未显式建模车辆-道路物理极限的情况下诱发失败，产生视觉上极端但物理上无解的碰撞；要么（ii）孤立地强制物理可行性或策略可行性，这可能过度关注激进机动，或受限于依赖控制器的能力边界。我们提出了ScenePilot，一个可行性引导、边界驱动的框架，其目标是边界带：原则上物理可解、但仍会导致已部署的自动驾驶系统失效的场景。我们将生成过程表述为带约束的多目标强化学习，结合由RSS导出的物理可行性评分$\sigma$与在线学习的自动驾驶车辆（AV）风险预测器$\Phi$，并引入步级可行性感知屏蔽，使探索保持在可行性边界附近，同时避免不可行的伪影。在SafeBench上使用多个规划器进行的实验表明，ScenePilot在保持物理有效性的同时显著提高了碰撞率（+6.2个百分点），并且在这些边界带场景上进行对抗微调能够持续降低下游碰撞率。代码可在该 https URL 访问。
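The abstract describes a physical-feasibility score derived from RSS (Responsibility-Sensitive Safety) without giving its form. The sketch below shows only the standard RSS minimum longitudinal following distance, plus a purely hypothetical way to squash it into a [0, 1] score. The function `feasibility_score` and all parameter names are assumptions for illustration and should not be read as ScenePilot's definition of $\sigma$.

```python
def rss_min_longitudinal_gap(v_rear, v_front, response_time,
                             a_max_accel, b_min_brake, b_max_brake):
    """RSS minimum safe following distance in metres, clipped at zero.

    The rear vehicle may accelerate at a_max_accel during response_time and
    then brakes at least at b_min_brake; the front vehicle brakes at most
    at b_max_brake. Speeds are in m/s, accelerations in m/s^2.
    """
    v_rear_after = v_rear + response_time * a_max_accel
    gap = (v_rear * response_time
           + 0.5 * a_max_accel * response_time ** 2
           + v_rear_after ** 2 / (2.0 * b_min_brake)
           - v_front ** 2 / (2.0 * b_max_brake))
    return max(gap, 0.0)


def feasibility_score(actual_gap, required_gap):
    """Hypothetical score: 1.0 when the RSS gap is met, shrinking toward 0.0."""
    if required_gap == 0.0:
        return 1.0
    return min(max(actual_gap, 0.0) / required_gap, 1.0)


required = rss_min_longitudinal_gap(v_rear=20.0, v_front=20.0, response_time=1.0,
                                    a_max_accel=2.0, b_min_brake=4.0, b_max_brake=8.0)
print(required, feasibility_score(30.0, required))
```

A scenario whose gap falls well below the RSS distance may be unavoidable for any controller, which is the kind of case the abstract wants to exclude from the boundary band.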

Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

用于高密度奖励代码生成的领域自适应强化学习

Authors: Erfan Aghadavoodi Jolfaei, Daniel Maninger, Abhinav Anand, Mert Tiftikci, Mira Mezini
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.21180
Pdf link: https://arxiv.org/pdf/2605.21180
Abstract Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance, in robotics, where code generation is increasingly used for planning and executing actions, awareness of the environment and physical constraints is critical. To facilitate the adaptation of code-generating LLMs to diverse requirements, including domain-specific ones, we present a reinforcement learning framework that fine-tunes pre-trained LLMs using proximal policy optimization. Our customizable execution-aware reward formula captures and optimizes syntax, functional correctness, code style, security, and simulator executability. A token-level reward mapping mechanism enables effective credit assignment from execution outcomes to generated tokens. The framework is evaluated on general-purpose code generation (MBPP/MBPP+) and robotic program synthesis (RoboEval). The results show substantial improvements in functional correctness and simulator executability, including an absolute pass@1 increase of 19% on MBPP and a reduction in execution failures by 51% on RoboEval. These findings demonstrate that structured reinforcement learning can effectively align language models to correct program generation and domain-specific requirements.
中文摘要 大型语言模型在自动代码生成方面展现出强大潜力，但缺乏对正确性、质量、安全性和领域特定约束的保证。例如，在机器人领域，代码生成越来越多地用于规划和执行动作，因此对环境和物理约束的感知至关重要。为了促进代码生成LLM适应包括领域特定需求在内的多样化需求，我们提出了一个强化学习框架，使用近端策略优化对预训练LLM进行微调。我们可定制的执行感知奖励公式刻画并优化语法、功能正确性、代码风格、安全性和模拟器可执行性。词元级奖励映射机制实现了从执行结果到生成词元的有效信用分配。该框架在通用代码生成（MBPP/MBPP+）和机器人程序合成（RoboEval）上进行了评估。结果显示，功能正确性和模拟器可执行性显著提升，包括MBPP上的pass@1绝对提升19%，以及RoboEval上的执行失败减少51%。这些发现表明，结构化的强化学习能够有效地使语言模型对齐到正确的程序生成和领域特定需求。

Reinforcement Learning-based Control via Y-wise Affine Neural Networks: Comparative Case Studies for Chemical Processes

基于强化学习的Y向仿射神经网络控制：化学过程的比较案例研究

Authors: Austin Braniff, Yuhe Tian
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.21211
Pdf link: https://arxiv.org/pdf/2605.21211
Abstract In this work we present an efficient and practically implementable approach for applying reinforcement learning (RL)-based control in chemical process systems. This area has yet to widely adopt RL-based control, largely due to inherent challenges in trusting RL algorithms and the time-consuming process of training reliable agents. To address these challenges, we leverage a class of RL algorithms termed Y-wise Affine Neural Network (YANN)-RL, which we developed in our prior work (Braniff and Tian, 2025a). By strategically initializing actor and critic networks, YANN-RL algorithms provide confident and interpretable starting points within control schemes. We apply this RL-based control approach to three different process engineering case studies publicly available in the PC-Gym library (Bloor et al., 2026): (i) a continuous stirred tank reactor (CSTR), (ii) a four-tank system, and (iii) a multistage extraction column. Our approach is compared to several popular RL algorithms (PPO, SAC, DDPG, and TD3) and is benchmarked against nonlinear model predictive control (NMPC). These case studies demonstrate that YANN-RL can greatly reduce the training time and data needed, can be deployed with confidence for chemical process systems, and can approach the performance of NMPC without knowledge of a full nonlinear model.
中文摘要 在本研究中，我们提出了一种高效且可操作的方法，用于在化学工艺系统中应用基于强化学习（RL）的控制。由于信任强化学习算法的固有挑战以及训练可靠代理的耗时过程，这一领域尚未广泛采用基于强化学习的控制。为应对这些挑战，我们利用了一类名为Y向仿射神经网络（YANN）- RL的强化学习算法，该算法在我们之前的工作中已开发（Braniff和Tian，2025a）。通过战略性初始化演员网络和批评网络，YANN-RL算法在控制方案中提供了自信且可解释的起点。我们将基于强化学习的控制方法应用于PC-Gym库中公开的三个不同工艺工程案例研究（Bloor等，2026）：（i）连续搅拌罐反应器（CSTR），（ii）四罐系统，以及（iii）多级提取塔。我们的方法与多种流行的强化学习算法（PPO、SAC、DDPG 和 TD3）进行了比较，并以非线性模型预测控制（NMPC）为基准测试。这些案例研究表明，YANN-RL可以大幅缩短训练时间和数据，可以自信地部署于化学工艺系统，并且在没有完整非线性模型的情况下，也能接近NMPC的性能。

Behavior-Consistent Deep Reinforcement Learning

行为一致性深度强化学习

Authors: Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21214
Pdf link: https://arxiv.org/pdf/2605.21214
Abstract Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
中文摘要 强化学习（RL）在不同训练运行之间常表现出较高的方差，导致性能不可靠，并给实际领域中的部署带来重大挑战。本研究通过形式化行为一致性强化学习问题来应对跨运行策略分歧的挑战，其目标是获得既高性能、又在不同训练运行间分布相似的策略。我们的关键观察是，最大熵强化学习通过将各次运行锚定到一个共同的（均匀）先验，提供了控制行为分歧的直接机制。我们证明，对于玻尔兹曼策略，选择与$Q$函数分歧成正比的温度可以约束所得策略之间的两两KL散度的上界。然而，我们也表明，简单地增加熵可能会损害策略优化，同时放大离策略误差。基于这些观察，我们提出了$Q$值期望分位数分歧（QED），这是一种依赖状态的温度调度，利用双评论家之间的分歧作为跨运行分歧的单次运行代理。实验表明，在18个连续控制任务上，QED在不牺牲性能的情况下将跨运行分歧降低了两个数量级，从而以适度的样本效率代价显著降低了回报方差。
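To make the temperature argument concrete, the sketch below builds Boltzmann policies from two sets of Q-values and measures the KL divergence between them at several temperatures. The Q-values are made-up numbers standing in for two training runs, and the code only illustrates the qualitative effect (a higher temperature pulls both policies toward uniform); it is not the QED schedule from the paper.

```python
import math


def boltzmann(q_values, temperature):
    """Softmax over Q-values at the given temperature (numerically stabilised)."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical Q-estimates for one state from two independent training runs.
q_run_a = [1.0, 0.2, -0.5]
q_run_b = [0.4, 0.9, -0.3]

for temperature in (0.1, 1.0, 10.0):
    policy_a = boltzmann(q_run_a, temperature)
    policy_b = boltzmann(q_run_b, temperature)
    print(f"temperature={temperature:5.1f}  KL={kl_divergence(policy_a, policy_b):.4f}")
```

In this toy case the divergence shrinks as the temperature grows, which matches the abstract's point that entropy can buy consistency, at the risk of hurting policy optimization when raised indiscriminately.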

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

PREFINE：基于偏好的隐性奖励与成本微调以实现安全对齐

Authors: Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21225
Pdf link: https://arxiv.org/pdf/2605.21225
Abstract We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we consider the more general setting in which costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE (Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment), a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), now widely used for LLM fine-tuning, to the sequential decision-making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining original reward behavior. PREFINE produces policies that achieve low-cost, high-reward performance with significantly improved data and computational efficiency compared to full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.
中文摘要 我们通过纳入成本约束，解决让预训练强化学习（RL）政策具备安全意识的问题，而无需从零重新训练。虽然成本可以用数值编码，但我们假设更一般的情境是成本作为偏好提供。给定一个奖励优化的策略和一个小的优选（低成本）和不优（高成本）轨迹数据集，我们的目标是微调该策略，以生成低成本行为，同时保留高回报。与语言模型中的标准RLHF不同，后者偏好是针对同一提示的响应定义的，我们的设置涉及轨迹级偏好，且是在连续控制环境中。我们介绍了PREFINE：基于偏好的隐性奖励与成本微调安全对齐，这是一种基于偏好的微调方法，将现已被广泛用于大型语言模型微调的直接偏好优化（DPO）适配到顺序决策环境。PREFINE构建了策略抽样的反事实轨迹，以建立有意义的偏好对比，并共同优化奖励的保留和安全一致性。从经验角度看，PREFINE在保持原始奖励行为的同时，将约束违规和灾难性失败减少了超过60%。PREFINE能够实现低成本、高回报的性能，数据和计算效率显著提升，相较于完全离线强化学习或模仿学习，桥接偏好对齐和连续域中安全的策略适应。

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

LamPO：一种用于推理语言模型的Lambda风格策略优化

Authors: Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.21235
Pdf link: https://arxiv.org/pdf/2605.21235
Abstract Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose LamPO, a Lambda-Style Policy Optimization method that replaces scalar group advantages with a Pairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升推理语言模型在数学、编程和科学问答等任务上表现的有效范式。然而，广泛使用的组相对目标（如GRPO）用标量统计量概括每个采样组，因而丢弃了候选回答之间的细粒度关系信息。这削弱了稀疏结果奖励下的信用分配，尤其是当多个生成解在推理质量上仅有细微差异时。我们提出了LamPO，一种Lambda风格策略优化（Lambda-Style Policy Optimization）方法，用成对分解优势（Pairwise Decomposed Advantage）替代标量组优势。LamPO汇总每个回答组内的成对奖励差距，并用基于序列对数概率差计算的置信度感知权重调节每次比较，同时保留PPO式优化的无评论家和裁剪更新结构。当有参考解可用时，我们进一步加入一个轻量的、基于ROUGE-L的稠密辅助奖励，以缓解奖励稀疏性。在AIME24、AIME25、MATH-500和GPQA-Diamond上使用Qwen3-1.7B、Qwen3-4B和Phi-4-mini进行的实验表明，LamPO持续优于GRPO及近期的RLVR变体，且训练动态更稳定、样本效率更高。
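The contrast between a scalar group baseline and a pairwise decomposition can be sketched in a few lines. The abstract does not give LamPO's weighting function, so the sigmoid over sequence log-probability gaps used here is an assumption chosen only to show the shape of the computation. The function names are likewise hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def group_mean_advantages(rewards):
    """Scalar baseline: each response is compared only with the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def pairwise_advantages(rewards, logprobs, scale=1.0):
    """Average of weighted pairwise reward gaps within one response group.

    The sigmoid weight on the log-probability gap is an illustrative choice,
    not the confidence weight defined in the paper. Needs at least two responses.
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                weight = _sigmoid(scale * (logprobs[i] - logprobs[j]))
                total += weight * (rewards[i] - rewards[j])
        advantages.append(total / (n - 1))
    return advantages


rewards = [1.0, 1.0, 0.0, 0.0]          # verifier outcomes for four samples
logprobs = [-12.0, -20.0, -15.0, -30.0]  # hypothetical sequence log-probabilities
print(group_mean_advantages(rewards))
print(pairwise_advantages(rewards, logprobs))
```

With `scale=0.0` every weight is 0.5 and the pairwise form collapses to a rescaled group-mean advantage; with a non-zero scale, two equally rewarded responses can receive different advantages, which is the extra relational signal the abstract refers to.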

Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

通过可微分的CVaR屏障函数进行风险适应的强化学习

Authors: Xinyi Wang, Taekyung Kim, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.21257
Pdf link: https://arxiv.org/pdf/2605.21257
Abstract Planning through crowded environments under uncertain obstacle motions remains difficult, as stochastic interactions often induce overly conservative behavior or reduced efficiency. To address this challenge, we propose an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty modeled by a Gaussian mixture model. The framework combines reinforcement learning (RL) with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk (CVaR) barrier functions, jointly learning nominal control input, risk level, and safety margin and enforcing explicit probabilistic safety constraints. This design enables context-aware adaptation, promoting efficient behavior while invoking caution only when necessary. We conduct extensive evaluations in dynamic, uncertain, and crowded environments across varying obstacle densities and robot models, and further assess generalization under three out-of-distribution cases. Comparisons across optimization-based, RL-based, and integrated RL and optimization methods are provided, and the proposed method is shown to deliver the strongest overall performance in safety, efficiency, and generalization under uncertainty.
中文摘要 在障碍物运动不确定的情况下穿越拥挤环境进行规划仍然困难，因为随机交互常导致过度保守的行为或效率下降。为应对这一挑战，我们提出了一个端到端的风险自适应框架，用于障碍物运动不确定性下的人群导航，其中不确定性由高斯混合模型建模。该框架将强化学习（RL）与基于条件风险价值（CVaR）屏障函数的可微二次规划安全层相结合，联合学习标称控制输入、风险水平和安全裕度，并强制执行显式的概率安全约束。这种设计实现了情境感知的自适应，在促进高效行为的同时仅在必要时才保持谨慎。我们在动态、不确定且拥挤的环境中，针对不同的障碍物密度和机器人模型进行了广泛评估，并进一步评估了三种分布外情形下的泛化能力。我们对基于优化的方法、基于强化学习的方法以及强化学习与优化相结合的方法进行了比较，结果表明所提出的方法在不确定性下的安全性、效率和泛化性方面整体表现最强。
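For readers unfamiliar with the risk measure named above, the following sketch estimates Conditional Value-at-Risk from samples as the mean of the worst tail of a loss distribution. It covers only the risk measure, not the barrier-function or quadratic-program layer of the paper, and the sample values are invented for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled losses.

    alpha = 0.0 gives the plain mean (risk-neutral); alpha close to 1.0
    approaches the single worst sample (highly risk-averse).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    tail_size = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical sampled safety losses (larger means closer to collision).
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(empirical_cvar(samples, alpha=0.0), empirical_cvar(samples, alpha=0.8))
```

Treating the risk level as a learned quantity, as the abstract describes, amounts to letting the policy choose how much of this tail to guard against in a given context.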

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

多少在线强化学习才足够？RLVR中用于离线偏好优化的高信息量Rollout

Authors: Richa Verma, Balaraman Ravindran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21266
Pdf link: https://arxiv.org/pdf/2605.21266
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a range of values for the number of online GRPO steps (K) on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8 percentage points at roughly 4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is governed not by the number of preference pairs, which varies little with K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.
中文摘要 可验证奖励强化学习（RLVR）已成为语言模型推理的强大范式，GRPO是其主要代表。然而，GRPO需要持续在线生成rollout，计算成本高且难以扩展。虽然直接偏好优化（DPO）提供了一种稳定高效的离线替代方案，但在使用冷启动的监督微调（SFT）策略产生的rollout进行训练时，通常预期其表现不及GRPO等在线强化学习方法。我们提出了G2D（GRPO to DPO），这是一条三阶段流水线：先进行短暂的GRPO预热，再构建静态偏好数据集，最后用DPO离线微调模型。在Qwen2.5-7B和Llama-3.1-8B上，针对GRPO在线步数（K）的一组取值，我们发现在我们的设定中，经过适度预热的离线DPO能够以显著更低的计算成本匹配或超过GRPO。在Qwen2.5-7B上，G2D在K=150时在MATH-500上达到62.4%，比GRPO（51.6%）高出10.8个百分点，计算量约低4倍。在Llama-3.1-8B上，G2D在K=500时达到49.4%，在我们的实验设定中超过了GRPO。我们表明，性能并不取决于偏好对的数量（该数量随K变化不大），而取决于偏好对的信息量。适度的预热会产生具有校准不确定性的rollout，带来更强的对比信号；而过度预热则导致策略过度自信、数据信息量下降。我们的结果将RLVR中的离线与在线差距重新表述为主要是数据信息量问题，并指出短暂的在线强化学习预热加上对微调数据集进行适当的难度校准，是在线强化学习的一种计算高效的替代方案。
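The middle stage of the pipeline, turning verified rollouts into a static preference dataset, might look roughly like the sketch below. The pairing rule (every correct response against every incorrect one for the same prompt) and the dictionary layout are assumptions for illustration; the abstract does not specify how G2D selects pairs.

```python
from itertools import product


def build_preference_pairs(rollouts):
    """Pair each verified-correct response with each incorrect one, per prompt.

    `rollouts` maps a prompt to a list of (response, is_correct) tuples.
    Prompts whose samples are all correct or all wrong contribute no pairs,
    because they carry no contrast for a DPO-style loss.
    """
    pairs = []
    for prompt, samples in rollouts.items():
        chosen = [text for text, ok in samples if ok]
        rejected = [text for text, ok in samples if not ok]
        for good, bad in product(chosen, rejected):
            pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs


# Hypothetical rollouts after a short warm-up: one mixed, one solved, one unsolved.
rollouts = {
    "2 + 2 = ?": [("4", True), ("four, i.e. 4", True), ("5", False)],
    "1 + 1 = ?": [("2", True), ("2", True)],
    "hard problem": [("guess A", False), ("guess B", False)],
}
pairs = build_preference_pairs(rollouts)
print(len(pairs))
```

Only the prompt with mixed outcomes yields pairs here, which mirrors the abstract's observation that an overconfident policy (mostly all-correct groups) produces less informative preference data.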

DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

DriveMA：重新思考驱动VLA中的语言接口，采用一步元行动

Authors: Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.21273
Pdf link: https://arxiv.org/pdf/2605.21273
Abstract Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and being automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory--meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.
中文摘要 驱动视觉-语言-行动模型（Driving VLAs）通常将自然语言推理作为端到端规划的中间接口，但以推理为中心的界面面临三个实际瓶颈：获得高质量推理注释困难，紧凑模型难以生成和理解长推理链，推理延迟显著增加。本文重新思考了《驱动VLAs》中语言接口的设计，并展示了简洁的一步元动作是对冗长推理的简单有效替代方案。元动作在保持低熵的同时，提供了语义决策基础，并且可自动从专家轨迹中推导，从而实现可扩展的监督和可靠的轨迹条件。基于该接口，我们提出了DriveMA，结合了以行动为中心的监督训练与回合级学分赋值强化学习框架，共同优化元动作正确性、轨迹质量和轨迹-元动作一致性。实验显示，DriveMA在Waymo端到端驾驶挑战赛中已达到2B模型的新水平，Rater Feedback Score（RFS）达到8.060，而其4B版本进一步提升至8.079;DriveMA在NAVSIM上也取得了竞争性表现。消融表明，一步元动作在表达力、可预测性和推理效率之间提供了比自然语言推理或更细粒度动作序列更好的实际权衡。代码、数据和模型将发布，以促进未来研究。

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

Stochastic MeanFlow 策略：带熵镜像下降的一步生成控制

Authors: Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21282
Pdf link: https://arxiv.org/pdf/2605.21282
Abstract Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.
中文摘要 在线非策略强化学习（RL）由两个耦合选择塑造：策略类和更新规则。高斯策略快速且熵可解，但在多模作用分布方面存在困难。生成策略表达力更强，但通常需要迭代抽样或缺乏可处理的熵估计。在优化方面，SAC式软政策改进和镜像下降（MD）可视为最小化不同的KL背度：前者使政策趋向价值诱导的玻尔兹曼分布，后者则使每次更新都与前一政策正则化。因此，将熵正则化与MD约束结合具有吸引力，因为它支持探索同时稳定政策改进;然而，得到的目标可以是多模的，且与单峰高斯策略匹配较差。我们提出了随机均流策略（SMFP），这是一种一步生成策略类，通过均流变换将高斯噪声映射到动作。这种随机重参数化产生了一个可处理的熵替代变量，并允许在非策略镜像下降中训练平均流量策略，目标统一，实现探索性且稳定的改进。在七个MuJoCo基准测试中，SMFP优于高斯基线和生成基线，同时保持单步推理效率。

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

TimeSRL：通过语义强化学习调优的大型语言模型进行可推广时间序列行为建模——心理健康案例研究

Authors: Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2605.21295
Pdf link: https://arxiv.org/pdf/2605.21295
Abstract Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
中文摘要 纵向被动传感使持续的健康预测成为可能，但模型在跨数据集分布偏移下常常失效。传统机器学习会过拟合特定队列的伪影，而大型语言模型（LLM）难以在冗长且异质的时间序列上可靠推理。我们提出了TimeSRL，一个两阶段的LLM框架，让预测经过显式的语义瓶颈。模型首先将原始信号抽象为高层自然语言，然后仅凭这些抽象预测行为结果。这迫使模型在语义概念上进行推理，我们认为这些概念比原始数值更易泛化。我们使用组相对策略优化（GRPO）结合可验证奖励强化学习（RLVR）对该过程进行端到端优化，在没有人工标注的中间结果的情况下学习与结果对齐的抽象。在心理健康预测任务上，TimeSRL在一个旨在按严格的留一数据集（LOSO）协议对跨队列泛化进行压力测试的基准上达到了最先进的性能：相对于强的非LLM机器学习基线和LLM基线，焦虑的平均绝对误差（MAE）分别降低3.1--10.1%和9.5--44.1%，抑郁的平均绝对误差分别降低3.2--9.6%和27.4--57.6%（所有$p$<0.05）。在跨不同传感流水线的跨基准迁移中，TimeSRL显著优于以往方法，且无需目标域微调即可媲美其自身的域内性能。这些结果表明语义抽象是可复用的，并为通过强化学习调优的LLM实现可泛化的行为建模指出了新方向。

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

DeCoR：利用强化学习设计与控制城市街道的协同优化

Authors: Bibek Poudel, Lei Zhu, Kevin Heaslip, Sai Swaminathan, Weizi Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21311
Pdf link: https://arxiv.org/pdf/2605.21311
Abstract Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.
中文摘要 现代视觉系统能够大规模检测、追踪和预测城市行为者，但将感知输出转化为城市设计的效果仍然有限。我们介绍了DeCoR，一个两阶段的强化学习框架，利用流量观测来协同优化斑马线布局和网络级信号控制。设计阶段将行人网络编码为图，学习生成策略，该策略对高斯混合模型参数化，涵盖斑马线位置和宽度，从中抽样新的斑马线。对于每种布局，共享控制策略学习自适应信号时序，以最小化行人和车辆联合延迟。在一条750米长的真实城市走廊上，通过视频和Wi-Fi日志感知需求，DeCoR学习出一种布局，能够将行人到达最近斑马线的时间缩短23%，同时使用比现有配置更少的人行横道。在控制端，DeCoR相较于固定时间信号，分别将行人和车辆的等待时间缩短了79%和65%。此外，控制策略可推广至培训外的需求，且在无需重新培训的情况下实现布局变更时具有鲁棒性。

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

通过本体感觉变换器学习关节传感器的强健灵巧手部操作

Authors: Senlan Yao, Chenyu Yang, Jaehoon Kim, Aristotelis Sympetheros, Robert K. Katzschmann
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.21330
Pdf link: https://arxiv.org/pdf/2605.21330
Abstract In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing -- the most readily available modality on any robotic hand -- remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.
中文摘要 手持物体操作是灵巧机器人既基础又具挑战性的能力。尽管灵巧操作取得了显著进步，现有方法仍高度依赖视觉或触觉感知来追踪物体状态，而关节感知——任何机器人手中最容易获得的技术——仍然大多被忽视，尤其是在肌腱驱动的手中。本文探讨了仅靠关节感应能达到的程度，问题包括：（i）运动编码器还是直接关节传感是否能提供更好的本体感觉反馈，（ii）如何从关节测量中提取环境信息，（iii）仅关节控制是否能在无外部感知的情况下实现具有竞争力的现实世界性能。我们介绍本体感觉变换器（PT），这是一种无外感觉的方法，用于在肌腱驱动的灵巧手上实现连续立方体旋转，仅依靠关节感应反馈。教师策略首先通过特权对象信息的强化学习进行训练，然后提炼为PT，PT仅基于联合位置和速度历史。Transformer 架构有效从联合传感器读数中的时间模式中提取隐式对象状态信息。在真实ORCA手型上的实验显示，我们的自转速度比基线高3.1倍。我们还证明，我们的PT在立方体位置估计上的RMSE比MLP基线低23.4%，表明从本体感觉来源提取外感受信息的能力更优。

Validating Navmesh using Geometry: Voxel-Based Analysis with Prioritized Exploration

利用几何验证导航网格：基于体素的分析与优先探索

Authors: Ramesh Raghavan, Ojas Sharma, Sebastien Larrue, Alan Isaac Kunder, Aakash Sai, Rishi Mathur
Subjects: Software Engineering (cs.SE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.21397
Pdf link: https://arxiv.org/pdf/2605.21397
Abstract Navigation mesh (Navmesh) inconsistencies affect the player experience by directly impacting the navigation systems used by non-playable characters (NPCs) in game environments. While navmeshes are generated from world geometry using well-established algorithms, environments change throughout development as terrain is adjusted and assets are moved or replaced, resulting in mismatches between the navmesh and the actual environment. Existing automated approaches attempt to detect navigation issues using exploration agents and reinforcement learning techniques. However, since these methods rely on the navigation data itself or evaluate navigation behavior indirectly, they do not explicitly verify whether the navigation representation reflects the walkable space defined by underlying geometry. This paper presents a framework for validating navigation meshes through an independent, geometry-driven analysis of navmesh correctness. The approach reconstructs walkable space directly from environment geometry using a voxel-based representation, followed by constraint-aware traversal and connectivity evaluation. Validation is formulated as a prioritized search problem over the voxel space, where reinforcement learning guides sampling toward regions more likely to exhibit inconsistencies. At each sampled location, reachability derived from the voxel representation is compared against reachability obtained from the navmesh via engine-level queries. Experiments across multiple large-scale open-world game environments show that the approach consistently lowers exploration effort while maintaining similar defect detection coverage. The framework runs offline within the game engine and can be integrated into automated quality assurance pipelines. Since the method relies on geometry, it can be adapted across game engines with minimal changes, making it suitable for production deployment.
中文摘要 导航网格（Navmesh）不一致通过直接影响游戏环境中非可操作角色（NPC）使用的导航系统，从而影响玩家体验。虽然导航网格是通过成熟算法从世界几何体生成的，但随着地形调整和资产移动或替换，环境在开发过程中会发生变化，导致导航网格与实际环境之间存在不匹配。现有的自动化方法尝试通过探索代理和强化学习技术检测导航问题。然而，由于这些方法依赖导航数据本身或间接评估导航行为，因此它们并未明确验证导航表示是否反映了底层几何定义的可步行空间。本文提出了通过独立、几何驱动的导航网格正确性分析来验证导航网格的框架。该方法通过基于体素的表示，直接从环境几何重建可步行空间，随后进行约束感知的遍历和连通性评估。验证被表述为对体素空间的优先搜索问题，强化学习引导采样到更可能表现出不一致的区域。在每个采样位置，从体素表示得出的可达性与通过引擎级查询从导航网格获得的可达性进行比较。在多个大型开放世界游戏环境中的实验表明，这种方法在保持缺陷检测覆盖率的同时，始终能降低探索工作量。该框架可在游戏引擎中离线运行，并可集成到自动化的质量保证流程中。由于该方法依赖几何结构，可以在不同游戏引擎之间进行适配，且改动极小，适合生产部署。

roto 2.0: The Robot Tactile Olympiad

Roto 2.0：机器人触觉奥林匹克竞赛

Authors: Elle Miller, Jayaram Reddy, Ayush Deshmukh, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21429
Pdf link: https://arxiv.org/pdf/2605.21429
Abstract Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (roto 2.0), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: this https URL
中文摘要 基于触觉的强化学习（RL）目前受限于研究的碎片化以及对已过度饱和的朝向调整任务的关注。我们推出机器人触觉奥林匹克（roto 2.0）第2版，这是一个GPU并行化的基准，旨在对四种不同机器人形态（16-DOF至24-DOF）上的触觉强化学习进行标准化。与以往基准不同，roto专注于端到端的“盲”操作，仅利用本体感觉和触觉感知，不使用状态信息或蒸馏。我们展示了显著的性能飞跃：我们的盲操作智能体在10秒内完成13次保定球旋转，比当前最先进的速度快一个数量级。通过开源我们的环境和经过稳健调参的基线，我们降低了入门门槛，使研究人员能够优先解决根本性的算法挑战，而非繁琐的强化学习调参。网站：该 https URL

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Mem-$π$：通过学习何时何物生成的适应性记忆

Authors: Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21463
Pdf link: https://arxiv.org/pdf/2605.21463
Abstract We present Mem-$\pi$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$\pi$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$\pi$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.
中文摘要 我们提出 Mem-$\pi$，一个面向大型语言模型（LLM）智能体的自适应记忆框架，其中有用的指导是按需生成的，而非从外部记忆库中检索。现有的记忆增强智能体通常依赖基于相似度的检索，从情节记忆库或技能库中返回静态条目，而这些条目常常与当前上下文不匹配。相比之下，Mem-$\pi$ 使用一个专用的语言模型或视觉语言模型，它拥有自己的参数，独立于下游智能体，为复杂任务生成针对上下文的指导。以当前智能体上下文为条件，该模型联合决定何时生成指导以及生成何种指导。我们用决策与内容解耦的强化学习（RL）目标训练它，使其在生成无益时选择不生成，否则生成简洁有用的指导。在涵盖网页导航、基于终端的工具使用和基于文本的具身交互的多种智能体基准上，Mem-$\pi$ 始终优于基于检索的记忆基线和先前经强化学习优化的记忆基线，在网页导航任务上取得超过 30% 的相对提升。

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

DelTA：面向可验证奖励强化学习的判别式词元信用分配

Authors: Kaiyi Zhang, Wei Wu, Yankai Lin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.21467
Pdf link: https://arxiv.org/pdf/2605.21467
Abstract Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
中文摘要 基于可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的核心技术。尽管它很有效，但响应级奖励如何转化为词元级概率变化，目前仍缺乏充分理解。我们引入了看待 RLVR 更新的判别器视角，表明策略梯度更新方向隐式地充当词元梯度向量上的线性判别器，从而决定学习过程中哪些词元的概率被提高或降低。在标准的序列级 RLVR 下，该判别器由正侧和负侧质心构成，这些质心通过对词元梯度向量做优势加权平均得到。然而，这种质心构建可能被共享的高频模式（例如格式化词元）所主导，从而稀释那些稀疏但具有判别性、更能区分高奖励响应与低奖励响应的方向。为解决这一局限，我们提出 $\textbf{DelTA}$，一种判别式词元信用分配方法，它通过估计词元系数来放大某一侧特有的词元梯度方向，并降低共享或弱判别性方向的权重。这些系数对自归一化的 RLVR 替代目标进行重新加权，使有效的各侧质心更具对比性，从而重塑 RLVR 的更新方向。在七个数学基准上，DelTA 在 Qwen3-8B-Base 和 Qwen3-14B-Base 上的平均得分分别比最强的同规模基线高出 3.26 分和 2.62 分。在代码生成、不同骨干模型和域外评估上的额外结果进一步展示了 DelTA 的泛化能力。

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

你只需要极少的 RLVR 训练：通过秩 1 轨迹外推 LLM

Authors: Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.21468
Pdf link: https://arxiv.org/pdf/2605.21468
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLM）推理能力的主流范式，然而由此产生的参数轨迹的底层几何结构仍未被充分探索。本研究表明，RLVR 的权重轨迹秩极低且高度可预测。具体来说，我们发现大部分下游性能提升都能被参数增量的秩 1 近似所捕捉，且该投影的幅度随训练步数近似线性变化。基于此，我们提出了一种简单且计算高效的方法 RELEX（REinforcement Learning EXtrapolation，强化学习外推），该方法从较短的观察窗口估计秩 1 子空间，并通过线性回归外推未来的检查点，无需任何学习得到的模型。在三个模型（即 Qwen2.5-Math-1.5B、Qwen3-4B-Base 和 Qwen3-8B-Base）上，RELEX 生成的检查点在域内和域外基准上均达到或超过 RLVR 的性能，所需步数最少仅为完整 RLVR 训练的 15%。值得注意的是，RELEX 能够在不产生训练成本的情况下远远外推到观察窗口之外，预测的检查点可超出已观察前缀 10-20$\times$，且性能持续提升（例如，只观察前 50 步，外推到 1000 步）。我们的消融分析证实了 RELEX 的极简充分性：无论是提高子空间的秩还是采用非线性建模，都无法带来进一步的外推收益。最后，我们表明 RELEX 的成功源于一种“去噪”效应：通过将更新投影到秩 1 子空间，模型丢弃了原本会在外推过程中降低性能的随机优化噪声。我们的代码可在 this https URL 获取。
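The idea in this abstract (estimate a rank-1 direction from a few early checkpoints, fit a line to the projection magnitude, extrapolate) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes checkpoints are flattened into lists of floats and that a single direction is estimated over the whole parameter vector; the abstract does not say whether RELEX operates per layer or globally. Power iteration stands in here for an SVD, and all function names are hypothetical.

```python
import math


def top_right_singular_vector(rows, iters=100):
    """Dominant right singular vector of the matrix `rows`, via power iteration on A^T A.

    Starts from the last row, so the returned sign agrees with the latest delta.
    """
    dim = len(rows[0])
    norm = math.sqrt(sum(x * x for x in rows[-1]))
    if norm == 0.0:
        raise ValueError("last delta is zero; no direction to estimate")
    v = [x / norm for x in rows[-1]]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]  # A v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct steps")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def extrapolate_checkpoint(base, steps, checkpoints, target_step):
    """Predict flattened weights at `target_step` from a short observed prefix."""
    deltas = [[c - b for c, b in zip(ckpt, base)] for ckpt in checkpoints]
    direction = top_right_singular_vector(deltas)
    # Magnitude of each observed delta along the rank-1 direction.
    coeffs = [sum(d * v for d, v in zip(delta, direction)) for delta in deltas]
    slope, intercept = fit_line(steps, coeffs)
    magnitude = slope * target_step + intercept
    return [b + magnitude * v for b, v in zip(base, direction)]
```

For example, with `base = [1.0, 2.0, 3.0]` and three checkpoints that move along the direction `(0.6, 0.8, 0.0)` by `0.1` per step at steps 10, 20 and 30, extrapolating to step 100 gives approximately `[7.0, 10.0, 3.0]`. Components of the observed deltas that are orthogonal to the fitted direction are dropped, which is the projection the abstract credits with a denoising effect.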

Keyword: diffusion policy

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

NaP-Control：导航扩散先验，实现多功能且快速的角色控制

Authors: Chia-Wen Chen, Yan Wu, Korrawe Karunratanakul, Siyu Tang
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.20209
Pdf link: https://arxiv.org/pdf/2605.20209
Abstract Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.
中文摘要 在基于物理的动画中实现精确且多功能的全身角色控制仍然充满挑战。近期基于扩散的策略能生成丰富且富有表现力的动作，但通常依赖基于梯度的测试时引导来满足任务目标，这种做法速度慢且可能降低鲁棒性。我们提出 NaP-Control（Navigating Diffusion Prior for Versatile and Fast Character Control，导航扩散先验以实现多功能且快速的角色控制），简称 NaP。我们的方法利用强化学习来操控任务无关的扩散策略先验的潜在噪声，引导其朝向任务特定的行为，实现快速、鲁棒且动作保真度高的控制。与仅依赖离线训练的方法不同，NaP 在训练过程中与环境交互，以纠正动作并优化任务奖励，从而提高成功率并能适应具有挑战性的场景。通过直接预测针对任务优化的扩散噪声，NaP 消除了去噪过程中的迭代引导，实现了高效推理。实验表明，NaP 在多种任务上保持自然动作的同时，取得了更高的成功率和更快的推理速度。

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Mobile UMI：面向移动操作的解耦运动学跨视角扩散策略

Authors: Haoran Huang, Haonan Dong, Huixu Dong
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.20894
Pdf link: https://arxiv.org/pdf/2605.20894
Abstract Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
中文摘要 基于便携式演示接口的移动模仿学习面临两个相互耦合的瓶颈：被行走运动污染的动作标签，以及在持续移动的底盘上由推理引起的执行延迟。近期的腕戴式接口降低了桌面数据采集的成本，但单一的腕部视角无法捕捉底盘导航所需的全局上下文。增加一个佩戴在身体上的相机，又会使人的行走与手部运动纠缠在一起。与此同时，生成式策略会引入数百毫秒的推理延迟，在此期间底盘已越过预测的路径点，迫使在动作拼接处进行向后修正。本文提出 Mobile UMI，一个无需机器人硬件的演示框架，通过三个组成部分解决这两个问题。第一，双相机采集系统在没有任何机器人在场的情况下，记录以胸部为中心的全局上下文和以腕部为中心的局部交互。第二，基于 ChArUco 的一次性空间锚点统一了胸部和手部的视觉惯性坐标系；随后将手部位姿重新表示为相对于胸部的位姿，以提取解耦的 SE(3) 操作轨迹和 SE(2) 底盘轨迹。第三，异步滚动时域执行器执行在线状态匹配：每个生成的动作块都会与当前物理位姿重新对齐，从而在执行前丢弃已过期的路径点。完整系统在四个长时程家务任务上进行评估，每个任务 100 次试验，平均成功率达到 83.8%。与 ACT 和 Diffusion Policy 的受控对比表明，仅胸部相对标签就弥合了大部分差距，在线状态匹配则弥合了其余部分。这些结果表明，对于测试条件下的移动模仿学习，显式的运动学分解结合状态级延迟对齐提供了一种有效的解决方案，无需对底层策略类别进行架构上的改动。
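The kinematic factorization mentioned in this abstract (re-expressing the hand pose relative to the chest) can be illustrated with a minimal sketch. It assumes both poses are already in a shared world frame as 4x4 homogeneous matrices, with z pointing up. Reading the SE(2) base pose off the chest's planar position and yaw is an assumption of this sketch, not something the abstract specifies. All function names are hypothetical.

```python
import math


def yaw_pose(x, y, z, yaw):
    """4x4 homogeneous transform: rotation by `yaw` about z, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def invert_rigid(T):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def decouple(T_world_chest, T_world_hand):
    """Split one demonstration frame into a chest-relative hand pose and a planar base pose.

    Returns (T_chest_hand, (x, y, yaw)). The hand pose no longer changes when
    the demonstrator walks without moving the hand relative to the body.
    """
    T_chest_hand = matmul4(invert_rigid(T_world_chest), T_world_hand)
    yaw = math.atan2(T_world_chest[1][0], T_world_chest[0][0])
    return T_chest_hand, (T_world_chest[0][3], T_world_chest[1][3], yaw)
```

For example, with the chest at `(1, 2, 0)` facing yaw 90 degrees and the hand at `(1, 3, 1)` with the same orientation, the chest-relative hand translation is approximately `(1, 0, 1)`: one unit ahead of the chest and one unit up, independent of where the demonstrator stands in the room.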