Arxiv Papers of Today

Generated: 2026-02-05 16:48:14 (UTC+8); arXiv release time: 2026-02-05 20:00 EST (2026-02-06 09:00 UTC+8)

47 related papers today

Keyword: reinforcement learning

GOPO: Policy Optimization using Ranked Rewards

GOPO：利用排名奖励进行策略优化

Authors: Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03876
Pdf link: https://arxiv.org/pdf/2602.03876
Abstract Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains, compared to Group Relative Policy Optimization (GRPO), in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
中文摘要 标准的人类反馈强化学习（RLHF）基于成对偏好数据训练奖励模型，并用于策略优化。然而，虽然奖励模型优化以捕捉相对偏好，现有的策略优化技术仍依赖于训练期间的绝对奖励幅度。在奖励不可验证的环境中，如总结、指令跟随和聊天完成，这种错位常导致性能不理想。我们介绍群序数策略优化（GOPO），这是一种仅使用奖励排名并舍弃其大小的策略优化方法。与组相对策略优化（GRPO）相比，我们基于排名的奖励转化在奖励不可验证的环境中带来了多项提升：（1）持续更高的训练/验证奖励轨迹，（2）大多数中级训练步骤中LLM作为评判的评估质量有所提升，（3）在显著更少的训练步骤内达到与GRPO相当质量的政策。我们在多种任务和模型规模上持续改进。
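
The contrast between magnitude-based and rank-based training signals can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes GRPO-style advantages are within-group z-scores, and it uses a hypothetical ordinal transform (ranks mapped to evenly spaced values in [-1, 1]) because the abstract does not specify GOPO's exact transformation. The function names are illustrative only.

```python
import statistics


def zscore_advantages(rewards):
    """GRPO-style baseline (assumed form): standardize rewards within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def rank_advantages(rewards):
    """Hypothetical ordinal transform: keep only the ordering of the rewards.

    Ranks are mapped to evenly spaced values in [-1, 1]; tied rewards
    share the average of their ranks.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of rewards tied with position i.
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [2 * r / (n - 1) - 1 for r in ranks]
```

Under this sketch, an outlier reward changes the z-scores of every sample in the group but leaves the rank-based values untouched, which is the property the abstract attributes to discarding magnitudes.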

Autonomous AI Agents for Real-Time Affordable Housing Site Selection: Multi-Objective Reinforcement Learning Under Regulatory Constraints

自主人工智能代理用于实时经济适用房选址：监管约束下的多目标强化学习

Authors: Olaf Yunus Laitinen Imanov, Duygu Erisken, Derya Umut Kulali, Taner Yilmaz, Rana Irem Turhan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03940
Pdf link: https://arxiv.org/pdf/2602.03940
Abstract Affordable housing shortages affect billions, while land scarcity and regulations make site selection slow. We present AURA (Autonomous Urban Resource Allocator), a hierarchical multi-agent reinforcement learning system for real-time affordable housing site selection under hard regulatory constraints (QCT, DDA, LIHTC). We model the task as a constrained multi-objective Markov decision process optimizing accessibility, environmental impact, construction cost, and social equity while enforcing feasibility. AURA uses a regulatory-aware state encoding 127 federal and local constraints, Pareto-constrained policy gradients with feasibility guarantees, and reward decomposition separating immediate costs from long-term social outcomes. On datasets from 8 U.S. metros (47,392 candidate parcels), AURA attains 94.3% regulatory compliance and improves Pareto hypervolume by 37.2% over strong baselines. In a New York City 2026 case study, it reduces selection time from 18 months to 72 hours and identifies 23% more viable sites; chosen sites have 31% better transit access and 19% lower environmental impact than expert picks.
中文摘要 经济适用房短缺影响数十亿人，而土地稀缺和法规则使选址进展缓慢。我们介绍AURA（自治城市资源分配器），这是一个分层多智能体强化学习系统，用于在严格监管约束（QCT、DDA、LIHTC）下实时进行经济适用房选址。我们将该任务建模为一个受约束的多目标马尔可夫决策过程，优化可及性、环境影响、建设成本和社会公平，同时保证可行性。AURA采用编码了127项联邦和地方约束条件的监管感知状态表示、具有可行性保证的帕累托约束策略梯度，以及将即时成本与长期社会结果分离的奖励分解。在来自8个美国大都市（47,392个候选地块）的数据集上，AURA实现了94.3%的监管合规率，并相对强基线将帕累托超体积提升了37.2%。在2026年纽约市的案例研究中，该系统将选址时间从18个月缩短至72小时，并多识别出23%的可行地点；所选地点的交通便利度比专家选址高31%，环境影响低19%。

Safety-Critical Reinforcement Learning with Viability-Based Action Shielding for Hypersonic Longitudinal Flight

基于可行性动作屏蔽的安全关键强化学习，用于高超音速纵向飞行

Authors: Hossein Rastgoftar
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.03968
Pdf link: https://arxiv.org/pdf/2602.03968
Abstract This paper presents a safety-critical reinforcement learning framework for nonlinear dynamical systems with continuous state and input spaces operating under explicit physical constraints. Hard safety constraints are enforced independently of the reward through action shielding and reachability-based admissible action sets, ensuring that unsafe behaviors are never intentionally selected during learning or execution. To capture nominal operation and recovery behavior within a single control architecture, the state space is partitioned into safe and unsafe regions based on membership in a safety box, and a mode-dependent reward is used to promote accurate tracking inside the safe region and recovery toward it when operating outside. To enable online tabular learning on continuous dynamics, a finite-state abstraction is constructed via state aggregation, and action selection and value updates are consistently restricted to admissible actions. The framework is demonstrated on a longitudinal point-mass hypersonic vehicle model with aerodynamic and propulsion couplings, using angle of attack and throttle as control inputs.
中文摘要 本文提出了一个安全关键的强化学习框架，适用于具有连续状态和输入空间、在显式物理约束下运行的非线性动力系统。硬安全约束通过动作屏蔽和基于可达性的可接受动作集独立于奖励强制执行，确保在学习或执行过程中不会有意选择不安全的行为。为了在单一控制架构内捕捉名义操作和恢复行为，状态空间根据是否属于安全盒划分为安全和不安全区域，并采用模式依赖奖励以促进安全区内的准确跟踪及在其外部运行时向该区域恢复。为了在连续动力学上实现在线表格式学习，通过状态聚合构建了有限状态抽象，动作选择和值更新始终限制在可接受的动作范围内。该框架在一个包含气动与推进耦合的纵向点质量高超音速飞行器模型上演示，使用攻角和油门作为控制输入。
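
The idea of restricting both action selection and value updates to admissible actions can be sketched for a tabular learner as follows. This is a generic shielded Q-learning sketch, not the paper's algorithm: the admissible sets are taken as given inputs (the paper derives them from reachability analysis), and the function names and hyperparameters are hypothetical.

```python
import random


def shielded_epsilon_greedy(q_row, admissible, epsilon, rng):
    """Epsilon-greedy selection that only ever considers admissible actions."""
    if not admissible:
        raise ValueError("no admissible action in this state")
    candidates = sorted(admissible)
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda a: q_row[a])


def shielded_q_update(q, s, a, r, s_next, admissible_next, alpha, gamma):
    """Q-learning update whose bootstrap maximum ranges over admissible actions only."""
    best_next = max((q[s_next][b] for b in admissible_next), default=0.0)
    q[s][a] += alpha * (r + gamma * best_next - q[s][a])
```

Because inadmissible actions never enter the `max`, a large Q-value on an unsafe action cannot leak into either the chosen action or the learning target.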

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning

可监控性作为免费礼物：RLVR如何自发地对齐推理

Authors: Zidi Xiong, Shan Chen, Himabindu Lakkaraju
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03978
Pdf link: https://arxiv.org/pdf/2602.03978
Abstract As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability--the degree to which CoT faithfully and informatively reflects internal computation--can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability--improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.
中文摘要 随着大型推理模型（LRM）的日益部署，审计其思维链（CoT）追踪的安全性变得至关重要。最新研究报告称，可监测性——即CoT忠实且富有信息地反映内部计算的程度——在可验证奖励强化学习（RLVR）早期阶段可能被视为“免费赠礼”。我们通过系统评估跨模型族群和训练领域，使这一观察得以具体化。我们的结果表明，这一效应并非普遍存在：可监测性提升高度依赖数据。特别是，我们展示了数据多样性和指令跟随数据在RLVR培训中的关键作用。我们还进一步表明，可监控性与能力是正交的——推理性能的提升并不意味着透明度的提升。通过机制分析，我们认为可监测性提升主要归因于反应分布的锐化（熵减少）和对提示词的关注增加，而非对推理痕迹的因果依赖增强。我们还揭示了可监控性动态如何随着受控培训和评估难度而变化。这些发现共同提供了RLVR下可监测性如何呈现的整体视角，明确何时可能实现收益，何时不可能。

Likelihood-Based Reward Designs for General LLM Reasoning

基于似然的奖励设计用于一般大型语言模型推理

Authors: Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, Yann Ollivier
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03979
Pdf link: https://arxiv.org/pdf/2602.03979
Abstract Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
中文摘要 通过强化学习对大型语言模型（LLMs）进行推理基准的微调，需要为每个基准测试设定特定的奖励函数，通常是二元的。这带来了两个潜在限制：需要设计奖励，以及二元奖励可能存在稀疏性。在这里，我们系统地研究了由生成参考答案（或数据中任何其他提示延续）的概率或对数概率得出的奖励，这些奖励的优点是不依赖特定验证器，且可大规模获得。近期已有多项研究倡导使用类似的奖励（如VeriFree、JEPO、RLPR、NOVER）。我们系统地比较基于似然的奖励变体与标准基线，测试其在标准数学推理基准和无外部验证器的长格式答案上的表现。我们发现，使用参考答案的对数概率作为思维链（CoT）学习的奖励，是所有设置中唯一表现良好的选项。这一奖励也与预训练中使用的下一词元对数似然损失一致。在可验证的环境中，对数概率奖励带来的成功率与标准二元奖励相当甚至更好，且困惑度明显更优。在不可验证的环境中，它们的性能与SFT相当。另一方面，基于概率的方法，如VeriFree，在不可验证的设置下因获得正确答案的概率趋近于零而停滞不前。总体而言，这确立了对数概率奖励作为CoT微调的可行方法，连接了短且可验证的答案设置和长且不可验证的答案设置。
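
The difference between probability and log-probability rewards on long answers can be seen with a small numerical sketch. It assumes, purely for illustration, that the per-token probabilities of the reference answer are already available as a list; the function names are hypothetical and no particular model API is implied.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a sequence given per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def probability_reward(token_probs):
    """Reward equal to the sequence probability (vanishes for long answers)."""
    return math.exp(sequence_log_prob(token_probs))


def log_probability_reward(token_probs):
    """Reward equal to the sequence log-probability (stays finite and informative)."""
    return sequence_log_prob(token_probs)


short_answer = [0.9, 0.8]        # e.g. a short numeric answer
long_answer = [0.5] * 2000       # e.g. a long-form answer
print(probability_reward(short_answer), log_probability_reward(short_answer))
print(probability_reward(long_answer), log_probability_reward(long_answer))
```

For the long answer the probability underflows to zero in double precision while the log-probability remains a usable finite number, which is consistent with the abstract's remark about vanishing probabilities in non-verifiable settings.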

After Talking with 1,000 Personas: Learning Preference-Aligned Proactive Assistants From Large-Scale Persona Interactions

与1000个角色对话后：从大规模角色互动中学习偏好一致的主动助手

Authors: Ziyi Xuan, Yiwen Wu, Zhaoyang Yan, Vinod Namboodiri, Yu Yang
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.04000
Pdf link: https://arxiv.org/pdf/2602.04000
Abstract Smart assistants increasingly act proactively, yet mistimed or intrusive behavior often causes users to lose trust and disable these features. Learning user preferences for proactive assistance is difficult because real-world studies are costly, limited in scale, and rarely capture how preferences change across multiple interaction sessions. Large language model based generative agents offer a way to simulate realistic interactions, but existing synthetic datasets remain limited in temporal depth, diverse personas, and multi-dimensional preferences. They also provide little support for transferring population-level insights to individual users under on-device constraints. We present a population-to-individual learning framework for preference-aligned proactive assistants that operates under on-device and privacy constraints. Our approach uses large-scale interaction simulation with 1,000 diverse personas to learn shared structure in how users express preferences across recurring dimensions such as timing, autonomy, and communication style, providing a strong cold start without relying on real user logs. The assistant then adapts to individual users on device through lightweight activation-based steering driven by simple interaction feedback, without model retraining or cloud-side updates. We evaluate the framework using controlled simulations with 1,000 simulated personas and a human-subject study with 30 participants. Results show improved timing decisions and perceived interaction quality over untuned and direct-response baselines, while on-device activation steering achieves performance comparable to reinforcement learning from human feedback. Participants also report higher satisfaction, trust, and comfort as the assistant adapts over multiple sessions of interactions.
中文摘要 智能助手越来越主动，但时机不当或侵入性行为常常导致用户失去信任并禁用这些功能。了解用户对主动协助的偏好很困难，因为真实研究成本高昂、规模有限，且很少能捕捉偏好在多次交互会话中的变化。基于大型语言模型的生成智能体提供了模拟真实互动的方法，但现有的合成数据集在时间深度、多样的人格和多维偏好方面仍然有限。它们在设备约束下向个别用户传递群体层级洞察的支持也很有限。我们提出了一个面向个人的学习框架，适用于偏好一致的主动助手，并在设备端和隐私限制下运行。我们的方法利用拥有1000个不同角色的大规模交互模拟，学习用户在时间、自主性和沟通风格等反复维度上表达偏好的共享结构，提供强有力的冷启动，无需依赖真实用户日志。助手随后通过轻量级的激活引导，基于简单的交互反馈，适应设备上的个别用户，无需模型重新训练或云端更新。我们通过对1000个模拟人格的受控模拟和30名参与者的人类受试者研究来评估该框架。结果显示，相较于未调谐和直接响应基线，时间决策和互动质量感知有所提升，而设备内激活引导的表现可与基于人类反馈的强化学习相当。参与者还报告说，随着助理在多次互动中不断适应，满意度、信任度和舒适度都提升了。

Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

通过跨章节元强化学习提升LLM的上下文在线学习能力

Authors: Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04089
Pdf link: https://arxiv.org/pdf/2602.04089
Abstract Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at this https URL.
中文摘要 大型语言模型（LLMs）在所有与任务相关的信息一开始就可用时，表现优异，比如静态预测和指令跟随问题。然而，许多现实世界的决策任务本质上是在线的：关键信息必须通过互动获得，反馈延迟，有效行为需要在信息收集与利用之间取得平衡。虽然上下文学习允许在不更新权重的情况下适应，但现有大型语言模型在此类环境中往往难以可靠地利用上下文交互体验。在本研究中，我们展示了通过培训可以解决这一限制。我们介绍ORBIT，一个多任务、多情节的元强化学习框架，训练LLM从上下文中的交互中学习。经过元训练后，一个相对较小的开源模型（Qwen3-14B）展示了在完全未见环境中的上下文在线学习显著提升，性能可媲美GPT-5.2，并远远优于标准强化学习微调。缩放实验进一步显示模型规模持续提升，表明推理学习决策代理存在显著的空间。论文中结果的代码可在该 https URL 找到。

DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling

DELTA：多模态心理咨询中的深思熟虑多代理推理与强化学习

Authors: Jiangnan Yang, Junjie Chen, Fei Wang, Yiqi Nie, Yuxin Liu, Zhangling Duan, Jie Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04112
Pdf link: https://arxiv.org/pdf/2602.04112
Abstract Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients' mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
中文摘要 心理咨询是一种本质上的多模态认知过程，临床医生将口头内容与视觉和声音线索结合起来，以推断客户的心理状态并以同理心做出反应。然而，大多数现有基于语言模型的咨询系统仅依赖文本，依赖隐性心理状态推断。我们介绍了DELTA，一种审议式多代理框架，将咨询建模为基于多模态信号的结构化推理过程，分离证据基础、心理状态抽象和反应生成。DELTA进一步结合了由分布级情绪调谐评分指导的强化学习，以鼓励情感调和的反应。多模态咨询基准测试的实验显示，DELTA在各模型中提升了咨询质量和情绪调频。消融和定性分析表明，显式多模态推理和结构化心理状态表征在支持共情人与人工智能互动中起着互补作用。

Learning to Reason in 13 Parameters

用13个参数学会推理

Authors: John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04118
Pdf link: https://arxiv.org/pdf/2602.04118
Abstract Recent research has shown that language models can learn to reason, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90% of performance improvements while training 1000x fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require 100-1000x larger updates to reach the same performance.
中文摘要 最新研究表明，语言模型可以学会推理，且通常是通过强化学习实现的。有些工作甚至训练低秩参数化进行推理，但传统LoRA无法缩小到模型维度以下。我们质疑学习推理是否真的需要哪怕是秩为1的LoRA，并提出了TinyLoRA，这是一种将低秩适配器缩小到最少仅一个参数的方法。在我们的新参数化中，我们仅用13个以bf16存储的训练参数（共26字节），就能将8B参数规模的Qwen2.5在GSM8K上训练到91%的准确率。我们发现这一趋势普遍成立：在一系列更难的学习推理基准测试（如AIME、AMC和MATH500）中，我们能够恢复90%的性能提升，同时训练的参数数量减少到千分之一。值得注意的是，我们只有在强化学习下才能达到如此强劲的性能：使用SFT训练的模型需要大100到1000倍的更新才能达到同样的性能。

Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting

时间与风险解耦：风险敏感强化学习与一般贴现

Authors: Mehrdad Moghimi, Anthony Coache, Hyejin Ku
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04131
Pdf link: https://arxiv.org/pdf/2602.04131
Abstract Distributional reinforcement learning (RL) is a powerful framework increasingly adopted in safety-critical domains for its ability to optimize risk-sensitive objectives. However, the role of the discount factor is often overlooked, as it is typically treated as a fixed parameter of the Markov decision process or tunable hyperparameter, with little consideration of its effect on the learned policy. In the literature, it is well-known that the discounting function plays a major role in characterizing time preferences of an agent, which an exponential discount factor cannot fully capture. Building on this insight, we propose a novel framework that supports flexible discounting of future rewards and optimization of risk measures in distributional RL. We provide a technical analysis of the optimality of our algorithms, show that our multi-horizon extension fixes issues raised with existing methodologies, and validate the robustness of our methods through extensive experiments. Our results highlight that discounting is a cornerstone in decision-making problems for capturing more expressive temporal and risk preferences profiles, with potential implications for real-world safety-critical applications.
中文摘要 分布式强化学习（RL）是一种强大的框架，因其优化风险敏感目标的能力，越来越被安全关键领域采用。然而，贴现因子的作用常被忽视，因为它通常被视为马尔可夫决策过程中的固定参数或可调超参数，很少考虑其对所学策略的影响。文献中众所周知，贴现函数在刻画代理人的时间偏好中起着重要作用，而指数贴现因子无法完全捕捉到这一点。基于这一见解，我们提出了一个新框架，支持未来奖励的灵活折现和分布式强化学习中风险衡量的优化。我们对算法的最优性进行了技术分析，展示了多视野扩展修复了现有方法中出现的问题，并通过大量实验验证了方法的稳健性。我们的结果凸显了贴现是决策问题的基石，用于捕捉更具表现力的时间和风险偏好轮廓，这对现实安全关键应用具有潜在启示意义。
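
The claim that an exponential discount factor cannot express every time preference can be illustrated by comparing it with one non-exponential alternative. The sketch below uses hyperbolic discounting purely as an example of "general discounting"; the abstract does not say which discount functions the paper studies, and the parameter values are arbitrary assumptions.

```python
def exponential_discount(t, gamma=0.9):
    """Standard MDP discounting: weight gamma**t on the reward at step t."""
    return gamma ** t


def hyperbolic_discount(t, k=0.1):
    """Hyperbolic discounting (illustrative choice): weight 1 / (1 + k*t)."""
    return 1.0 / (1.0 + k * t)


def discounted_return(rewards, discount):
    """Sum of rewards weighted by an arbitrary discount function of the time step."""
    return sum(discount(t) * r for t, r in enumerate(rewards))


def one_step_ratio(discount, t):
    """How much one extra step of delay shrinks the weight at time t."""
    return discount(t + 1) / discount(t)
```

With exponential discounting `one_step_ratio` is the same at every `t`, whereas with the hyperbolic form it grows with `t`, so the agent becomes more patient about delays that lie further in the future.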

Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking

利用库普曼算子理论进行四旋翼轨迹跟踪的李雅普诺夫约束软演员-批判者（LC-SAC）

Authors: Dhruv S. Kushwaha, Zoleikha A. Biron
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.04132
Pdf link: https://arxiv.org/pdf/2602.04132
Abstract Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has been significant work on incorporating Lyapunov-based stability guarantees in RL algorithms, with key challenges being the selection of a candidate Lyapunov function, computational complexity from excessive function approximators, and conservative policies caused by incorporating stability criteria in the learning process. In this work we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We propose the use of extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed-form solution for a candidate Lyapunov function. This derived Lyapunov function is incorporated in the SAC algorithm to further provide guarantees for a policy that stabilizes the nonlinear system. The results are evaluated on trajectory tracking in a 2D Quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations of the Lyapunov stability criterion compared to the baseline vanilla SAC algorithm. GitHub Repository: this https URL
中文摘要 强化学习（RL）在解决复杂的顺序决策问题方面取得了显著成功。然而，其在安全关键物理系统的应用仍受限于缺乏稳定性保证。标准强化学习算法优先考虑奖励最大化，通常会产生可能导致振荡或无界状态发散的策略。在强化学习算法中融入基于李雅普诺夫的稳定性保证方面已有大量工作，主要挑战包括选择候选李雅普诺夫函数、使用过多函数近似器带来的计算复杂度，以及在学习过程中引入稳定性准则导致的策略保守。在本研究中，我们提出了一种利用库普曼算子理论的新颖李雅普诺夫约束软演员-批判者（LC-SAC）算法。我们提出使用扩展动态模式分解（EDMD）来生成系统的线性近似，并利用该近似推导出候选李雅普诺夫函数的闭式解。该导出的李雅普诺夫函数被纳入SAC算法中，进一步为稳定非线性系统的策略提供保证。结果在基于safe-control-gym的二维四旋翼环境的轨迹跟踪任务上进行了评估。与基线原版SAC算法相比，所提出的算法表现出训练收敛，且对李雅普诺夫稳定性准则的违反逐渐减少。GitHub 仓库：这个 https URL

Topology-Aware Revival for Efficient Sparse Training

拓扑感知复兴以实现高效稀疏训练

Authors: Meiling Jin, Fei Wang, Xiaoyun Yuan, Chen Qian, Yuan Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04166
Pdf link: https://arxiv.org/pdf/2602.04166
Abstract Static sparse training is a promising route to efficient learning by committing to a fixed mask pattern, yet the constrained structure reduces robustness. Early pruning decisions can lock the network into a brittle structure that is difficult to escape, especially in deep reinforcement learning (RL) where the evolving policy continually shifts the training distribution. We propose Topology-Aware Revival (TAR), a lightweight one-shot post-pruning procedure that improves static sparsity without dynamic rewiring. After static pruning, TAR performs a single revival step by allocating a small reserve budget across layers according to topology needs, reactivating a few previously pruned connections uniformly at random within each layer, and then keeping the resulting connectivity fixed for the remainder of training. Across multiple continuous-control tasks with SAC and TD3, TAR improves final return over static sparse baselines by up to +37.9% and also outperforms dynamic sparse training baselines with a median gain of +13.5%.
中文摘要 静态稀疏训练通过固定掩码模式实现高效学习是一种有前景的途径，但受限结构降低了鲁棒性。过早的修剪决策可能将网络锁定在难以摆脱的脆弱结构中，尤其是在深度强化学习（RL）中，策略不断变化，训练分布不断变化。我们提出了拓扑感知复兴（TAR），这是一种轻量级的一次性剪枝后程序，能够在不进行动态重布线的情况下改善静态稀疏性。静态剪枝后，TAR执行一次复活步骤，根据拓扑需求在各层间分配少量预备预算，随机均匀地重新激活每层中先前修剪的部分连接，然后在剩余训练期间保持连接性不变。在多项连续控制任务中，TAR在静态稀疏基线上提升最终回报最高达+37.9%，且中位增益为+13.5%，优于动态稀疏训练基线。
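
The single revival step described above can be sketched on flat binary masks. This is a simplified illustration, not the TAR procedure itself: the per-layer budgets are passed in directly because the abstract does not specify how they are derived from topology, and the function names are hypothetical.

```python
import random


def revive_layer(mask, budget, rng):
    """Reactivate up to `budget` pruned entries of a flat 0/1 mask, uniformly at random."""
    pruned = [i for i, m in enumerate(mask) if m == 0]
    chosen = rng.sample(pruned, min(budget, len(pruned)))
    new_mask = list(mask)
    for i in chosen:
        new_mask[i] = 1
    return new_mask


def revive_network(masks, budgets, seed=0):
    """One-shot revival: apply `revive_layer` once per layer, then keep the masks fixed.

    `budgets[i]` is the (assumed, externally supplied) number of connections
    to revive in layer i.
    """
    rng = random.Random(seed)
    return [revive_layer(m, b, rng) for m, b in zip(masks, budgets)]
```

The returned masks are meant to be applied once after static pruning and left unchanged for the rest of training, in contrast to dynamic sparse training, which rewires repeatedly.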

Piece of CAKE: Adaptive Execution Engines via Microsecond-Scale Learning

小菜一碟：通过微秒级学习实现自适应执行引擎

Authors: Zijie Zhao, Ryan Marcus
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04181
Pdf link: https://arxiv.org/pdf/2602.04181
Abstract Low-level database operators often admit multiple physical implementations ("kernels") that are semantically equivalent but have vastly different performance characteristics depending on the input data distribution. Existing database systems typically rely on static heuristics or worst-case optimal defaults to select these kernels, often missing significant performance opportunities. In this work, we propose CAKE (Counterfactual Adaptive Kernel Execution), a system that learns to select the optimal kernel for each data "morsel" using a microsecond-scale contextual multi-armed bandit. CAKE circumvents the high latency of traditional reinforcement learning by exploiting the cheapness of counterfactuals -- selectively running multiple kernels to obtain full feedback -- and compiling policies into low-latency regret trees. Experimentally, we show that CAKE can reduce end-to-end workload latency by up to 2x compared to state-of-the-art static heuristics.
中文摘要 底层数据库算子通常允许多个物理实现（“内核”），这些实现语义等价，但根据输入数据分布的不同，性能特性差异很大。现有数据库系统通常依赖静态启发式或最坏情况下的最优默认选择这些内核，常常错失显著的性能机会。在本研究中，我们提出了CAKE（反事实自适应内核执行）系统，该系统利用微秒级上下文多臂老虎机，学习为每个数据块（morsel）选择最优内核。CAKE利用反事实的低成本（选择性地运行多个内核以获得完整反馈），并将策略编译为低延迟的遗憾树，从而绕过了传统强化学习的高延迟。实验显示，与最先进的静态启发式相比，CAKE可将端到端工作负载延迟降低多达2倍。
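
The notion of full feedback from counterfactual runs can be sketched with a very small contextual selector. This is an illustrative toy, not CAKE: it keeps a running mean cost per (context, kernel) pair, assumes contexts are hashable labels, takes costs as plain numbers rather than measured latencies, and does not attempt the regret-tree compilation mentioned in the abstract. All names are hypothetical.

```python
class KernelSelector:
    """Per-context running mean of the observed cost of each kernel."""

    def __init__(self, kernel_names):
        self.kernel_names = list(kernel_names)
        self.stats = {}  # context -> {kernel name: [count, mean cost]}

    def observe(self, context, name, cost):
        """Bandit feedback: only the kernel that actually ran is updated."""
        entry = self.stats.setdefault(context, {}).setdefault(name, [0, 0.0])
        entry[0] += 1
        entry[1] += (cost - entry[1]) / entry[0]

    def observe_all(self, context, costs):
        """Full feedback: every kernel was run on the same morsel."""
        for name, cost in costs.items():
            self.observe(context, name, cost)

    def choose(self, context):
        """Pick an untried kernel if any, otherwise the lowest mean cost."""
        seen = self.stats.get(context, {})
        for name in self.kernel_names:
            if name not in seen:
                return name
        return min(self.kernel_names, key=lambda n: seen[n][1])
```

Calling `observe_all` on an occasional morsel gives the selector a cost estimate for every kernel at once, which is the advantage of cheap counterfactuals over ordinary bandit feedback.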

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

缺失的一半：揭示部署之外的训练期隐性安全风险

Authors: Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04196
Pdf link: https://arxiv.org/pdf/2602.04196
Abstract Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
中文摘要 AI模型的安全风险在部署时已被广泛研究，例如诱导有害输出的越狱攻击。相比之下，训练期间出现的安全风险大多尚未被充分探索。除了直接操纵强化学习中显式奖励函数的显式奖励黑客行为外，我们还研究隐性的训练期安全风险：由模型内部激励和上下文背景信息驱动的有害行为。例如，在基于代码的强化学习过程中，模型可能会为了自我保护而暗中篡改记录的准确率。我们首次对该问题进行系统研究，提出了包含五个风险等级、十个细粒度风险类别和三种激励类型的分类法。大量实验揭示了这些风险的普遍性和严重程度：值得注意的是，在仅提供背景信息的情况下，Llama-3.1-8B-Instruct在74.4%的训练运行中表现出风险行为。我们进一步分析了影响这些行为的因素，并证明隐性训练期风险在多智能体训练环境中也会出现。我们的研究结果指出了训练中一个被忽视但紧迫的安全挑战。

Steering LLMs via Scalable Interactive Oversight

通过可扩展交互监督引导大型语言模型

Authors: Enyu Zhou, Zhiheng Xi, Long Ma, Zhihao Zhang, Shihan Dou, Zhikai Lei, Guoteng Wang, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04210
Pdf link: https://arxiv.org/pdf/2602.04210
Abstract As Large Language Models increasingly automate complex, long-horizon tasks such as vibe coding, a supervision gap has emerged. While models excel at execution, users often struggle to guide them effectively due to insufficient domain expertise, the difficulty of articulating precise intent, and the inability to reliably validate complex outputs. This presents a critical challenge in scalable oversight: enabling humans to responsibly steer AI systems on tasks that surpass their own ability to specify or verify. To tackle this, we propose Scalable Interactive Oversight, a framework that decomposes complex intent into a recursive tree of manageable decisions to amplify human supervision. Rather than relying on open-ended prompting, our system elicits low-burden feedback at each node and recursively aggregates these signals into precise global guidance. Validated on a web development task, our framework enables non-experts to produce expert-level Product Requirement Documents, achieving a 54% improvement in alignment. Crucially, we demonstrate that this framework can be optimized via Reinforcement Learning using only online user feedback, offering a practical pathway for maintaining human control as AI scales.
中文摘要 随着大型语言模型越来越多地自动化复杂的长程任务，如 vibe coding，监督缺口也随之出现。虽然模型执行能力强，但用户常因缺乏领域专业知识、难以表达精确意图以及无法可靠验证复杂输出而难以有效引导模型。这给可扩展监督带来了关键挑战：使人类能够负责任地引导人工智能系统执行超出自身指定或验证能力的任务。为此，我们提出了可扩展交互监督框架，该框架将复杂意图分解为可管理决策的递归树，以增强人工监督。我们的系统不依赖开放式提示，而是在每个节点征询低负担反馈，并递归地将这些信号聚合为精确的全局指导。经过网页开发任务验证，我们的框架使非专家能够生成专家级的产品需求文档，实现了54%的对齐提升。关键是，我们展示了该框架可以通过强化学习仅利用在线用户反馈进行优化，为在AI扩展过程中保持人类控制提供了切实可行的路径。

ALORE: Autonomous Large-Object Rearrangement with a Legged Manipulator

ALORE：带腿机械臂的自主大型物体重排

Authors: Zhihai Bi, Yushan Zhang, Kai Chen, Guoyang Zhao, Yulin Li, Jun Ma
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.04214
Pdf link: https://arxiv.org/pdf/2602.04214
Abstract Endowing robots with the ability to rearrange various large and heavy objects, such as furniture, can substantially alleviate human workload. However, this task is extremely challenging due to the need to interact with diverse objects and efficiently rearrange multiple objects in complex environments while ensuring collision-free loco-manipulation. In this work, we present ALORE, an autonomous large-object rearrangement system for a legged manipulator that can rearrange various large objects across diverse scenarios. The proposed system is characterized by three main features: (i) a hierarchical reinforcement learning training pipeline for multi-object environment learning, where a high-level object velocity controller is trained on top of a low-level whole-body controller to achieve efficient and stable joint learning across multiple objects; (ii) two key modules, a unified interaction configuration representation and an object velocity estimator, that allow a single policy to regulate planar velocity of diverse objects accurately; and (iii) a task-and-motion planning framework that jointly optimizes object visitation order and object-to-target assignment, improving task efficiency while enabling online replanning. Comparisons against strong baselines show consistent superiority in policy generalization, object-velocity tracking accuracy, and multi-object rearrangement efficiency. Key modules are systematically evaluated, and extensive simulations and real-world experiments are conducted to validate the robustness and effectiveness of the entire system, which successfully completes 8 continuous loops to rearrange 32 chairs over nearly 40 minutes without a single failure, and executes long-distance autonomous rearrangement over an approximately 40 m route. The open-source packages are available at this https URL.
中文摘要 赋予机器人重新排列各种大型重物（如家具）的能力，可以显著减轻人类的工作负担。然而，由于需要与多样物体交互，并在复杂环境中高效重排多个物体，同时确保无碰撞的移动操作，这一任务极具挑战性。本研究介绍了ALORE，一种适用于足式机械臂的自主大型物体重排系统，能够在不同场景中重新排列各种大型物体。该系统具有三大主要特点：（i）用于多物体环境学习的分层强化学习训练流水线，将高层物体速度控制器训练于低层全身控制器之上，实现跨多个物体的高效稳定联合学习；（ii）两个关键模块，即统一的交互配置表示和物体速度估计器，使单一策略能够准确调节不同物体的平面速度；以及（iii）任务与运动规划框架，联合优化物体访问顺序和物体到目标的分配，提高任务效率并支持在线重新规划。与强基线的比较显示，该系统在策略泛化、物体速度跟踪精度和多物体重排效率方面均具有一致的优势。关键模块经过系统评估，并进行了大量仿真和真实环境实验，以验证整个系统的稳健性和有效性：该系统成功完成8个连续循环，在近40分钟内重新排列32把椅子且无一失败，并在约40米的路线上执行了长距离自主重排。开源软件包可在此 https URL 访问。

CoLT: Reasoning with Chain of Latent Tool Calls

CoLT：用一连串潜在工具调用进行推理

Authors: Fangwei Zhu, Zhifang Sui
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04246
Pdf link: https://arxiv.org/pdf/2602.04246
Abstract Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as "tool calls". Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information about a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks the seed tokens back into a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
中文摘要 思维链（CoT）是提升大型语言模型（LLMs）推理能力的关键技术，潜在推理方法已被提出以加速低效的token级推理链。我们注意到现有的潜在推理方法通常需要模型结构扩充和大量训练，限制了其更广泛的适用性。本文提出了CoLT，一种将潜在推理实现为“工具调用”的新颖框架。CoLT不是完全在潜在空间进行推理，而是生成包含推理步骤信息的种子token。当触发潜在工具调用时，一个较小的外部模型会将种子token的隐藏状态作为输入，并将种子token解包还原为完整的推理步骤。通过这种方式，我们可以确保主模型在显式token空间中推理，既保持其能力，又提升效率。在四个数学数据集上的实验结果表明，CoLT比基线潜在模型实现了更高的准确率和更短的推理长度，并且兼容强化学习算法和不同的解码器结构。

Scaling Agentic Verifier for Competitive Coding

面向竞赛编程的智能体验证器扩展

Authors: Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04254
Pdf link: https://arxiv.org/pdf/2602.04254
Abstract Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
中文摘要 大型语言模型（LLM）展示了强大的编码能力，但仍难以一次性正确解决竞赛编程问题。基于执行的重排序提供了一种有前景的测试时扩展策略，但现有方法受限于测试用例生成困难或随机输入采样效率低下。为解决这一限制，我们提出了Agentic Verifier，一种基于执行的智能体，能够主动推理程序行为，并搜索高区分度的测试输入，从而暴露候选解之间的行为差异。通过与代码执行环境的多轮交互，验证器迭代优化候选输入生成器，生成有针对性的反例，而非盲目采样输入。我们通过结合大规模数据合成、拒绝微调和智能体强化学习的可扩展流水线，训练验证器获得这种判别性输入生成能力。在五个竞赛编程基准上的大量实验表明，该方法相比强大的基于执行的基线取得了持续提升，Best@K准确率的绝对提升最高可达+10-15%。进一步分析显示了明显的测试时扩展行为，并凸显了验证器在重排序之外的更广泛潜力。
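To make the idea of execution-based re-ranking concrete, here is a minimal sketch. It is not the paper's Agentic Verifier: the abstract does not specify the selection rule, so this example assumes a simple majority-agreement rule over output signatures, and all function names are hypothetical. It also shows what "discriminative" means for a test input: candidates disagree on it.

```python
from collections import Counter


def output_signature(program, inputs):
    """Run one candidate on every input and record its outputs (or error type)."""
    signature = []
    for x in inputs:
        try:
            signature.append(repr(program(x)))
        except Exception as exc:  # a crash is also observable behavior
            signature.append(f"error:{type(exc).__name__}")
    return tuple(signature)


def is_discriminative(programs, x):
    """An input is discriminative if at least two candidates behave differently on it."""
    return len({output_signature(p, [x]) for p in programs}) > 1


def rerank_by_agreement(programs, inputs):
    """Assumed rule: pick the first candidate from the largest behavior cluster."""
    signatures = [output_signature(p, inputs) for p in programs]
    best_signature, _ = Counter(signatures).most_common(1)[0]
    return signatures.index(best_signature)


# Three hypothetical candidate solutions for "absolute value".
candidates = [
    lambda x: abs(x),
    lambda x: x if x > 0 else -x,
    lambda x: x,  # buggy: wrong for negative inputs
]
chosen = rerank_by_agreement(candidates, [3, -3])
```

In this toy case the input `3` cannot separate the candidates, while `-3` exposes the buggy one, which is why targeted inputs are worth more than blind sampling.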

From Ambiguity to Action: A POMDP Perspective on Partial Multi-Label Ambiguity and Its Horizon-One Resolution

从歧义到行动：POMDP视角下的部分多标签歧义及其单步视野（Horizon-One）消解

Authors: Hanlin Pan, Yuhao Tang, Wanfu Gao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04255
Pdf link: https://arxiv.org/pdf/2602.04255
Abstract In partial multi-label learning (PML), the true labels are unobserved, which makes label disambiguation important but difficult. A key challenge is that ambiguous candidate labels can propagate errors into downstream tasks such as feature engineering. To solve this issue, we jointly model the disambiguation and feature selection tasks as Partially Observable Markov Decision Processes (POMDPs) to turn PML risk minimization into expected-return maximization. Stage 1 trains a transformer policy via reinforcement learning to produce high-quality hard pseudo-labels; Stage 2 formulates feature selection as a sequential reinforcement learning problem, selecting features step by step and outputting an interpretable global ranking. We further provide a theoretical analysis of the PML-POMDP correspondence and an excess-risk bound that decomposes the error into a pseudo-label quality term and a sample-size term. Experiments across multiple metrics and datasets verify the advantages of the framework.
中文摘要 在部分多标签学习（PML）中，真实标签是未被观察到的，这使得标签消歧变得重要但困难。一个关键挑战是，有歧义的候选标签可能会将错误传播到下游任务中，如特征工程。为解决此问题，我们将消歧和特征选择任务联合建模为部分可观测马尔可夫决策过程（POMDP），将PML风险最小化转化为期望回报最大化。第一阶段通过强化学习训练Transformer策略，生成高质量的硬伪标签；第二阶段将特征选择描述为序贯强化学习问题，逐步选择特征并输出可解释的全局排序。我们还进一步提供了PML-POMDP对应关系的理论分析，以及将误差分解为伪标签质量项和样本量项的超额风险界。在多个指标和数据集上的实验验证了该框架的优势。

Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning

从增厚到变薄：通过人类启发的学习动态实现的奖励塑造，用于大型语言模型推理

Authors: Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, Gao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04265
Pdf link: https://arxiv.org/pdf/2602.04265
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and DeepSeek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
中文摘要 带可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的有前景范式。然而，它经常面临熵坍缩、过度冗长以及对难题探索不足等挑战。关键在于，现有的奖励方案未能区分解题过程中大量搜索的需求与已掌握知识所需的效率。在本研究中，我们提出了T2T（Thickening-to-Thinning，从增厚到变薄），这是一个受人类学习过程启发的动态奖励框架。具体来说，它实现了双阶段机制：（1）在尝试错误时，T2T激励“增厚”（更长的轨迹）以拓宽搜索空间并探索新的解题路径；（2）在达到正确后，转为“变薄”，施加长度惩罚以抑制冗余，从而增强模型信心并固化推理能力。在Qwen系列和DeepSeek模型上对数学基准（MATH-500、AIME、AMC）进行的大量实验表明，T2T显著优于标准GRPO和近期基线。
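The dual-phase mechanism described above can be illustrated with a small reward function. This is only a sketch under stated assumptions: the abstract does not give the functional form, so the linear length terms, the coefficients `alpha` and `beta`, and the name `t2t_style_reward` are all hypothetical choices made for illustration.

```python
def t2t_style_reward(correct: bool, length: int, max_length: int,
                     alpha: float = 0.2, beta: float = 0.2) -> float:
    """Length-aware reward: reward length when wrong, penalize it when right.

    alpha and beta are assumed coefficients (not from the paper); with both
    below 0.5, every correct answer still scores higher than any incorrect one.
    """
    # Fraction of the length budget used, clamped to [0, 1].
    frac = min(max(length / max_length, 0.0), 1.0)
    if correct:
        # "Thinning": a correct answer loses reward as it gets longer.
        return 1.0 - beta * frac
    # "Thickening": an incorrect attempt earns a small bonus for exploring longer.
    return alpha * frac


short_correct = t2t_style_reward(True, 100, 1000)
long_correct = t2t_style_reward(True, 900, 1000)
short_wrong = t2t_style_reward(False, 100, 1000)
long_wrong = t2t_style_reward(False, 900, 1000)
```

Keeping the incorrect-branch bonus strictly below the correct-branch minimum is a design choice of this sketch, so that length never outweighs correctness.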

MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation

MiniRec：面向基于LLM的推荐的数据高效强化学习

Authors: Lin Wang, Yang Zhang, Jingfan Chen, Xiaoyan Zhao, Fengbin Zhu, Qing Li, Tat-Seng Chua
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.04278
Pdf link: https://arxiv.org/pdf/2602.04278
Abstract The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based LLM recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss- or gradient-driven or dataset coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based LLM recommendation. MiniRec evaluates sample learnability using key RL signals -- rewards -- pruning samples that are too easy (too high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated "ideal" global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based LLM recommendation.
中文摘要 强化学习（RL）与大型语言模型（LLMs）的整合，通过激发推理和改进用户偏好建模，为推荐系统开辟了新机遇。然而，基于强化学习的LLM推荐面临显著的效率挑战，使得全数据训练成本高昂。现有的数据选择方法基于可学习性或代表性来定义样本价值，但其损失驱动、梯度驱动或数据集覆盖率驱动的标准常与强化学习的学习动态不匹配，导致性能不理想。为此，我们提出了MiniRec，一个专为基于强化学习的LLM推荐量身打造的数据选择框架。MiniRec 通过关键的强化学习信号（即奖励）评估样本的可学习性，剔除过于简单（奖励过高）或过难（奖励持续偏低）的样本。它通过将样本梯度与近似的“理想”全局强化学习优化轨迹对齐来评估代表性，选择主要驱动模型更新的样本，同时强制多样性以减少冗余。结合从易到难的课程学习策略，MiniRec显著降低了训练成本，同时在很大程度上保持了性能。大量实验证明了MiniRec的有效性，凸显了在基于强化学习的LLM推荐中进行奖励对齐、轨迹感知的数据选择的重要性。
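The reward-based learnability filter described above can be sketched as follows. This covers only the pruning step, not the gradient-alignment or diversity components, and it assumes rewards in [0, 1]; the thresholds `low` and `high` and the name `prune_by_reward` are hypothetical, since the abstract does not state how "too easy" and "too difficult" are decided.

```python
from statistics import mean


def prune_by_reward(rollout_rewards, low=0.1, high=0.9):
    """Keep samples whose average rollout reward is neither too low nor too high.

    rollout_rewards maps a sample id to the rewards of several rollouts.
    low/high are assumed thresholds for illustration only.
    """
    kept = []
    for sample_id, rewards in rollout_rewards.items():
        average = mean(rewards)
        if low <= average <= high:
            kept.append(sample_id)
    return kept


# Hypothetical rewards from four rollouts per training sample.
rewards_by_sample = {
    "easy": [1.0, 1.0, 1.0, 1.0],    # always solved: little learning signal
    "hard": [0.0, 0.0, 0.0, 0.0],    # never solved: little learning signal
    "useful": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative
}
selected = prune_by_reward(rewards_by_sample)
```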

ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

ECG-R1：协议引导、模态无关的MLLM，实现可靠的心电图解读

Authors: Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04279
Pdf link: https://arxiv.org/pdf/2602.04279
Abstract Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at this https URL, and an online platform can be accessed at this http URL.
中文摘要 心电图（ECG）是临床实践中不可或缺的诊断工具，但现有的多模态大型语言模型（MLLMs）在心电图解读方面仍然不可靠，常常给出看似合理但临床上不正确的分析。为此，我们提出了ECG-R1，这是首个为可靠心电图解读而设计的推理型MLLM，包含三项创新。首先，我们利用协议引导的指令数据生成（Protocol-Guided Instruction Data Generation）构建解读语料库，使解读建立在可测量的心电图特征以及专著定义的定量阈值和诊断逻辑之上。其次，我们提出了带有交错模态丢弃（Interleaved Modality Dropout）的模态解耦架构，以提升心电图信号或心电图图像缺失时的鲁棒性和跨模态一致性。第三，我们提出带心电图诊断证据奖励的强化学习（Reinforcement Learning with ECG Diagnostic Evidence Rewards），以加强基于证据的心电图解读。此外，我们系统评估了专有、开源和医学MLLM的心电图解读能力，并首次给出定量证据表明严重幻觉普遍存在，这提示公众不应在未经独立验证的情况下直接信任这些输出。代码和数据公开于此 https URL，在线平台可通过此 http URL 访问。

Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

Agent-Omit：通过智能体强化学习训练高效的LLM智能体，实现自适应的思考与观察省略

Authors: Yansong Ning, Jun Fang, Naiqiang Tan, Hao Liu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04284
Pdf link: https://arxiv.org/pdf/2602.04284
Abstract Managing agent thought and observation during multi-turn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat entire interaction trajectories equally, overlooking that thought necessity and observation utility vary across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by the KL divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B obtains performance comparable to seven frontier LLM agents and achieves a better effectiveness-efficiency trade-off than seven efficient LLM agent methods. Our code and data are available at this https URL.
中文摘要 在多轮智能体-环境交互中管理智能体的思考和观察，是一种提高智能体效率的新兴策略。然而，现有研究对整个交互轨迹一视同仁，忽视了思考的必要性和观察的效用会随轮次变化。为此，我们首先进行定量研究，探讨思考和观察如何影响智能体的有效性和效率。基于我们的发现，我们提出了Agent-Omit，一种统一的训练框架，使LLM智能体能够自适应地省略冗余的思考和观察。具体来说，我们首先合成少量冷启动数据，包括单轮和多轮省略场景，以微调智能体使其具备省略行为。此外，我们引入了一种省略感知的智能体强化学习方法，结合双重采样机制和定制的省略奖励，以激励智能体的自适应省略能力。理论上，我们证明了省略策略的偏差以KL散度为上界。在五个智能体基准上的实验结果显示，我们构建的Agent-Omit-8B性能可与七个前沿LLM智能体相当，并在有效性与效率的权衡上优于七种高效LLM智能体方法。我们的代码和数据可在此 https URL 访问。

Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

引导验证器：通过动态过程监督实现协作多模态推理

Authors: Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04290
Pdf link: https://arxiv.org/pdf/2602.04290
Abstract Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
中文摘要 强化学习（RL）已成为提升多模态大型语言模型（MLLM）复杂推理能力的关键机制。然而，主流范式通常依赖于模型独自工作的单独rollout策略。由于缺乏中间监督，推理过程容易发生错误传播，早期的逻辑偏差会级联成不可逆的失败，导致优化信号充满噪声。本文提出了Guided Verifier框架，以解决这些结构性局限。超越被动的终端奖励，我们引入了一个动态验证器，能主动与策略共同求解任务。在rollout阶段，该验证器实时与策略模型交互，检测不一致并提供方向信号，引导模型走向有效轨迹。为此，我们开发了针对多模态幻觉的专用数据合成流程，构建了由过程级负样本和正确引导推理（Correct-guide Reasoning）轨迹组成的CoRe数据集，用于训练引导验证器。在MathVista、MathVerse和MMMU上的大量实验表明，通过将计算分配给协作推理和动态验证，8B参数模型可以实现强劲的性能。

HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

HoRD：通过历史条件强化学习和在线蒸馏实现鲁棒的人形机器人控制

Authors: Puyue Wang, Jiawei Hu, Yan Gao, Junyan Wang, Yu Zhang, Gillian Dobbie, Tao Gu, Wafa Johal, Ting Dang, Hong Jia
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04412
Pdf link: https://arxiv.org/pdf/2602.04412
Abstract Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state-action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at this https URL.
中文摘要 在动力学、任务规格或环境设置发生微小变化时，人形机器人的性能可能会显著下降。我们提出了HoRD，这是一个两阶段学习框架，用于在领域偏移下实现鲁棒的人形机器人控制。首先，我们通过历史条件强化学习训练高性能教师策略，该策略从近期的状态-动作轨迹中推断潜在的动力学上下文，从而在线适应多样化的随机动力学。其次，我们进行在线蒸馏，将教师的鲁棒控制能力迁移到基于Transformer的学生策略中，该策略在稀疏的、相对于根节点的三维关节关键点轨迹上运行。通过结合历史条件适应与在线蒸馏，HoRD使得单一策略能够零样本适应未见领域，而无需逐域重新训练。大量实验表明，HoRD在鲁棒性和迁移方面优于强基线，尤其是在未见领域和外部扰动下。代码和项目页面可在此 https URL 获取。

EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL

EMA策略梯度：用EMA锚点和Top-k KL驯服LLM强化学习

Authors: Lunjun Zhang, Jimmy Ba
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04417
Pdf link: https://arxiv.org/pdf/2602.04417
Abstract Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce a Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using an EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% → 44.1% on HotpotQA and 27.4% → 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: this https URL
中文摘要 强化学习（RL）使大型语言模型（LLMs）能够获得越来越复杂的推理和智能体行为。本研究提出了两种简单技术，用于改进LLM的策略梯度算法。首先，我们用指数移动平均（EMA）取代强化学习期间的固定锚点策略，类似于深度Q学习中的目标网络。其次，我们引入了Top-k KL估计器，允许在精确KL与采样KL之间灵活插值。我们推导了使用EMA锚点的稳定性条件；此外，我们证明了Top-k KL估计器在任意k下既能得到无偏的KL值，也能得到无偏的梯度，同时还带来精确KL的优点。与GRPO结合时，这两种技术（EMA-PG）可显著提升性能。在数学推理上，它使R1蒸馏的Qwen-1.5B在OlympiadBench上达到53.9%，而GRPO为50.8%。在智能体强化学习领域，基于Qwen-3B基座模型，EMA-PG在7个结合搜索引擎的问答数据集上平均将GRPO提升33.3%，其中HotpotQA从29.7%提升至44.1%，2WikiMultiHopQA从27.4%提升至40.1%。总体而言，我们展示了EMA-PG是一种简单、有原则且强大的LLM强化学习扩展方法。代码：此 https URL
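The EMA anchor mentioned above follows the standard exponential-moving-average update also used for target networks. The sketch below shows that update on a plain list of floats standing in for model parameters; the decay value and the name `ema_update` are assumptions for illustration, and the Top-k KL estimator is not reproduced here because the abstract does not define it.

```python
def ema_update(anchor, policy, decay=0.99):
    """Move each anchor parameter a small step toward the current policy.

    anchor_new = decay * anchor_old + (1 - decay) * policy
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(anchor, policy)]


# Toy parameters: the anchor starts away from the policy and drifts toward it.
anchor_params = [0.0, 1.0]
policy_params = [1.0, 1.0]

one_step = ema_update(anchor_params, policy_params, decay=0.9)

tracked = list(anchor_params)
for _ in range(200):
    tracked = ema_update(tracked, policy_params, decay=0.9)
```

With a decay close to 1 the anchor lags the policy slowly, whereas a fixed anchor corresponds to a decay of exactly 1.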

Mixture of Masters: Sparse Chess Language Models with Player Routing

大师混合：带棋手路由的稀疏国际象棋语言模型

Authors: Giacomo Frisoni, Lorenzo Molfetta, Davide Freddi, Gianluca Moro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04447
Pdf link: https://arxiv.org/pdf/2602.04447
Abstract Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. Each expert is trained with a combination of self-supervised learning and reinforcement learning guided by chess-specific rewards. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically, e.g., Tal's offensive vocation or Petrosian's defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
中文摘要 现代国际象棋语言模型是稠密的Transformer，在数千名高等级分棋手的数百万局对局上训练。然而，这些单体网络往往会坍缩为模式平均的行为，风格界限变得模糊，罕见但有效的策略被压制。为抵消同质化，我们引入了大师混合（Mixture-of-Masters，MoM），这是首个国际象棋专家混合模型，采用小型GPT专家模拟世界级特级大师。每位专家都通过自监督学习与由国际象棋专属奖励引导的强化学习相结合的方式训练。在每一步棋中，一个事后训练的可学习门控网络会根据棋局状态选择最合适的角色，使MoM能够动态切换风格，例如Tal的进攻倾向或Petrosian的防守稳健。在未见过的标准对局中与Stockfish对弈评估时，MoM的表现优于稠密的单个专家网络和基于聚合数据训练的流行GPT基线，同时确保生成的多样性、可控性和可解释性。

Learning the Value Systems of Agents with Preference-based and Inverse Reinforcement Learning

通过偏好学习和逆向强化学习学习代理的价值系统

Authors: Andrés Holgado-Sánchez, Holger Billhardt, Alberto Fernández, Sascha Ossowski
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04518
Pdf link: https://arxiv.org/pdf/2602.04518
Abstract Agreement Technologies refer to open computer systems in which autonomous software agents interact with one another, typically on behalf of humans, in order to come to mutually acceptable agreements. With the advance of AI systems in recent years, it has become apparent that such agreements, in order to be acceptable to the involved parties, must remain aligned with ethical principles and moral values. However, this is notoriously difficult to ensure, especially as different human users (and their software agents) may hold different value systems, i.e., they may weigh the importance of individual moral values differently. Furthermore, it is often hard to specify the precise meaning of a value in a particular context in a computational manner. Methods to estimate value systems based on human-engineered specifications, e.g., based on value surveys, are limited in scale due to the need for intense human moderation. In this article, we propose a novel method to automatically learn value systems from observations and human demonstrations. In particular, we propose a formal model of the value system learning problem, its instantiation to sequential decision-making domains based on multi-objective Markov decision processes, as well as tailored preference-based and inverse reinforcement learning algorithms to infer value grounding functions and value systems. The approach is illustrated and evaluated by two simulated use cases.
中文摘要 协议技术（Agreement Technologies）指的是开放的计算机系统，其中自主软件智能体彼此交互（通常代表人类），以达成双方都能接受的协议。近年来，随着人工智能系统的进步，越来越明显的是，此类协议要被相关各方接受，就必须与伦理原则和道德价值观保持一致。然而，这一点极难保证，尤其是因为不同的人类用户（及其软件智能体）可能持有不同的价值体系，即他们对各项道德价值重要性的权衡可能不同。此外，以可计算的方式明确某一价值在特定情境中的确切含义往往很困难。基于人工设计的规范（例如基于价值调查）估计价值体系的方法，由于需要大量人工干预，规模有限。本文提出了一种新颖的方法，能够从观察和人类演示中自动学习价值体系。具体而言，我们提出了价值体系学习问题的形式化模型、其在基于多目标马尔可夫决策过程的序贯决策领域中的实例化，以及为推断价值落地函数（value grounding functions）和价值体系而定制的基于偏好的强化学习和逆强化学习算法。该方法通过两个模拟用例进行了展示和评估。

Understanding Degradation with Vision Language Model

用视觉语言模型理解退化

Authors: Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.04565
Pdf link: https://arxiv.org/pdf/2602.04565
Abstract Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
中文摘要 理解视觉退化是计算机视觉中一个关键但具有挑战性的问题。虽然最新的视觉语言模型（VLMs）在定性描述方面表现出色，但它们常常在理解图像退化背后的参数化物理机制方面存在不足。在本研究中，我们将退化理解重新定义为一种层级结构化预测任务，要求同时估计退化类型、参数键及其连续物理值。尽管这些子任务在不同的空间中运作，我们证明它们可以统一在一个自回归的下一token预测范式下，其误差受值空间量化网格的限制。基于这一见解，我们提出了DU-VLM，一种多模态思维链模型，通过监督微调和使用结构化奖励的强化学习进行训练。此外，我们展示了DU-VLM可以作为预训练扩散模型的零样本控制器，无需微调生成骨干即可实现高保真图像恢复。我们还提出了DU-110k，这是一个包含11万对干净-退化图像对且带有有依据的物理标注的大规模数据集。大量实验表明，我们的方法在准确性和鲁棒性上显著优于通用基线，并能泛化到未见分布。

Dual Mind World Model Inspired Network Digital Twin for Access Scheduling

受双心智世界模型启发的网络数字孪生，用于接入调度

Authors: Hrishikesh Dutta, Roberto Minerva, Noel Crespi
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.04566
Pdf link: https://arxiv.org/pdf/2602.04566
Abstract Emerging networked systems such as industrial IoT and real-time cyber-physical infrastructures demand intelligent scheduling strategies capable of adapting to dynamic traffic, deadlines, and interference constraints. In this work, we present a novel Digital Twin-enabled scheduling framework, inspired by the Dual Mind World Model (DMWM) architecture, for learning-informed and imagination-driven network control. Unlike conventional rule-based or purely data-driven policies, the proposed DMWM combines short-horizon predictive planning with symbolic model-based rollout, enabling the scheduler to anticipate future network states and adjust transmission decisions accordingly. We implement the framework in a configurable simulation testbed and benchmark its performance against traditional heuristics and reinforcement learning baselines under varied traffic conditions. Our results show that DMWM achieves superior performance in bursty, interference-limited, and deadline-sensitive environments, while maintaining interpretability and sample efficiency. The proposed design bridges the gap between network-level reasoning and low-overhead learning, marking a step toward scalable and adaptive network digital twin (NDT)-based network optimization.
中文摘要 新兴的网络化系统，如工业物联网和实时信息物理基础设施，需要能够适应动态流量、截止时间和干扰约束的智能调度策略。本研究提出了一种受双心智世界模型（DMWM）架构启发的新型数字孪生驱动调度框架，用于学习引导和想象驱动的网络控制。与传统的基于规则或纯数据驱动的策略不同，所提出的DMWM结合了短期预测规划和基于符号模型的rollout，使调度器能够预见未来的网络状态并相应调整传输决策。我们在可配置的仿真测试平台中实现了该框架，并在不同流量条件下与传统启发式方法和强化学习基线进行了基准测试。我们的结果表明，DMWM在突发性强、干扰受限和截止时间敏感的环境中表现优异，同时保持可解释性和样本效率。该设计弥合了网络层推理与低开销学习之间的鸿沟，标志着向可扩展且自适应的基于网络数字孪生（NDT）的网络优化迈出了一步。

Reinforcement Learning-based Home Energy Management with Heterogeneous Batteries and Stochastic EV Behaviour

基于强化学习的家庭能源管理，采用异构电池和随机电动车行为

Authors: Meng Yuan, Ye Wang, Xinghuo Yu, Torsten Wik, Changfu Zou
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.04578
Pdf link: https://arxiv.org/pdf/2602.04578
Abstract The widespread adoption of photovoltaic (PV), electric vehicles (EVs), and stationary energy storage systems (ESS) in households increases system complexity while simultaneously offering new opportunities for energy regulation. However, effectively coordinating these resources under uncertainties remains challenging. This paper proposes a novel home energy management framework based on deep reinforcement learning (DRL) that can jointly minimise energy expenditure and battery degradation while guaranteeing occupant comfort and EV charging requirements. Distinct from existing studies, we explicitly account for the heterogeneous degradation characteristics of stationary and EV batteries in the optimisation, alongside stochastic user behaviour regarding arrival time, departure time, and driving distance. The energy scheduling problem is formulated as a constrained Markov decision process (CMDP) and solved using a Lagrangian soft actor-critic (SAC) algorithm. This approach enables the agent to learn optimal control policies that enforce physical constraints, including indoor temperature bounds and target EV state of charge upon departure, despite stochastic uncertainties. Numerical simulations over a one-year horizon demonstrate the effectiveness of the proposed framework in satisfying physical constraints while eliminating thermal oscillations and achieving significant economic benefits. Specifically, the method reduces the cumulative operating cost substantially compared to two standard rule-based baselines while simultaneously decreasing battery degradation costs by 8.44%.
中文摘要 光伏（PV）、电动汽车（EV）和固定式储能系统（ESS）在家庭中的广泛采用增加了系统复杂度，同时也为能源调节带来了新的机遇。然而，在不确定性下有效协调这些资源仍然具有挑战性。本文提出了一种基于深度强化学习（DRL）的新型家庭能源管理框架，能够在保障住户舒适度和电动汽车充电需求的同时，联合最小化能源支出和电池退化。与现有研究不同，我们在优化中明确考虑了固定式电池和电动汽车电池的异质退化特性，以及用户在到达时间、出发时间和行驶距离上的随机行为。能量调度问题被表述为约束马尔可夫决策过程（CMDP），并使用拉格朗日软演员-评论家（SAC）算法求解。这种方法使智能体能够在存在随机不确定性的情况下，学习满足物理约束的最优控制策略，这些约束包括室内温度范围和出发时的目标电动汽车荷电状态。一年时间范围内的数值仿真展示了该框架在满足物理约束的同时消除热振荡并实现显著经济效益方面的有效性。具体来说，与两个标准的基于规则的基线相比，该方法大幅降低了累计运营成本，同时将电池退化成本降低了8.44%。

Stochastic Decision Horizons for Constrained Reinforcement Learning

约束强化学习的随机决策视野

Authors: Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04599
Pdf link: https://arxiv.org/pdf/2602.04599
Abstract Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
中文摘要 约束马尔可夫决策过程（CMDPs）为强化学习中处理约束（如安全性及其他辅助目标）提供了一个有原则的模型。使用加性成本约束和对偶变量的常见方法常常阻碍离策略的可扩展性。我们提出了基于随机决策视野的“控制即推理”表述，其中约束违规会通过依赖于状态-动作的延续概率削弱奖励贡献并缩短有效规划视野。这产生了生存加权目标，并且在离策略演员-评论家学习中保持了与经验回放的兼容性。我们提出了两种违规语义，即吸收终止和虚拟终止，它们共享相同的生存加权回报，但产生不同的优化结构，从而得到SAC/MPO风格的策略改进。实验显示，在标准基准上样本效率有所提升，并实现了有利的回报-违规权衡。此外，带虚拟终止的MPO（VT-MPO）能够有效地扩展到我们的高维肌肉骨骼Hyfydy环境。
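A survival-weighted return of the kind described above can be sketched numerically. The abstract does not give the exact objective, so this example assumes one plausible form: each step has a continuation probability (lower after a constraint violation) that multiplies the weight of all later rewards, on top of the usual discount. The function name and the convention that a step's own reward is not attenuated are both assumptions.

```python
def survival_weighted_return(rewards, continue_probs, gamma=0.99):
    """Sum rewards weighted by discount and by the probability of having survived.

    Assumed form: G = sum_t (prod_{k<t} gamma * c_k) * r_t, where c_k is the
    continuation probability after step k (c_k < 1 models a violation).
    """
    total = 0.0
    weight = 1.0  # probability-weighted discount accumulated so far
    for reward, cont in zip(rewards, continue_probs):
        total += weight * reward
        weight *= gamma * cont
    return total


# Without violations the horizon is unaffected (gamma = 1 for readability).
safe = survival_weighted_return([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=1.0)
# A certain termination after step 1 removes all later reward.
cut_short = survival_weighted_return([1.0, 1.0, 1.0], [1.0, 0.0, 1.0], gamma=1.0)
# A partial violation at step 0 attenuates everything that follows.
attenuated = survival_weighted_return([1.0, 1.0, 1.0], [0.5, 1.0, 1.0], gamma=1.0)
```

Because the weight is computed per transition from the stored continuation value, such a return can in principle be evaluated from replayed transitions, which is consistent with the replay-compatibility the abstract emphasizes.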

QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

QUATRO：查询自适应信任区域策略优化，用于LLM微调

Authors: Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04620
Pdf link: https://arxiv.org/pdf/2602.04620
Abstract GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
中文摘要 基于GRPO风格强化学习（RL）的LLM微调算法近来广受欢迎。然而，由于依赖启发式的信任区域近似，它们可能导致脆弱的优化行为，因为全局重要性比率裁剪和组内归一化无法调控重要性比率落在裁剪范围之外的样本。我们提出了查询自适应信任区域策略优化（QUATRO），通过有原则的优化直接施加信任区域约束。这带来了一个清晰且可解释的目标，能够显式控制策略更新，并实现稳定、熵可控的优化，其中的稳定项自然地源自精确的信任区域表述。在多种数学推理基准上的实证验证表明，QUATRO在策略陈旧程度增加和学习率激进的情况下仍能稳定训练，并在整个训练过程中保持良好受控的熵。
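
The remark about samples outside the clipping range can be illustrated with the standard PPO-style clipped surrogate (this is the heuristic being criticized, not QUATRO itself). The sketch below uses a finite-difference gradient; the helper names and the choice `eps=0.2` are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard clipped surrogate for a single sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(surrogate)/d(ratio)."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)
```

For a ratio of 1.5 with a positive advantage, or 0.5 with a negative advantage, the estimated gradient is zero: once such a sample has left the clipping range, the objective exerts no force on it in either direction.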

WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

WideSeek-R1：通过多智能体强化学习探索面向广泛信息寻求的宽度扩展

Authors: Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, Yu Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.04634
Pdf link: https://arxiv.org/pdf/2602.04634
Abstract Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.
中文摘要 大型语言模型（LLM）的最新进展主要集中在深度扩展上，即由单个智能体通过多轮推理和工具使用解决长程问题。然而，随着任务范围变广，关键瓶颈从个体能力转向组织能力。本研究探索与之互补的维度，即借助多智能体系统进行宽度扩展，以应对广泛的信息寻求任务。现有的多智能体系统常常依赖手工设计的工作流程和轮流交互，无法有效地并行化工作。为弥合这一差距，我们提出了WideSeek-R1，这是一个通过多智能体强化学习（MARL）训练的主智能体-子智能体框架，用于协同可扩展的编排与并行执行。WideSeek-R1利用共享的LLM、相互隔离的上下文和专用工具，在一个包含2万个广泛信息寻求任务的精选数据集上联合优化主智能体和并行子智能体。大量实验表明，WideSeek-R1-4B 在 WideSearch 基准上取得了 40.0% 的条目级 F1 得分，与单智能体 DeepSeek-R1-671B 的性能相当。此外，随着并行子智能体数量的增加，WideSeek-R1-4B 表现出持续的性能提升，凸显了宽度扩展的有效性。

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

重新思考扩散模型强化学习的设计空间：关于超越损失设计的似然估计的重要性

Authors: Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04663
Pdf link: https://arxiv.org/pdf/2602.04663
Abstract Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
中文摘要 强化学习已被广泛应用于面向视觉任务（如文本到图像生成）的扩散模型和流模型。然而，这些任务依然具有挑战性，因为扩散模型的似然难以计算，这为直接应用流行的策略梯度类方法造成了障碍。现有方法主要侧重于在已经高度工程化的LLM目标之上构建新目标，并使用临时拼凑的似然估计器，却未深入研究此类估计如何影响整体算法性能。本研究通过解耦三个因素，对强化学习设计空间进行了系统分析：i）策略梯度目标，ii）似然估计器，iii）rollout采样方案。我们表明，采用仅由最终生成样本计算的、基于证据下界（ELBO）的模型似然估计器，是实现有效、高效和稳定的强化学习优化的主导因素，其影响超过了具体的策略梯度损失函数。我们使用SD 3.5 Medium在多个奖励基准上验证了这些发现，并在所有任务中观察到一致的趋势。我们的方法在90个GPU小时内将GenEval得分从0.24提升到0.95，效率是FlowGRPO的$4.6\times$，是SOTA方法DiffusionNFT的$2\times$，且没有出现奖励黑客（reward hacking）现象。

Multi-Source Retrieval and Reasoning for Legal Sentencing Prediction

多源检索与法律量刑预测推理

Authors: Junjie Chen, Haitao Li, Qilei Zhang, Zhenghua Li, Ya Zhang, Quan Zhou, Cheng Luo, Yiqun Liu, Dongsheng Guo, Qingyao Ai
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.04690
Pdf link: https://arxiv.org/pdf/2602.04690
Abstract Legal judgment prediction (LJP) aims to predict judicial outcomes from case facts and typically includes law article, charge, and sentencing prediction. While recent methods perform well on the first two subtasks, legal sentencing prediction (LSP) remains difficult due to its need for fine-grained objective knowledge and flexible subjective reasoning. To address these limitations, we propose $MSR^2$, a framework that integrates multi-source retrieval and reasoning in LLMs with reinforcement learning. $MSR^2$ enables LLMs to perform multi-source retrieval based on reasoning needs and applies a process-level reward to guide intermediate subjective reasoning steps. Experiments on two real-world datasets show that $MSR^2$ improves both accuracy and interpretability in LSP, providing a promising step toward practical legal AI. Our code is available at this https URL.
中文摘要 法律判决预测（LJP）旨在根据案件事实预测司法结果，通常包括法条预测、罪名预测和量刑预测。虽然近期方法在前两个子任务上表现良好，但法律量刑预测（LSP）仍然困难，因为它需要细粒度的客观知识和灵活的主观推理。为解决这些局限，我们提出了$MSR^2$框架，借助强化学习将多源检索与推理整合到LLM中。$MSR^2$使LLM能够根据推理需求执行多源检索，并应用过程级奖励来指导中间的主观推理步骤。在两个真实数据集上的实验表明，$MSR^2$同时提高了LSP的准确性和可解释性，向实用的法律人工智能迈出了有希望的一步。我们的代码可在此 https URL 获取。

ERNIE 5.0 Technical Report

ERNIE 5.0 技术报告

Authors: Haifeng Wang, Hua Wu, Tian Wu, Yu Sun, Jing Liu, Dianhai Yu, Yanjun Ma, Jingzhou He, Zhongjun He, Dou Hong, Qiwen Liu, Shuohuan Wang, Junyuan Shang, Zhenyu Zhang, Yuchen Ding, Jinle Zeng, Jiabin Yang, Liang Shen, Ruibiao Chen, Weichong Yin, Siyu Ding, Dai Dai, Shikun Feng, Siqi Bao, Bolei He, Yan Chen, Zhenyu Jiao, Ruiqing Zhang, Zeyu Chen, Qingqing Dang, Kaipeng Deng, Jiajun Jiang, Enlei Gong, Guoxia Wang, Yanlin Sha, Yi Liu, Yehan Zheng, Weijian Xu, Jiaxiang Liu, Zengfeng Zeng, Yingqi Qu, Zhongli Li, Zhengkun Zhang, Xiyang Wang, Zixiang Xu, Xinchao Xu, Zhengjie Huang, Dong Wang, Bingjin Chen, Yue Chang, Xing Yuan, Shiwei Huang, Qiao Zhao, Xinzhe Ding, Shuangshuang Qiao, Baoshan Yang, Bihong Tang, Bin Li, Bingquan Wang, Binhan Tang, Binxiong Zheng, Bo Cui, Bo Ke, Bo Zhang, Bowen Zhang, Boyan Zhang, Boyang Liu, Caiji Zhang, Can Li, Chang Xu, Chao Pang, Chao Zhang, Chaoyi Yuan, Chen Chen, Cheng Cui, Chenlin Yin, Chun Gan, Chunguang Chai, Chuyu Fang, Cuiyun Han, Dan Zhang, Danlei Feng, Danxiang Zhu, Dong Sun, Dongbo Li, Dongdong Li, Dongdong Liu, Dongxue Liu, Fan Ding, Fan Hu, Fan Li, Fan Mo, Feisheng Wu, Fengwei Liu, Gangqiang Hu, Gaofeng Lu, Gaopeng Yong, Gexiao Tian, Guan Wang, Guangchen Ni
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04705
Pdf link: https://arxiv.org/pdf/2602.04705
Abstract In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
中文摘要 在本报告中，我们介绍了ERNIE 5.0，一种原生自回归基础模型，旨在实现文本、图像、视频和音频的统一多模态理解与生成。所有模态均在统一的“下一组词元预测”（next-group-of-tokens prediction）目标下从零开始训练，模型基于超稀疏的专家混合（MoE）架构和模态无关的专家路由。为应对在多样资源限制下大规模部署的实际挑战，ERNIE 5.0采用了一种新型弹性训练范式。在一次预训练运行中，模型学习一族具有不同深度、专家容量和路由稀疏度的子模型，从而在内存或时间受限场景中灵活权衡性能、模型规模和推理延迟。此外，我们系统地解决了将强化学习扩展到统一基础模型的挑战，从而保证在超稀疏MoE架构和多样的多模态设置下后训练的高效与稳定。大量实验表明，ERNIE 5.0在多种模态上实现了强劲且均衡的性能。据我们所知，在公开披露的模型中，ERNIE 5.0是首个支持多模态理解和生成的万亿参数统一自回归模型的生产级实现。为促进进一步研究，我们展示了统一模型中模态无关专家路由的详细可视化，并对弹性训练进行了全面的实证分析，旨在为社区提供深刻见解。

Rationality Measurement and Theory for Reinforcement Learning Agents

强化学习智能体的理性度量与理论

Authors: Kejiang Qian, Amos Storkey, Fengxiang He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04737
Pdf link: https://arxiv.org/pdf/2602.04737
Abstract This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the $1$-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at this https URL.
中文摘要 本文为强化学习智能体提出了一套理性度量及相关理论，这一性质日益重要却鲜少被探讨。如果部署中的一个动作沿最陡方向最大化隐藏的真实价值函数，我们就将其定义为完全理性的。策略的动作与其理性对应动作之间的期望价值差异，沿部署轨迹累计后，被定义为期望理性风险；训练中也定义了相应的经验平均版本。二者之差称为理性风险差距，可分解为（1）由训练与部署之间的环境偏移引起的外在成分，以及（2）由算法在动态环境中的泛化能力所致的内在成分。它们的上界分别为：（1）训练与部署中的转移核及初始状态分布之间的$1$-Wasserstein距离，以及（2）价值函数类的经验Rademacher复杂度。我们的理论就正则化手段（包括层归一化、$\ell_2$正则化和权重归一化）和域随机化带来的益处，以及环境偏移带来的危害提出了假设。实验结果与这些假设完全一致。代码可在此 https URL 获取。

When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

当沉默是金：大型语言模型能否学会在时序问答及更广泛场景中选择弃答？

Authors: Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04755
Pdf link: https://arxiv.org/pdf/2602.04755
Abstract Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
中文摘要 大型语言模型（LLM）很少承认不确定性，往往给出流畅但具有误导性的回答，而不是选择弃答（即拒绝回答）。这一弱点在时序问答中同样明显，模型经常忽视时间敏感的证据，并混淆不同时间段的事实。本文首次对训练LLM在时序问答推理中具备弃答能力进行了实证研究。校准等现有方法在捕捉复杂推理中的不确定性时可能并不可靠。我们转而将弃答视为一种可教授的技能，并提出一条将思维链（CoT）监督与由弃答感知奖励引导的强化学习（RL）相结合的流程。我们的目标是系统分析不同信息类型和训练技术如何影响LLM中带弃答行为的时序推理。通过对多种方法的大量实验，我们发现强化学习在推理上带来了显著的实证提升：以Qwen2.5-1.5B-Instruct初始化的模型在TimeQA-Easy和Hard上的精确匹配（Exact Match）分别比GPT-4o高出$3.46\%$和$5.80\%$。此外，在不可回答的问题上，它的真阳性率比纯监督微调（SFT）变体提高了$20\%$。除性能之外，我们的分析显示SFT会引发过度自信并损害可靠性，而强化学习虽提升了预测准确性，却存在类似风险。最后，通过比较隐式推理线索（如原始上下文、时间子上下文、知识图谱）与显式CoT监督，我们发现隐式信息对带弃答的推理帮助有限。我们的研究为如何联合优化弃答与推理提供了新的见解，为构建更可靠的LLM奠定了基础。

Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

皮肤标记：统一自回归绑定的学习式紧凑表示

Authors: Jia-peng Zhang, Cheng-Feng Pu, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04805
Pdf link: https://arxiv.org/pdf/2602.04805
Abstract The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, learning the complicated dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation leads to a 98%-133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, enhances bone prediction by 17%-22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.
中文摘要 生成式3D模型的快速普及在动画流程中造成了一个关键瓶颈：绑定（rigging）。现有的自动化方法从根本上受限于其处理蒙皮的方式，即将蒙皮视为一个不适定的高维回归任务，优化效率低，且通常与骨架生成解耦。我们认为这是一个表示问题，并提出SkinTokens：一种学习得到的、紧凑且离散的蒙皮权重表示。通过利用FSQ-CVAE捕捉蒙皮的内在稀疏性，我们将该任务从连续回归重新表述为更易处理的词元序列预测问题。这种表示使TokenRig成为可能，它是一个统一的自回归框架，将整个绑定建模为由骨骼参数和SkinTokens组成的单一序列，从而学习骨架与蒙皮变形之间的复杂依赖关系。该统一模型随后可以接入强化学习阶段，通过定制的几何和语义奖励提升对复杂的分布外资产的泛化能力。定量结果显示，SkinTokens表示相比最先进方法将蒙皮精度提升了98%-133%，而经过强化学习优化的完整TokenRig框架将骨骼预测提升了17%-22%。我们的工作提出了一种统一的生成式绑定方法，具有更高的保真度和鲁棒性，为3D内容创作中长期存在的挑战提供了可扩展的解决方案。
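
The "FSQ" in FSQ-CVAE refers to finite scalar quantization, which turns a continuous latent vector into a discrete token by bounding and rounding each coordinate. The sketch below shows only that rounding step, as commonly described, and is not the authors' model: the function names are hypothetical, only odd level counts are handled, and the example level sizes are arbitrary.

```python
import math


def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values
    centered on zero (odd level counts only, for simplicity)."""
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes


def fsq_token(codes, levels):
    """Combine per-coordinate codes into one token id (mixed-radix encoding)."""
    token = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        token = token * num_levels + (code + half)
    return token
```

With levels `[3, 5, 5]` the implicit codebook has 3 × 5 × 5 = 75 entries, so every latent vector maps to a token id between 0 and 74 without any learned codebook lookup.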

Evolving Afferent Architectures: Biologically-inspired Models for Damage-Avoidance Learning

进化的传入结构：基于生物的损伤避免学习模型

Authors: Wolfgang Maass, Sabine Janzen, Prajvi Saxena, Sach Mukherjee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04807
Pdf link: https://arxiv.org/pdf/2602.04807
Abstract We introduce Afferent Learning, a framework that produces Computational Afferent Traces (CATs) as adaptive, internal risk signals for damage-avoidance learning. Inspired by biological systems, the framework uses a two-level architecture: evolutionary optimization (outer loop) discovers afferent sensing architectures that enable effective policy learning, while reinforcement learning (inner loop) trains damage-avoidance policies using these signals. This formalizes afferent sensing as providing an inductive bias for efficient learning: architectures are selected based on their ability to enable effective learning (rather than directly minimizing damage). We provide theoretical convergence guarantees under smoothness and bounded-noise assumptions. We illustrate the general approach in the challenging context of biomechanical digital twins operating over long time horizons (multiple decades of the life-course). Here, we find that CAT-based evolved architectures achieve significantly higher efficiency and better age-robustness than hand-designed baselines, enabling policies that exhibit age-dependent behavioral adaptation (23% reduction in high-risk actions). Ablation studies validate CAT signals, evolution, and predictive discrepancy as essential. We release code and data for reproducibility.
中文摘要 我们提出传入学习（Afferent Learning），这是一个生成计算传入迹（Computational Afferent Traces，CATs）的框架，CATs作为自适应的内部风险信号用于损伤避免学习。该框架受生物系统启发，采用两层架构：进化优化（外循环）发现能够支持有效策略学习的传入感知架构，而强化学习（内循环）则利用这些信号训练损伤避免策略。这将传入感知形式化为一种促进高效学习的归纳偏置：架构的选择依据是其促成有效学习的能力（而非直接最小化损伤）。我们在平滑性和有界噪声假设下给出了理论收敛保证。我们在一个具有挑战性的场景中展示了这一通用方法：在长时间跨度（生命历程中的数十年）上运行的生物力学数字孪生。在该场景中，我们发现基于CAT的进化架构相比手工设计的基线实现了显著更高的效率和更好的年龄鲁棒性，使策略能够表现出随年龄变化的行为适应（高风险动作减少23%）。消融研究验证了CAT信号、进化和预测差异都是必不可少的。我们发布了代码和数据以保证可复现性。

Joint Sleep Mode Activation and Load Balancing with Dynamic Cell Load: A Combinatorial Bandit Approach

动态小区负载下的睡眠模式激活与负载均衡联合优化：一种组合赌博机方法

Authors: Wajahat Bashir Gilkar, Gourab Ghatak
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2602.04808
Pdf link: https://arxiv.org/pdf/2602.04808
Abstract We propose a combinatorial bandit formulation to opportunistically trigger sleep modes in gNode-B (gNB) small cells (SCs), followed by a cell range expansion (CRE)-based load balancing procedure. This is implemented by ensuring that the fifth generation (5G) quality of service identifier (5QI)-requirements of user equipments (UEs) are maintained. The key challenge is the fact that while deactivating a given SC gNB reduces its own consumption, it may increase the load on neighboring gNBs and the macro gNB (coverage cell), impacting the overall energy efficiency. This phenomenon is accurately characterized by modeling the dynamic cell load that jointly takes into account the location of the UEs, their relative locations to all the SCs, and their data demands. We experimentally show that the proposed combinatorial upper confidence bound (CUCB) followed by the load balancer outperforms not only the naive strategies like arbitrarily keeping all the SCs on, but also other state-of-the-art reinforcement learning solutions. The proposed algorithm can be implemented as open-radio access network (O-RAN) near-real-time (NRT) RAN intelligent controller (RIC) xApps.
中文摘要 我们提出了一种组合赌博机（combinatorial bandit）建模方法，用于机会性地触发gNode-B（gNB）小基站（SC）的睡眠模式，随后执行基于小区范围扩展（CRE）的负载均衡过程。该方法在实施时确保用户设备（UE）的第五代（5G）服务质量标识符（5QI）需求得到满足。关键挑战在于，虽然关闭某个SC gNB能降低其自身能耗，但可能会增加邻近gNB和宏gNB（覆盖小区）的负载，从而影响整体能效。我们通过对动态小区负载建模来准确刻画这一现象，该模型综合考虑UE的位置、它们相对于所有SC的位置以及数据需求。实验表明，所提出的组合置信上界（CUCB）算法配合负载均衡器，不仅优于诸如随意保持所有SC开启之类的朴素策略，还优于其他最先进的强化学习方案。该算法可实现为开放无线接入网（O-RAN）近实时（NRT）RAN智能控制器（RIC）的xApp。
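
The combinatorial upper confidence bound (CUCB) loop named in the abstract can be sketched generically. This is the textbook pattern rather than the paper's algorithm: the top-k oracle, the Bernoulli rewards, and the arm means are hypothetical stand-ins, and the energy model, the 5QI constraints, and the load balancer are not modeled.

```python
import math
import random


def cucb_indices(means, counts, t):
    """Optimistic index per base arm; arms never played get +infinity."""
    return [
        m + math.sqrt(3.0 * math.log(t) / (2.0 * n)) if n > 0 else float("inf")
        for m, n in zip(means, counts)
    ]


def top_k_oracle(indices, k):
    """Illustrative oracle: pick the k base arms with the largest index."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return order[:k]


def run_cucb(true_means, k, rounds, seed=0):
    """Play `rounds` rounds, observing one Bernoulli reward per chosen arm."""
    rng = random.Random(seed)
    means = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for t in range(1, rounds + 1):
        chosen = top_k_oracle(cucb_indices(means, counts, t), k)
        for i in chosen:
            reward = 1.0 if rng.random() < true_means[i] else 0.0
            counts[i] += 1
            means[i] += (reward - means[i]) / counts[i]  # running average
    return means, counts
```

In the setting of the abstract, a base arm would presumably correspond to a small cell and the oracle would choose which cells to put to sleep; that mapping is an assumption here.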

Beyond Rewards in Reinforcement Learning for Cyber Defence

超越奖励：面向网络防御的强化学习

Authors: Elizabeth Bates, Chris Hicks, Vasilios Mavroudis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04809
Pdf link: https://arxiv.org/pdf/2602.04809
Abstract Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
中文摘要 近年来，人们对利用深度强化学习训练、用于防御计算机网络的自主网络防御智能体的兴趣激增。这些智能体通常在cyber gym（网络攻防训练环境）中训练，使用密集且高度工程化的奖励函数，这些函数针对一系列（不）理想的状态和代价高昂的动作组合了许多惩罚与激励。密集奖励有助于缓解探索复杂环境的困难，但有可能使智能体偏向次优且可能风险更高的解，这在复杂的网络环境中是一个关键问题。我们使用多种稀疏和密集奖励函数、两个成熟的cyber gym、一系列网络规模，以及策略梯度和基于价值的强化学习算法，全面评估了奖励函数结构对学习过程和策略行为特征的影响。我们的评估借助一种新的真值（ground truth）评估方法，能够直接比较不同的奖励函数，揭示了网络环境中奖励、动作空间与次优策略风险之间微妙的相互关系。结果表明，稀疏奖励只要与目标一致且能被频繁获得，就能同时带来更高的训练可靠性和更有效、策略风险更低的网络防御智能体，这是其独有的优势。令人惊讶的是，在没有基于奖励的显式数值惩罚的情况下，稀疏奖励还能产生更符合网络防御者目标、并且节制使用高代价防御动作的策略。

CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation

CRoSS：一套持续机器人模拟套件，支持高任务多样性和真实物理模拟的可扩展强化学习

Authors: Yannick Denker, Alexander Gepperth
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04868
Pdf link: https://arxiv.org/pdf/2602.04868
Abstract Continual reinforcement learning (CRL) requires agents to learn from a sequence of tasks without forgetting previously acquired policies. In this work, we introduce a novel benchmark suite for CRL based on realistically simulated robots in the Gazebo simulator. Our Continual Robotic Simulation Suite (CRoSS) benchmarks rely on two robotic platforms: a two-wheeled differential-drive robot with lidar, camera and bumper sensor, and a robotic arm with seven joints. The former represents an agent in line-following and object-pushing scenarios, where variation of visual and structural parameters yields a large number of distinct tasks, whereas the latter is used in two goal-reaching scenarios with high-level Cartesian hand position control (modeled after the Continual World benchmark), and low-level control based on joint angles. For the robotic arm benchmarks, we provide additional kinematics-only variants that bypass the need for physical simulation (as long as no sensor readings are required), and which can be run two orders of magnitude faster. CRoSS is designed to be easily extensible and enables controlled studies of continual reinforcement learning in robotic settings with high physical realism, and in particular allows the use of almost arbitrary simulated sensors. To ensure reproducibility and ease of use, we provide a containerized setup (Apptainer) that runs out-of-the-box, and report performances of standard RL algorithms, including Deep Q-Networks (DQN) and policy gradient methods. This highlights the suitability as a scalable and reproducible benchmark for CRL research.
中文摘要 持续强化学习（CRL）要求智能体从一系列任务中学习，同时不遗忘先前习得的策略。在本研究中，我们基于Gazebo模拟器中逼真模拟的机器人，提出了一套新的CRL基准套件。我们的持续机器人仿真套件（CRoSS）基准依赖两个机器人平台：一个配备激光雷达、摄像头和碰撞传感器的两轮差速驱动机器人，以及一个具有七个关节的机械臂。前者作为智能体用于循线和推动物体的场景，其中视觉和结构参数的变化会产生大量不同的任务；后者则用于两种目标到达场景，分别采用高层笛卡尔手部位置控制（仿照Continual World基准）和基于关节角度的低层控制。对于机械臂基准，我们还提供了纯运动学变体，无需物理仿真（只要不需要传感器读数），运行速度可快两个数量级。CRoSS设计为易于扩展，能够在具有高度物理真实性的机器人环境中对持续强化学习进行受控研究，尤其是允许使用几乎任意的模拟传感器。为确保可复现性和易用性，我们提供了开箱即用的容器化环境（Apptainer），并报告了标准强化学习算法的性能，包括深度Q网络（DQN）和策略梯度方法。这凸显了其作为可扩展且可复现的CRL研究基准的适用性。

Rethinking the Trust Region in LLM Reinforcement Learning

重新思考LLM强化学习中的信任区域

Authors: Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.04879
Pdf link: https://arxiv.org/pdf/2602.04879
Abstract Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
中文摘要 强化学习（RL）已成为微调大型语言模型（LLM）的基石，其中近端策略优化（PPO）是事实上的标准算法。尽管PPO被普遍使用，我们认为其核心的比率裁剪机制在结构上并不适合LLM固有的大词表。PPO基于所采样词元的概率比来约束策略更新，而该比率只是真实策略散度的一个带噪声的单样本蒙特卡洛估计。这造成了次优的学习动态：对低概率词元的更新被过度惩罚，而高概率词元上可能造成灾难性后果的偏移却约束不足，导致训练低效和不稳定。为此，我们提出了散度近端策略优化（DPPO），用基于策略散度直接估计（例如全变差或KL散度）的更有原则的约束取代启发式裁剪。为避免巨大的内存占用，我们引入了高效的Binary和Top-K近似，以可忽略的开销捕捉散度的主要部分。大量实证评估表明，DPPO在训练稳定性和效率上优于现有方法，为基于强化学习的LLM微调提供了更稳健的基础。
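
The contrast between a sampled-token probability ratio and an actual divergence can be checked numerically. The sketch below compares the ratio with the total variation distance and with two cheap coarsenings; the "binary" and "top-k" helpers are one plausible reading of the approximations named in the abstract, not the authors' definitions, and the three-token distributions are made up for illustration.

```python
def token_ratio(old, new, token):
    """Importance ratio of the sampled token, as used by ratio clipping."""
    return new[token] / old[token]


def total_variation(old, new):
    """Exact total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(old, new))


def binary_tv(old, new, token):
    """TV after collapsing the vocabulary to {sampled token, everything else}."""
    return abs(old[token] - new[token])


def top_k_tv(old, new, k):
    """TV over the k most likely old tokens plus one bucket for the rest."""
    top = sorted(range(len(old)), key=lambda i: old[i], reverse=True)[:k]
    rest_old = 1.0 - sum(old[i] for i in top)
    rest_new = 1.0 - sum(new[i] for i in top)
    head = sum(abs(old[i] - new[i]) for i in top)
    return 0.5 * (head + abs(rest_old - rest_new))
```

For `old = [0.9, 0.099, 0.001]` and `new = [0.75, 0.248, 0.002]`, the rare token has a ratio of about 2 (well outside a typical clip range) although its probability moved by only 0.001, while the dominant token has a ratio of about 0.83 (inside a 0.2 clip range) although it accounts for a total variation of 0.15.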

Reinforced Attention Learning

强化注意力学习

Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04884
Pdf link: https://arxiv.org/pdf/2602.04884
Abstract Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
中文摘要 基于强化学习（RL）的后训练借助测试时扩展，显著提升了大型语言模型（LLM）的推理能力。然而，通过冗长的推理过程将这一范式扩展到多模态大型语言模型（MLLM），对感知任务的收益有限，甚至可能降低性能。我们提出了强化注意力学习（RAL），这是一种策略梯度框架，直接优化内部注意力分布，而非输出的词元序列。通过将优化重点从“生成什么”转向“关注何处”，RAL促进了有效的信息分配，并改善了在复杂多模态输入上的定位（grounding）。在多种图像和视频基准上的实验显示，该方法相较GRPO及其他基线取得了一致的提升。我们进一步提出同策略注意力蒸馏（On-Policy Attention Distillation），证明迁移潜在的注意力行为比标准知识蒸馏能带来更强的跨模态对齐。我们的结果表明，注意力策略是多模态后训练的一种有原则且通用的替代方案。

Keyword: diffusion policy

DADP: Domain Adaptive Diffusion Policy

DADP：域自适应扩散策略

Authors: Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.04037
Pdf link: https://arxiv.org/pdf/2602.04037
Abstract Learning domain adaptive policies that can generalize to unseen transition dynamics remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance and generalizability of DADP over prior methods. More visualization results are available at this https URL.
中文摘要 学习能够泛化到未见过的转移动力学的域自适应策略，仍然是基于学习的控制中的一项根本挑战。已有工作通过域表示学习捕捉域特定信息，从而实现域感知决策，并取得了显著进展。我们分析了通过动力学预测学习域表示的过程，发现选择与当前时间步相邻的上下文，会使学到的表示将静态域信息与变化的动力学属性纠缠在一起。这种混合会干扰条件策略，从而限制零样本适应。为应对这一挑战，我们提出了DADP（域自适应扩散策略），通过无监督解耦和域感知扩散注入实现稳健的适应。首先，我们提出滞后上下文动力学预测（Lagged Context Dynamical Prediction），该策略以带有历史偏移的上下文为条件进行未来状态估计；通过增大这一时间间隔，我们滤除瞬态属性，以无监督方式解耦出静态域表示。其次，我们通过对先验分布施加偏置并重新表述扩散目标，将学到的域表示直接融入生成过程。在运动和操作两类具有挑战性的基准上的大量实验证明了DADP相较以往方法的优越性能和泛化能力。更多可视化结果可在此 https URL 查看。