Arxiv Papers of Today

Generated: 2026-06-01 21:39:54 (UTC+8); arXiv release time: 2026-06-01 20:00 EDT (2026-06-02 08:00 UTC+8)

69 relevant papers today

Keyword: reinforcement learning

Delayed Repression and Emergent Instability in Adaptive Multi-Agent Systems

Authors: Igor Itkin
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2605.30392
Pdf link: https://arxiv.org/pdf/2605.30392
Abstract Regulatory institutions (from content moderation platforms to financial supervisors) observe, deliberate, and intervene only after a characteristic delay. We ask whether this processing lag alone can destabilize a multi-agent system that would otherwise remain stable, without exogenous shocks, coordination among agents, or malicious actors. We study this question in two stages. First, we analyze a delayed replicator equation in which autonomous agents receive a benefit from radical behavior but face punishment based on a lagged institutional alarm signal. We derive a closed-form critical delay threshold beyond which the unique interior equilibrium loses stability through a Hopf bifurcation, and prove via center manifold reduction that the bifurcation is supercritical (producing bounded oscillations, not explosive growth) for the entire sigmoid response-function family. Second, we embed $N=240$ agents on a network and equip them with reinforcement learning (tabular Q-learning), comparing three decision architectures in a factorial design: non-reactive agents (fixed policy), reactive agents (threshold heuristic without memory), and Q-learning agents (adaptive with cumulative value estimates). The results reveal a hierarchy opposite to the naive expectation that learning amplifies instability: non-reactive agents are immune to delay (0% runaway across all tested values), reactive agents collapse catastrophically (96% runaway by delay $\geq 8$ steps), and Q-learning agents achieve partial resilience (66% runaway at delay $= 20$). The destabilizing ingredient is reactivity to delayed signals: agents that immediately exploit low-alarm windows trigger oscillatory feedback loops. Learning buffers this through implicit punishment memory encoded in Q-values.
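The first stage described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's model: the payoff form, the logistic alarm function, and every parameter value (`benefit`, `penalty`, `steepness`, the alarm midpoint 0.5) are assumptions chosen for illustration. It integrates a delayed replicator-style equation with forward Euler and compares the late-time oscillation amplitude with and without an institutional lag.

```python
import math


def simulate(delay_steps, steps=4000, dt=0.05, benefit=1.0, penalty=2.0,
             steepness=10.0, x0=0.4):
    """Forward-Euler run of x' = x(1-x)(benefit - penalty * alarm(x(t - tau))).

    x is the share of agents behaving radically; tau = delay_steps * dt.
    All functional forms and constants here are illustrative assumptions.
    """
    history = [x0] * (delay_steps + 1)  # constant pre-history on [-tau, 0]
    for _ in range(steps):
        x = history[-1]
        x_lagged = history[-1 - delay_steps]
        # Hypothetical institutional alarm: logistic in the lagged share.
        alarm = 1.0 / (1.0 + math.exp(-steepness * (x_lagged - 0.5)))
        payoff_gap = benefit - penalty * alarm
        history.append(x + dt * x * (1.0 - x) * payoff_gap)
    return history[delay_steps:]


def late_amplitude(trajectory, tail=500):
    """Peak-to-peak range over the last `tail` samples."""
    window = trajectory[-tail:]
    return max(window) - min(window)


no_lag = late_amplitude(simulate(delay_steps=0))
long_lag = late_amplitude(simulate(delay_steps=60))  # tau = 3.0
print(f"late amplitude without delay: {no_lag:.4f}")
print(f"late amplitude with tau = 3:  {long_lag:.4f}")
```

For these assumed parameters, linearizing around x = 0.5 gives δ'(t) = -1.25 δ(t - τ), whose textbook stability boundary is τ = π/2.5 ≈ 1.26, so the two runs sit on opposite sides of it. The oscillation in the delayed run stays bounded because the x(1-x) factor saturates growth.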

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Authors: Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30461
Pdf link: https://arxiv.org/pdf/2605.30461
Abstract We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is essential for feasibility: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.
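The neighbor-to-neighbor agreement on a dual variable can be sketched with plain average consensus. This is a generic illustration, not the paper's algorithm: the ring topology, the mixing weight 0.3, and the initial multiplier values are all assumptions, and the learned state-augmented policy is not modeled.

```python
def ring_neighbors(n):
    """Each agent talks only to its two neighbors on a ring (assumed topology)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def consensus_round(multipliers, neighbors, weight=0.3):
    """Synchronous update: lambda_i += weight * sum_j (lambda_j - lambda_i)."""
    return [
        lam + weight * sum(multipliers[j] - lam for j in neighbors[i])
        for i, lam in enumerate(multipliers)
    ]


def run_consensus(multipliers, neighbors, rounds, weight=0.3):
    """Apply `rounds` consensus steps and return the resulting multipliers."""
    for _ in range(rounds):
        multipliers = consensus_round(multipliers, neighbors, weight)
    return multipliers


def disagreement(multipliers):
    """Largest gap between any two agents' multipliers."""
    return max(multipliers) - min(multipliers)


local_lambdas = [0.0, 2.0, 5.0, 1.0, 4.0, 3.0]  # hypothetical local dual estimates
graph = ring_neighbors(len(local_lambdas))
for rounds in (0, 5, 50):
    print(rounds, disagreement(run_consensus(local_lambdas, graph, rounds)))
```

Because the update is symmetric, the network average of the multipliers is preserved while the disagreement shrinks with each round, which mirrors the abstract's statement that the consensus error decreases with the number of consensus rounds.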

Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback

Authors: Egor Skopin, Evgeny Kotelnikov
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.30478
Pdf link: https://arxiv.org/pdf/2605.30478
Abstract Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations (unit-test-only rewards, static-analysis-only shaping via the Ruff linter, and a combined reward), we compare group-based policy optimization variants (GRPO and GSPO) and evaluate both functional correctness and behavioral diagnostics. In our experimental setting, RLVR improves pass@1 on the MBPP test set by up to 13 percentage points under the proposed combined reward configuration. However, we find that reward shaping can induce systematic behavioral shifts: using only static-analysis penalties may bias the policy toward shorter completions that reduce lint errors without reliably improving functional correctness. In contrast, combined rewards mitigate this degeneration and yield more stable trade-offs between correctness and style constraints. Overall, our results highlight that RLVR effectiveness for code generation is highly sensitive to reward design and optimization granularity, and that diagnostics beyond pass@1, including generation length, Ruff severity profiles, and execution error types, are useful for identifying failure modes.
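The failure mode of lint-only shaping can be made concrete with a small reward sketch. The functional forms, the per-error penalty, and `lint_weight` below are hypothetical and are not the paper's reward definitions; test counts and lint-error counts are passed in as plain numbers rather than obtained from a real test runner or from Ruff.

```python
def unit_test_reward(passed, total):
    """Fraction of unit tests passed (0.0 when there are no tests)."""
    return passed / total if total else 0.0


def lint_reward(num_lint_errors, penalty=0.1):
    """Non-positive shaping term: capped penalty per reported lint error."""
    return -min(1.0, penalty * num_lint_errors)


def combined_reward(passed, total, num_lint_errors, lint_weight=0.5):
    """Correctness reward plus a down-weighted style penalty (assumed form)."""
    return unit_test_reward(passed, total) + lint_weight * lint_reward(num_lint_errors)


# Two hypothetical completions for the same task.
short_stub = {"passed": 0, "total": 3, "num_lint_errors": 0}   # clean but wrong
working_code = {"passed": 3, "total": 3, "num_lint_errors": 4}  # correct but messy

print("lint-only:", lint_reward(0), lint_reward(4))
print("combined: ", combined_reward(**short_stub), combined_reward(**working_code))
```

Under the lint-only signal the empty stub scores at least as well as the working solution, which is the bias toward short completions the abstract describes. The combined form ranks the working solution higher.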

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

Authors: Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi
Subjects: Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.30503
Pdf link: https://arxiv.org/pdf/2605.30503
Abstract Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state-goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability remains unclear in contact-rich tasks, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.

Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

Authors: Pierre-André Noël
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2605.30553
Pdf link: https://arxiv.org/pdf/2605.30553
Abstract I present diffusion models as part of a family of machine learning techniques that withhold information from a model's input and train it to guess the withheld information. I argue that diffusion's destroying approach to withholding is more flexible than typical hand-crafted information withholding techniques, providing a rich training playground that could be advantageous in some settings, notably data-scarce ones. I then address subtle issues that may arise when porting reinforcement learning techniques to the diffusion context, and wonder how such exploration problems could be addressed in more diffusion-native ways. I do not have definitive answers, but I do point my fingers in directions I deem interesting. A tutorial follows this thesis, expanding on the destroy-then-generate perspective. A novel kind of probabilistic graphical models is introduced to facilitate the tutorial's exposition.

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

Authors: Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30576
Pdf link: https://arxiv.org/pdf/2605.30576
Abstract Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.
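The trigger rule based on rolling buffers can be sketched as follows. The abstract does not say how the adaptive threshold is computed, so the mean-plus-k-standard-deviations rule, the window size, and the warm-up length here are assumptions. The commitment-cooldown logic and the IQN backbone are not modeled.

```python
from collections import deque
import statistics


class AdaptiveThreshold:
    """Requests expert advice when uncertainty is unusually high for recent history."""

    def __init__(self, window=100, k=2.0, min_samples=10):
        self.buffer = deque(maxlen=window)  # rolling record of past uncertainties
        self.k = k
        self.min_samples = min_samples

    def should_request_advice(self, uncertainty):
        """Compare against the threshold from past values, then record the new value."""
        if len(self.buffer) >= self.min_samples:
            threshold = (statistics.fmean(self.buffer)
                         + self.k * statistics.pstdev(self.buffer))
            trigger = uncertainty > threshold
        else:
            trigger = False  # warm-up: not enough history to set a threshold
        self.buffer.append(uncertainty)
        return trigger


gate = AdaptiveThreshold(window=50, k=2.0, min_samples=10)
routine = [gate.should_request_advice(0.1 if i % 2 == 0 else 0.2) for i in range(20)]
spike = gate.should_request_advice(0.9)
print(any(routine), spike)
```

Because the threshold is recomputed from the buffer at every step, the same uncertainty value can trigger advice early in training and be ignored later, which is one way the advice can evolve with the agent's confidence.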

Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design

Authors: Sven Gutjahr, Riccardo De Santi, Luca Schaufelberger, Kjell Jorner, Andreas Krause
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30610
Pdf link: https://arxiv.org/pdf/2605.30610
Abstract Adapting generative foundation models, in particular diffusion and flow models, to optimize given reward functions (e.g., binding affinity) while satisfying constraints (e.g., molecular synthesizability) is fundamental for their adoption in real-world scientific discovery applications such as molecular design or protein engineering. While recent works have introduced scalable methods for reward-guided fine-tuning of such models via reinforcement learning and control schemes, it remains an open problem how to algorithmically trade off reward maximization and constraint satisfaction in a reliable and predictable manner. Motivated by this challenge, we first present a rigorous framework for Constrained Generative Optimization, which brings an optimization viewpoint to the introduced adaptation problem and retrieves the relevant task of constrained generation as a sub-case. Then, we introduce Constrained Flow Optimization (CFO), an algorithm that automatically and provably balances reward maximization and constraint satisfaction by reducing the original problem to sequential fine-tuning via established, scalable methods. We provide convergence guarantees for constrained generative optimization and constrained generation via CFO. Finally, we present an experimental evaluation of CFO on both synthetic yet illustrative settings and a molecular design task. Across these evaluations, CFO achieves consistent increases in reward while ensuring high constraint satisfaction, showcasing its practical utility for constrained generative optimization.

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

Authors: Faiq Shamass
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.30612
Pdf link: https://arxiv.org/pdf/2605.30612
Abstract Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input, a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. We validate ZAPS-DA with Soft Actor-Critic and a Savitzky-Golay filter in two driving simulators using paired n=150 evaluation protocols. On MetaDrive, ZAPS-DA reduces steering jitter by 14-21x and throttle jitter by 3-5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost. On a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement: reward parity (p=0.121), 8-45x steering jitter reduction, and a total task-failure rate reduced from 2.0% to 0.7%.
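The zero-phase filtered targets can be illustrated with a centered Savitzky-Golay kernel. The sketch below uses the standard 5-point quadratic coefficients (-3, 12, 17, 12, -3)/35; the window length, the edge handling, and the `jitter` metric are assumptions, since the abstract does not give the paper's filter settings. The supervised training of the decoupled actor on these targets is not shown.

```python
SG_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)  # window 5, quadratic fit


def savgol_zero_phase(actions):
    """Centered 5-point Savitzky-Golay smoothing of a recorded action sequence.

    The kernel is symmetric, so it adds no phase lag, but each output needs two
    future samples, so it can only run offline (for example on stored data).
    The first and last two samples are left unfiltered for simplicity.
    """
    smoothed = list(actions)
    for t in range(2, len(actions) - 2):
        smoothed[t] = sum(c * actions[t + k - 2] for k, c in enumerate(SG_COEFFS))
    return smoothed


def jitter(actions):
    """Mean absolute change between consecutive actions (illustrative metric)."""
    return sum(abs(b - a) for a, b in zip(actions, actions[1:])) / (len(actions) - 1)


raw = [0.5 * (-1) ** i for i in range(10)]  # worst-case alternating steering command
targets = savgol_zero_phase(raw)
print(f"jitter raw: {jitter(raw):.3f}, filtered targets: {jitter(targets):.3f}")
```

The need for future samples is why such a filter cannot be applied directly at inference time, and why the abstract describes distilling it into a feed-forward actor instead.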

Temporally Encoded Double DQN for Proactive PRB Allocation in O-RAN Enabled Industrial Networks

Authors: Elahe Delavari, Xingqi Wu, Junaid Farooq
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.30630
Pdf link: https://arxiv.org/pdf/2605.30630
Abstract Fifth-generation (5G) wireless systems are increasingly adopted in smart manufacturing to support heterogeneous industrial workloads through services such as enhanced Mobile Broadband (eMBB) and Ultra-Reliable Low-Latency Communication (URLLC). However, industrial traffic is inherently process-driven and temporally correlated, so static or reactive schedulers in the Open Radio Access Network (O-RAN) are inadequate for such non-stationary conditions, leading to sub-optimal utilization and violation of latency-reliability guarantees. This paper proposes a temporal-aware deep reinforcement learning (DRL) xApp for proactive Physical Resource Block (PRB) allocation in O-RAN-enabled industrial networks. The proposed framework integrates a long short-term memory (LSTM) encoder within a Double Deep Q-Network (DQN) to model sequential dependencies among slice-level Key Performance Indicators (KPIs), enabling predictive and stable decision-making. A continuous-time Markov chain (CTMC) traffic model is incorporated to emulate machine concurrency and process burstiness. Experimental results show that the LSTM-Double DQN improves slice satisfaction and buffer stability under moderate and heavy load, with the longest sequence window providing the strongest gains.
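For readers unfamiliar with the Double DQN part of the architecture, the bootstrap target can be sketched in a few lines. This shows only the standard Double DQN target rule; the Q-value lists stand in for network outputs, the numbers are made up, and the LSTM encoder and PRB action space from the abstract are not modeled.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Standard Double DQN target for one transition.

    The online network chooses the next action; the target network scores it.
    Both arguments are per-action Q-value lists for the next state.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def vanilla_dqn_target(reward, done, gamma, q_target_next):
    """Plain DQN target, shown for contrast: max over the target network."""
    return reward if done else reward + gamma * max(q_target_next)


q_online = [1.0, 3.0, 2.0]   # hypothetical online-network values
q_target = [10.0, 0.5, 4.0]  # hypothetical target-network values
print(double_dqn_target(1.0, False, 0.9, q_online, q_target))
print(vanilla_dqn_target(1.0, False, 0.9, q_target))
```

Decoupling action selection from action evaluation keeps one over-optimistic estimate (the 10.0 above) from dominating the target, which is the usual motivation for the double variant.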

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

Authors: Sharan Vaswani, Yifan Sun, Reza Babanezhad
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.30648
Pdf link: https://arxiv.org/pdf/2605.30648
Abstract Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
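The claim that sign GD can outpace GD on separable logistic regression can be seen in a one-dimensional toy problem. This is an illustration under strong assumptions (a single example with margin `w`, step size 1, 50 steps), not the paper's setting or proof.

```python
import math


def logistic_loss(w):
    """Loss for a single separable example with margin w: log(1 + exp(-w))."""
    return math.log1p(math.exp(-w))


def run(update, steps=50, step_size=1.0):
    """Minimize the toy loss from w = 0 using the given update rule."""
    w = 0.0
    for _ in range(steps):
        grad = -1.0 / (1.0 + math.exp(w))  # d/dw log(1 + exp(-w))
        w += update(grad, step_size)
    return logistic_loss(w)


def gd_step(grad, step_size):
    """Gradient descent: the step shrinks as the gradient vanishes."""
    return -step_size * grad


def sign_gd_step(grad, step_size):
    """Sign GD: fixed-length step in the descent direction."""
    return -step_size * math.copysign(1.0, grad)


print(f"GD loss after 50 steps:      {run(gd_step):.3e}")
print(f"sign GD loss after 50 steps: {run(sign_gd_step):.3e}")
```

On this objective the gradient decays along with the loss, so GD takes ever smaller steps, while sign GD keeps moving `w` at a constant rate and the loss falls geometrically. The curvature here is also bounded by the loss value, which is the kind of structure the abstract's assumption captures.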

Learning to Perceive the World Through Control: Empowerment-Based Representation Learning

Authors: Mahsa Bastankhah, Sophie Broderick, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30656
Pdf link: https://arxiv.org/pdf/2605.30656
Abstract In many practical reinforcement learning environments, observations are far higher-dimensional than the variables that matter for control. In this work, we ask: can we learn representations that capture only control-relevant features of the environment? We study this question through the empowerment objective, which maximizes an agent's influence over the environment and is widely used for unsupervised skill learning. We show that empowerment agents induce two distinct representations (forward and backward) that capture complementary aspects of the state, both of which are invariant to control-irrelevant features. Thus, empowerment maximization leads agents to learn an implicit, control-centric model of the world. Our analysis highlights the importance of learning representations through interaction rather than from passive datasets: interaction aimed at maximizing control is essential for learning useful invariance properties, a perspective that aligns closely with the causal learning literature.

Reinforcement Learning for Special Education: Aligning LLM Tutors to Diverse Learners through Disability-Adaptive Training

Authors: Unggi Lee, Jihoi Na, Yeil Jeong, Haeun Park, Yeonju Jang
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.30670
Pdf link: https://arxiv.org/pdf/2605.30670
Abstract Large language models are increasingly deployed as intelligent tutors, yet research on aligning them for special education remains absent. Recent work has applied reinforcement learning to LLM tutors, but these methods target a generic learner in a single domain (mathematics) and do not address the cognitive and communicative diversity of learners with disabilities. We introduce Special-R1, a framework that extends pedagogical RL to special education through two components: (1) a two-dimensional adaptive system prompt that couples a difficulty-based support level with a disability-specific teaching style across five disability profiles; and (2) a persona-aware Thinking Reward whose judge rubric is conditioned on the learner's disability profile. On a persona-augmented test set of 690 multi-turn dialogues, our full model raises persona-aware Fit from 6.75 (generic baseline) to 8.40 (+1.65) and SPED-rubric Helpfulness from 0.720 to 0.768, leading on the four-component Total (2.911, +0.064 over the runner-up) while remaining within 0.01 of the strongest variant on the out-of-domain OpenLearnLM benchmark (8.53). Ablations show that the Thinking Reward becomes effective only in combination with adaptive prompting, and that residual weakness on specific learning disability in mathematics motivates targeted multimodal extensions.

Universal Decision Learners

Authors: Sridhar Mahadevan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30694
Pdf link: https://arxiv.org/pdf/2605.30694
Abstract Many theories of decision making (planning, reinforcement learning, causal intervention, online learning, and game-theoretic equilibrium) turn local information into globally coherent behavior. This paper proposes a common categorical formulation: a Universal Decision Learner (UDL) extends a partially specified decision functor from observed contexts to new contexts by a pair of universal constructions. Left Kan extensions express rollout, aggregation, and candidate generation; right Kan extensions express consistency, constraint satisfaction, and fixed-point semantics. The central claim is not that every decision problem has the same algorithm, but that many decision formalisms instantiate the same universal problem: extend local behavioral data canonically, then characterize the globally coherent extensions. We give the abstract UDL construction, prove its universal comparison property, define Kan-invariant behavioral equivalence and minimal abstractions, and show how Bellman equations, planning recursions, causal interventions, online regret, and equilibria arise as special cases. The supplementary material develops the reinforcement-learning specialization in more detail.

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

Authors: Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.30712
Pdf link: https://arxiv.org/pdf/2605.30712
Abstract Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Authors: Stephane Hatgis-Kessell, Emma Brunskill
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30719
Pdf link: https://arxiv.org/pdf/2605.30719
Abstract We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. PromptPO underperforms standard RL baselines in MuJoCo domains. This demonstrates possible limitations of LLM-based policy optimization in settings that require fine-grained continuous control.

MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents

MosaicLeaks：公开查询深度研究代理的隐私风险

Authors: Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.30727
Pdf link: https://arxiv.org/pdf/2605.30727
Abstract Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.
中文摘要 深度研究代理越来越多地将私密本地文档与外部工具如网页检索结合起来，这带来了隐私风险：代理的外部查询可能泄露其本地环境中的敏感信息。这种风险因马赛克效应而被放大，单个查询看似无害，但整体却变得具有揭示性。我们介绍了MosaicLeaks，这是一个包含1001个多跳深度研究任务的基准工具，将私有企业文档和公共网络语料串联起来，迫使代理进行依赖本地信息的外部查询。我们用一个仅观察代理外部查询并试图在三个层面推断私人信息的对手大型语言模型来评估泄露：代理的研究意图、对特定私人问题的回答以及关于企业文档的可验证声明。我们发现，跨家族和规模的模型在三个层级上都经常出现泄漏，零样本隐私提示虽然减少但未能消除泄漏，而仅针对任务表现的强化学习则加剧了泄漏。为此，我们提出了隐私感知深度研究（PA-DR）框架，该框架结合了任务成功的情境奖励与学习中的隐私分类器，在每次查询和马赛克层级泄露上均提供密集的信用分配。用PA-DR训练Qwen3-4B-Instruct可以将准确率从48.7%提升到58.7%，并将答案和完整信息泄露率从34.0%降至9.9%。

Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models

通过扩散模型生成类图论规则以实现知识图推理

Authors: Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30747
Pdf link: https://arxiv.org/pdf/2605.30747
Abstract Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at this https URL
中文摘要 逻辑规则构成了知识图谱（KG）推理的基石，因其可解释性和建模关系模式的能力而备受重视。然而，现有的规则挖掘方法主要关注简单的链状规则，因此忽视了图状结构中编码的更丰富的关系信息，如循环和分支。这一限制因搜索空间的组合爆炸导致的计算瓶颈而进一步加剧，这对类图规则尤其具有挑战性。与此同时，尽管扩散模型等生成方法在其他领域取得成功，但由于其训练目标与学习高质量规则的目标不一致，且不可微分的 KG 规则质量指标无法直接指导模型优化，因此无法直接应用于规则挖掘。为解决这些局限性，我们提出了GRiD框架，将图样规则发现重新表述为基于目标关系的离散生成过程。GRiD采用两阶段培训策略。首先，监督预训练使GRiD能够从KG元图中采样的子图中捕获结构先验。随后，强化学习被应用于通过策略梯度优化，直接以不可微分的规则质量指标为指导，对GRiD进行微调。对六个基准数据集的实验表明，GRiD在KG完成任务中具有竞争力。消融研究证实了GRiD的效率和鲁棒性，并进一步表明图样规则在KG补全中补充链式规则。我们的代码和数据集可在此 https URL 中获取

FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

FLAG：通过潜变量增强引导实现的流策略 MaxEnt-RL

Authors: Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, H. Jin Kim, Daesol Cho
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.30749
Pdf link: https://arxiv.org/pdf/2605.30749
Abstract Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importance-weighted supervised learning, they are prone to importance weight collapse, which limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the sampling region, avoiding the weight degeneracy induced by importance sampling over the entire action space. To instantiate this insight, we introduce FLAG (Flow policy with Latent-Augmented Guidance). FLAG augments the state space with a flow latent variable and optimizes a provably consistent proxy MaxEnt-RL objective. We empirically demonstrate that FLAG enables expressive policy optimization with limited importance samples and scales to high-dimensional control tasks. Furthermore, FLAG achieves state-of-the-art performance across challenging benchmarks. Our project webpage: this https URL
中文摘要 最大熵强化学习（MaxEnt-RL）支持鲁棒探索，但实际实现通常将策略限制为简单的高斯分布。虽然近期方法通过重要性加权监督学习引入了富有表达力的生成式策略，但它们容易出现重要性权重崩溃，限制了其在高维动作空间中的可扩展性。我们的关键见解是通过局部化采样区域来缓解这一限制，避免在整个动作空间上进行重要性采样所引发的权重退化。为了实现这一见解，我们引入了 FLAG（Flow policy with Latent-Augmented Guidance，即带潜变量增强引导的流策略）。FLAG用流潜变量扩充状态空间，并优化一个可证明一致的代理MaxEnt-RL目标。我们实证证明，FLAG能够在有限的重要性样本下实现富有表达力的策略优化，并可扩展到高维控制任务。此外，FLAG在具有挑战性的基准测试中实现了最先进的性能。我们的项目网页：this https URL

Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

高效且具备不确定性意识的扩散框架，用于线下到在线强化学习

Authors: Ha Manh Bui, Metod Jazbec, Eric Nalisnick, Anqi Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30776
Pdf link: https://arxiv.org/pdf/2605.30776
Abstract Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Existing work aims to mitigate the harm of this shift by finetuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient Diffusion Uncertainty-Aware framework for offline-to-online reinforcement Learning. DUAL utilizes the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and transition model in the offline phase. DUAL also employs a Laplace approximation and distance transition-state-shift detection, thereby using uncertainty quantification to improve exploration versus exploitation in the online phase. We formally show that our actor loss with the Laplace approximation provides a proxy for a principled estimate of epistemic uncertainty. Empirically, DUAL improves the online expected return over O2O-RL baselines across multiple settings and environments.
中文摘要 离线到在线强化学习（O2O-RL）利用离线预训练的策略，尽量减少代价高昂的在线交互。尽管数据效率高，O2O-RL仍易受离线与在线分布之间偏移的影响。现有工作通过在扩散模型采样的轨迹数据上微调策略来减轻这种偏移带来的损害。受这一方向的启发，我们提出了DUAL：一个高效的、基于扩散（Diffusion）且具备不确定性感知（Uncertainty-Aware）的离线到在线强化学习（reinforcement Learning）框架。DUAL利用扩散模型的先验知识，在离线阶段蒸馏出可快速采样的扩散 actor 策略和状态转移模型。DUAL还采用拉普拉斯近似和基于距离的转移状态偏移检测，从而利用不确定性量化改善在线阶段探索与利用之间的权衡。我们从理论上证明，采用拉普拉斯近似的 actor 损失可以作为认知不确定性的一种有原则估计的代理。在实验上，DUAL在多种设置和环境中相较于O2O-RL基线提升了在线期望回报。

Learning Agent-Compatible Context Management for Long-Horizon Tasks

学习长期任务的代理兼容上下文管理

Authors: Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30785
Pdf link: https://arxiv.org/pdf/2605.30785
Abstract LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.
中文摘要 LLM代理越来越多地面临长期任务，如网页搜索和现实应用中的深度研究，积累的上下文可能导致长期上下文退化和推理失败。以往的工作通过上下文管理（agent side context control）或固定策略（如摘要）来缓解这一问题，这些方法需要训练代理自身适应——这使得闭源代理不切实际，且忽略了不同代理可能需要不同策略的事实。我们介绍了自适应上下文管理（AdaCoM），它通过灵活的修改动作和端到端强化学习，训练外部LLM管理冻结代理的上下文。在多样化的代理网络搜索和深度研究基准中，AdaCoM通过保留任务约束和进度，同时修剪陈旧内容，显著提升了性能。所学策略揭示了一个忠诚度与可靠性的权衡：原版ReAct表现较高的代理能从更高的保真度上下文保留中受益，而表现较低的代理则需要更激进的压缩才能保持在可靠的推理范围内。传输实验表明，AdaCoM在具有相似能力（以原版ReAct性能衡量）的代理间推广效果最为有效，这为代理系统提供了可重用上下文管理器的实用路径。

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning

Feat2Go：具身强化学习中的视觉特征基础价值估计

Authors: Junyang Shu, Zhiwei Lin, Bingqing Wei, Yongtao Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.30795
Pdf link: https://arxiv.org/pdf/2605.30795
Abstract Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLAOFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.
中文摘要 强化学习是一种有前景的方法，可以提升视觉-语言-动作（VLA）模型的能力，同时避免模仿学习带来的大量数据需求。然而，其在VLA模型中的有效性常受限于监督稀疏以及设计长视距操作中信息性奖励信号的困难。在本研究中，我们提出了Feat2Go，一种用于具身强化学习的细粒度价值估计框架。具体来说，Feat2Go 首先通过测量补丁级与子目标状态的相似度，并通过趋势聚类将剧集划分为语义阶段，从而从预训练的可视化世界模型中推导出一个连续进度目标。然后，我们训练一个具含价值模型，预测当前观察和任务指令中的结构性进展，并利用预测价值在策略优化过程中重塑终端奖励。该框架与现有的VLA策略强化学习流程（包括PPO和GRPO）兼容，且不依赖人工奖励工程。在 ManiSkill3 和 RoboTwin 2.0 上的大量实验表明，Feat2Go 在单臂和双手操作设置下，持续提升现有 VLA 模型的性能。更具体地说，在ManiSkill3上，Feat2Go将OpenVLAOFT的平均非发行成功率从17.5%提升至82.9%，同时保持96.9%的发行内表现。在RoboTwin 2.0平台上，Feat2Go在领域随机任务设置中平均成功率为88.8%，优于以往的强化学习方法。

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

以结构感知奖励为基础的计划者中心深度研究强化学习

Authors: Mustafa Anis Hussain, Xinle Wu, Yao Lu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30824
Pdf link: https://arxiv.org/pdf/2605.30824
Abstract Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
中文摘要 深度研究任务要求大型语言模型规划调查内容、检索证据，并综合多个研究分支的长形答案。现有的培训范式要么依赖短格式可验证的质量保证作为代理，要么优化单一的长轨迹，这使得规划和执行难以理清，也导致规划过程的信用分配较弱。我们提出了DecomposeR，一个以规划者为中心的深度研究框架，将研究计划表示为类型有向无环图（DAGs），使规划能够显式、结构化且具有回报性。我们将Qwen3-8B模型分为两个阶段进行训练：规划强化学习（RL）首先学习图结构和查询分解以改善研究规划，回答者强化学习（RL）随后学习基于所学计划的分支级执行和最终综合。通过将奖励分配给显式的规划者代币和结构化组件，而非固定的轨迹，DecomposeR 实现了更细粒度的规划优化，同时减少端到端训练的歧义。实验显示，由于规划和答题能力的提升，DecomposeR-8B在流行的长形式基准测试中相较于强的可比开放基线提升了5.1-8.0个百分点。

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

SLAT：用于高效CoT推理的分段级自适应修剪

Authors: Jian Yao, Xiongcai Luo, Ran Cheng, Kay Chen Tan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30832
Pdf link: https://arxiv.org/pdf/2605.30832
Abstract Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., overthinking), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose SLAT (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that SLAT establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by 50% relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.
中文摘要 大型推理模型的最新进展通过强化学习（RL）显著提升了思维链（CoT）能力。然而，生成的推理链常常存在结构冗余（即“过度思考”），带来高昂的计算开销，却无法提升答案正确性。现有的缓解策略通常依赖于对所有词元一视同仁的长度惩罚，这种惩罚只是粗略地、不区分片段地促使输出变短，可能在抑制冗余的同时无意中抑制有用的推理。为此，我们证明低效率集中在概率高但边际效用低的片段上。我们在正确性与长度权衡的目标下推导出片段次优性的理论刻画，并提出了 SLAT（Segment-Level Adaptive Trimming，片段级自适应修剪），一个基于该准则选择性抑制冗余片段的强化学习框架。标准基准测试的实证结果表明，SLAT 建立了更优的准确率与效率帕累托前沿，相较于未压缩基线，推理长度减少了 50%，同时保持了有竞争力的准确率。总体来看，我们的结果表明，有理论依据、片段感知的修剪是大型语言模型中高效CoT推理的一个有前景的方向。

A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models

离线强化学习与逆强化学习（IRL）讲义，第二部分：逆强化学习基础与动态离散选择模型

Authors: Enoch Hyunwook Kang
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM)
Arxiv link: https://arxiv.org/abs/2605.30843
Pdf link: https://arxiv.org/pdf/2605.30843
Abstract In the forward reinforcement-learning problem, the reward is fixed and known; the learner is asked to find a good policy or value function. Here we turn the question around. Given offline data generated by an expert, can we recover the reward the expert was optimizing? This is the inverse reinforcement learning problem, and remarkably, two communities, structural econometricians studying dynamic discrete choice (DDC) and machine learners studying entropy-regularized IRL, have been working on exactly the same probabilistic model under different names. We begin by proving their equivalence. We then develop the classical identification result of Magnac and Thesmar and the classical computational paradigms that grew out of it: Rust's nested fixed-point algorithm, the conditional-choice-probability approach of Hotz and Miller, and the two temporal-difference approaches of Adusumilli and Eckardt: linear semi-gradient TD and approximate value iteration. Each route has its limits: dimensionality, transition-kernel estimation, the deadly triad, or projected fixed-point bias. We then walk through the modern ML/IRL strand: adversarial IRL, occupancy matching, IQ-Learn, and offline ML-IRL, deriving each method's actual objective and stating precisely what it does and does not identify. We close with the empirical-risk-minimization framework of Kang et al., which yields a gradient-based estimator for offline IRL/DDC.
中文摘要 在前向强化学习问题中，奖励是固定且已知的；学习者需要找到一个好的策略或价值函数。这里我们把问题反过来：给定专家生成的离线数据，我们能否恢复专家所优化的奖励？这就是逆强化学习（IRL）问题。值得注意的是，两个群体（研究动态离散选择（DDC）的结构计量经济学家，以及研究熵正则化IRL的机器学习研究者）一直在以不同的名称研究完全相同的概率模型。我们首先证明它们的等价性。随后，我们阐述Magnac和Thesmar的经典识别结果及其衍生的经典计算范式：Rust的嵌套不动点算法、Hotz和Miller的条件选择概率方法，以及Adusumilli和Eckardt的两种时间差分方法：线性半梯度TD和近似值迭代。每种路径都有其限制：维度、转移核估计、致命三要素（deadly triad）或投影不动点偏差。接着我们梳理现代ML/IRL分支：对抗式IRL、占用度匹配、IQ-Learn和离线ML-IRL，推导每种方法的实际目标，并准确说明它能识别和不能识别的内容。最后我们介绍Kang等人的经验风险最小化框架，该框架给出了一个基于梯度的离线IRL/DDC估计器。

Safe Equilibrium Policy Optimization for Strategic Agent Policies

战略代理政策的安全均衡优化

Authors: Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30854
Pdf link: https://arxiv.org/pdf/2605.30854
Abstract Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -- exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs -- are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (SEPO), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement SEPO as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma 4 E4B-it and Qwen 3.5-4B after supervised fine-tuning (SFT). Evaluated across five strategic domains (Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker), SEPO achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over-cooperative behavior introduced by SFT. In negotiation, SEPO achieves a positive-safety outcome and the only positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per-rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control-variate property), producing zero gradient. To support further research in strategic safety for agents, we release our code (this https URL) and SFT datasets.
中文摘要 经过强化学习微调的语言模型通常只优化任务奖励，忽略多智能体的策略结构。由于这些智能体以自然语言的博弈状态描述为条件，并通过自由生成输出动作，策略性失败模式（利用较弱的对手、在有害均衡上协调以及将成本外部化）与语言接口本身密不可分。我们提出了安全均衡策略优化（SEPO），这是一种训练目标，在期望收益之外加入针对可利用性、合谋风险和外部性成本的显式惩罚。我们将 SEPO 实现为群体相对策略优化（GRPO）的奖励信号，并在监督微调（SFT）之后应用于 Gemma 4 E4B-it 和 Qwen 3.5-4B。评估涵盖五个策略领域：重复囚徒困境、重复拍卖、两种谈判变体和库恩扑克（Kuhn Poker）。SEPO 在库恩扑克中对两个模型均实现了零利用池优势，在四个领域的安全性上优于基础模型，并纠正了SFT引入的过度合作行为。在谈判中，SEPO 取得了正的安全性结果，并且是所有谈判配置中唯一取得正的归一化相对优势的配置。消融实验证实逐条 rollout 计算可利用性是必要的：共享的常数惩罚会在GRPO优势归一化中被抵消（常数控制变量性质），从而产生零梯度。为了支持智能体策略安全的进一步研究，我们发布了代码（this https URL）和SFT数据集。
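
The ablation result in the abstract above (a shared constant penalty cancels in GRPO advantage normalization) can be checked numerically. The following is a minimal sketch, assuming the commonly described GRPO normalization of subtracting the group mean and dividing by the group standard deviation; the abstract does not specify the authors' exact implementation, and the reward and penalty values are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical task rewards for four rollouts sampled from the same prompt.
task_reward = np.array([1.0, 0.0, 0.5, 0.2])
baseline = grpo_advantages(task_reward)

# Shared constant penalty: the same value is subtracted from every rollout.
shared = grpo_advantages(task_reward - 0.3)

# Per-rollout penalty: hypothetical exploitability scores that differ by rollout.
per_rollout = grpo_advantages(task_reward - np.array([0.0, 0.6, 0.1, 0.3]))

print("shared penalty leaves advantages unchanged:", np.allclose(shared, baseline))
print("per-rollout penalty changes advantages:", not np.allclose(per_rollout, baseline))
```

Because the constant shifts the group mean by the same amount and leaves the standard deviation untouched, it contributes nothing to the advantages, which is consistent with the abstract's claim that only a per-rollout penalty can carry a learning signal.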

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

DARTS：分布感知的主动 rollout 轨迹塑造，用于加速LLM强化学习

Authors: Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30859
Pdf link: https://arxiv.org/pdf/2605.30859
Abstract Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77x without compromising model performance.
中文摘要 强化学习（RL）已成为提升模型能力的关键手段，但由于响应长度分布长，部署效率瓶颈也存在。现有工作通过提示层面尾部调度来减轻长尾的影响，我们关注的是低效的根源：分布本身。具体来说，我们以更细粒度刻画长尾分布，识别提示内长尾，并揭示它们常常伴随着无效的冗长。为此，我们提出了一种新的主动分布塑造范式，旨在将推广分布朝向简洁和确定性地塑造，从而从根本上解决尾部引发的开销。我们通过分布感知轨迹采样机制实现这一点，该机制为每个提示从冗余探索空间中选择轨迹，并采用自适应冗余分配方案，以最大化塑形效果和系统效率。实验显示，在最先进系统上，可显著加速最多1.77倍，且不影响模型性能。

Distilling LLM Feedback for Lean Theorem Proving

面向 Lean 定理证明的 LLM 反馈蒸馏

Authors: Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30861
Pdf link: https://arxiv.org/pdf/2605.30861
Abstract Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
中文摘要 推理模型的后期训练通常结合监督微调与可验证奖励的强化学习，最常见的是GRPO。然而，该算法存在奖励稀少、探索有限和模式崩溃的问题。基于近期关于自我蒸馏的研究，我们提出了反馈蒸馏（Feedback Distillation）训练方法，即模型在令牌层面训练以匹配其自身分布，条件是基于语言模型产生的特权反馈。反馈蒸馏提供代币级监督，并可注入外部知识。评估我们用于Lean4定理证明的方法，发现反馈蒸馏比GRPO保持了更高的生成轨迹多样性，从而实现更高的策略熵和更好的pass@k缩放性。这两种方法是互补的：从反馈蒸馏检查点初始化GRPO的表现优于单独使用。总的来说，我们的结果为复杂推理的训练后续改进提供了有希望的方向。
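
For readers unfamiliar with the pass@k metric mentioned in the abstract above, here is a minimal sketch of the unbiased estimator widely used in code-generation and theorem-proving evaluation. The abstract does not say which estimator the authors use, so treating it as this one is an assumption, and the sample counts below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the n attempts contains no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 proof attempts per theorem, 3 accepted by the checker.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

A policy that keeps more diverse trajectories tends to gain more from larger k under this metric, which is one way to read the abstract's "better pass@k scaling".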

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

GUI-C$^2$：通过难度感知强化学习实现粗细GUI基础化

Authors: Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.30884
Pdf link: https://arxiv.org/pdf/2605.30884
Abstract Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.
中文摘要 现有的智能强化学习方法在两个层面上存在局限性。在数据层面，当前方法通常对所有训练样本一视同仁，尽管它们与基线模型的训练值会因难度而异。忽视这一点可能会大幅降低训练效率，甚至导致崩溃。在战略层面，现有框架难以在裁剪较大区域以获得足够上下文与减少冗余的小区域之间取得权衡，这种张力是工具增强接地代理固有的。此外，过于复杂的决策对于小参数模型来说很困难，并且显著增加推断时间。为解决这些问题，在数据层面，我们提出了GUI-D，一种数据挖掘和难度评分流程，通过适当测试识别适合训练的样本，并分配难度分数以指导后续训练权重。在策略层面，我们提出了GUI-C$^2$，采用区域门控的粗细精炼机制，通过模型内部不确定性信号逐步缩小视野，适性地为大型目标保留上下文，同时对小目标提升精度，并通过改进意识阶段奖励强化，确保每次精细真正推进接地。同时，我们简化了决策过程，大大减少了额外的推理时间。最后，大量实验表明我们的方法实现了最先进的性能。代码和数据将对外公开。

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

零崩溃：不连续奖励环境中策略梯度方法的失败模式

Authors: Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30896
Pdf link: https://arxiv.org/pdf/2605.30896
Abstract Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited for these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until they cross a specific threshold, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries. We identify a fundamental failure mode in this setting termed "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions. Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We propose practical mitigation strategies involving initialization and architectural choices to improve stability. Finally, we introduce a formal RL framework for auction environments highlighting their unique structural properties.
中文摘要 重复拍卖中的竞拍是强化学习（RL）的核心挑战，将持续控制与数字广告的战略复杂性相结合。虽然政策梯度和基于价值的方法似乎非常适合这些环境，但它们常常难以应对拍卖奖励景观的不连续性、“悬崖式”特性。例如，在首价拍卖中，竞标者在达到某个阈值前将获得零奖励，超过此后，随着出价的提高，奖励会减少。这形成了一片由平坦、零奖励区域组成的景观，这些区域之间被明显的边界分隔开来。在此设定中，我们识别出一种基本失效模式，称为“零崩溃”。我们表明，随机探索和基于梯度的更新可能导致策略超出最优高回报区域，进入平坦的零奖励状态。一旦达到，缺乏有用的梯度信号，回收样本效率极低，有效地捕获了该探剂。我们发现actor-critic方法尤其容易受影响，因为有偏见的值估计会加速这种向不稳定区域的移动。我们的贡献包括：（1）机械性解释不连续奖励如何导致信号消失和零崩溃;（2）政策随机性与步长之间相互作用的分析;以及（3）对REINFORCE和actor-critic变体中这一现象的实证演示。我们提出了涉及初始化和架构选择的实用缓解策略，以提升稳定性。最后，我们介绍了一个正式的强化学习框架，突出拍卖环境的独特结构特性。
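
The "cliff-like" reward landscape described in the abstract above can be illustrated with a single-round first-price payoff. This is a minimal sketch under simplifying assumptions: one fixed competing bid and a known private value, both hypothetical, rather than the repeated-auction environment the paper studies.

```python
def first_price_reward(bid: float, value: float = 1.0, competing_bid: float = 0.6) -> float:
    """Single-round first-price payoff: win and pay `bid` only if it beats `competing_bid`."""
    return value - bid if bid > competing_bid else 0.0

def central_difference(f, x: float, h: float = 1e-3) -> float:
    """Numerical slope of f at x, standing in for a gradient signal."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Below the threshold the reward and its slope are both zero; above it the
# reward is positive and falls as the bid rises.
for bid in (0.2, 0.4, 0.7, 0.9):
    reward = first_price_reward(bid)
    slope = central_difference(first_price_reward, bid)
    print(f"bid={bid:.1f} reward={reward:.3f} slope={slope:.3f}")
```

In the winning region the slope pushes the bid downward toward the threshold, so a step that is too large lands in the flat region where the slope carries no information. That is the trapping behavior the abstract calls zero collapse.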

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

无最优演示者的逆向强化学习：一种可行的奖励集方法

Authors: Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30903
Pdf link: https://arxiv.org/pdf/2605.30903
Abstract Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.
中文摘要 逆强化学习（IRL）通常假设单个最优演示器进行演示，但在许多应用中，数据来自多个具有异构次优水平的不完美演示器。我们通过可行奖励集框架研究该环境中的奖励学习：对于每个示范器，我们将其宣告的次最优水平编码为线性约束，并将所得可行集合交交于示范器之间。我们的理论分析表明，随着数据的加入，联合可行集会单调收缩，我们给出了当新演示子严格收紧它时的精确刻画。我们进一步建立了两种可行奖励集的恢复保证：一种界限依赖于接近最优占用率，另一种仅要求足够覆盖且无需近似最优演示器。在实际操作方面，我们提出了解决奖励集内在奖励模糊性的策略，并为高维环境提供了带有函数近似的离线算法。在表格网格世界和大型语言模型（LLM）微调环境中的实验与理论预测一致，并展示了所提框架在基线上的有效性。

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

关注证据：多模态RLVR的证据锚定空间注意力监督

Authors: Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.30912
Pdf link: https://arxiv.org/pdf/2605.30912
Abstract Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
中文摘要 带有可验证奖励的强化学习（RLVR）通过优化最终答案得出的结果奖励，改进视觉语言模型（VLM）。然而，这种仅以结果为基础的奖励并不能告诉模型哪些图像区域能够支撑答案。对于需要视觉基础的问题，这些奖励无法区分有相关视觉证据支持的回答与语言先验捷径或幸运猜测产生的回答。我们引入了EASE（证据锚定空间注意力），它通过视觉证据过程监督来增强多模态RLVR。EASE将注释的证据区域转换为平滑的视觉标记目标，并用于引导强化学习训练中回答到图像的注意力，但仅限于高奖励轨迹。注释仅作为特权训练标签使用，而推断只需原始图像和问题。在Qwen2.5-VL-7B、Qwen3-VL-4B和Qwen3-VL-8B中，EASE在感知、幻觉、视觉数学和多模态推理基准测试中，平均得分比DAPO提升了2.5至3.1分。诊断和消融显示，EASE能更好地将视觉注意力与注释的证据区域对齐。

Automating Formal Verification with Reinforcement Learning and Recursive Inference

通过强化学习和递归推理实现形式化验证自动化

Authors: Max Tan
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.30914
Pdf link: https://arxiv.org/pdf/2605.30914
Abstract Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust $\texttt{curve25519-dalek}$ verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.
中文摘要 自动化形式验证对于大型语言模型依然具有挑战性，因为证明助手和验证感知语言的数据稀缺，正确性依赖于满足精确的机器可检查规格，而非生成合理的代码。本论文研究验证器环境如何通过可验证奖励强化学习（RLVR）和验证者引导推理时间搜索，提升验证程序和证明的LLM生成。首先，我们利用 RLVR 在 Dafny 中训练开源模型，使用组相对策略优化（GRPO）及相关变体，将生成的候选对象组装成完整程序，并用编译器和验证器的结果进行评分。在APPS衍生的Dafny数据集上的初步实验将验证奖励从2.2%提升到58.1%，但暴露出规范作弊，即模型利用薄弱的形式规范而非实现预期解决方案。在筛选出未明确且易受攻击的任务后，在精炼基准测试上进行多回合RLVR后，验证通过率从9.7%提升至31.1%。其次，我们在精益中开发了一个验证者引导的推理支架，将证明生成视为对分解子目标、验证者反馈、诊断和修复的结构化搜索。在固定基础模型中，带校样修订器的全支架可将初始VeriCoding 试点组的通过率从直接维修中的46.2%提升至69.2%。在更大的VERINA数据集中，全任务分解加证明修订器解决了42个未解决任务中的7个。我们还介绍了Dalek-Bench，这是一个源自Rust $\texttt{curve25519-dalek}$验证项目的仓库级精益基准;初步结果仍然薄弱，表明仍需更强有力的进展评估和针对任务的工具使用政策。
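
The abstract above scores generated candidates with compiler and verifier outcomes and trains with GRPO. As a minimal sketch of the group-relative step only, the snippet below normalizes a group of rewards for one prompt into advantages using the commonly described mean/standard-deviation form. The function name, the use of population standard deviation, the `eps` value, and the binary reward (1.0 for a candidate that verifies, 0.0 otherwise) are assumptions made for illustration, not details taken from the thesis.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group into advantages.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps=1e-6 are illustrative assumptions.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four candidates: two pass the verifier, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every candidate in a group receives the same reward, all advantages are zero, so that group contributes no learning signal. This may be one reason weak specifications matter: if most candidates pass trivially, the reward stops discriminating between them.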

De-attribute to Forget for LLM Unlearning

通过去归因实现遗忘：面向LLM的遗忘学习

Authors: Xinyang Lu, Jiabao Pan, Rachael Hwee Ling Sim, See-Kiong Ng, Anthony Kum Hoe Tung, Bryan Kian Hsiang Low
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30919
Pdf link: https://arxiv.org/pdf/2605.30919
Abstract The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, which has led to growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction losses, such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address them, this paper reframes the optimization objective for LLM unlearning as one of zeroing out data attribution. In particular, we propose DareU, the first LLM unlearning framework based on data attribution rewards, which uses reinforcement learning to update the LLM by reducing the attribution score of its generated responses to the forget data owners (i.e., de-attributing). Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines by achieving effective unlearning while balancing forget quality and model utility well.
中文摘要 大型语言模型（LLM）的快速发展引发了人们对不当数据训练的担忧，这也引发了对LLM去学习兴趣的增长。许多现有的大型语言模型去学习方法依赖于优化预测损失，比如最大化遗忘集的损失，但常常面临过度遗忘和模型效用性差等关键问题。为此，本文新颖地将LLM去学习的优化目标定为将数据归因归零。特别是，我们提出了首个基于数据归因奖励的大型语言模型去学习框架，该框架通过强化学习来更新LLM，降低其生成反应的归因分数（即去归属）给遗忘数据所有者。使用LLM分类器作为高效归因近似的实证评估显示，DareU通过实现有效的去学习，在平衡遗忘质量和模型效用的同时，优于现有基线。

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

通过层级宏动作量化增强强化学习主体中的人性相似性

Authors: Usman Nizamani, M. Shaheer Luqman, Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.30928
Pdf link: https://arxiv.org/pdf/2605.30928
Abstract Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.
中文摘要 类人智能体是人工智能长期以来的目标。尽管性能强劲，大多数强化学习（RL）代理仍以奖励为驱动，且常表现出与人类不同的行为，限制了可解释性和可靠性。本研究引入了一种新型类人类强化学习框架，能够预测与人类行为密切相关的动作序列，同时最大化奖励。具体来说，我们采用分层宏动作量化方法（称为HiMAQ）将人工演示编码为宏动作，该方法由两个连续的向量量化层次组成。较低的量化层级将输入动作映射到细粒度的子动作簇，而较高的量化层级则将这些子动作簇聚合为动作簇。对D4RL基准的广泛评估表明，我们的分层方法优于非分层基线（MAQ），在保持与之前RL代理相当甚至更好的成功率的同时，实现了更佳的人类相似度评分。这些改进在与多种强化学习算法（即IQL、SAC和RLPD）的集成中具有推广性。

RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning

RDGen：通过强化学习实现高质量机器人学习的演示生成

Authors: Zijian Zhu, Menglin Zou, Zhuang Li, Yaojie Tu, Xinhai Sun
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.30957
Pdf link: https://arxiv.org/pdf/2605.30957
Abstract Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning framework for generating high-quality robot demonstrations. Rather than employing reinforcement learning solely as the final control policy, RDGen leverages trained RL policies as a structured trajectory generator. The system consists of a VLM-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage further provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.
中文摘要 视觉-语言-行动（VLA）模型已成为通用机器人控制的有前景范式。然而，它们的性能仍受制于高质量机器人轨迹数据的可用性。在当前的机器人学习实践中，这类数据主要通过人工远程操作收集，这既劳动强度高，成本高昂，也难以扩展。本文提出了RDGen，一种模拟到现实的强化学习框架，用于生成高质量的机器人演示。RDGen 不再仅将强化学习作为最终控制策略，而是将训练有素的强化学习策略作为结构化轨迹生成器。该系统由基于VLM的任务解析器组成，用于识别任务相关对象;基于Grounding的DINO对象定位器，以及从模拟传输到真实机器人的强化学习策略。成功的部署随后被收获为干净、高质量的演示，用于下游VLA训练，而模拟阶段则以极低的边际成本提供可扩展的额外轨迹来源。在选择和放置任务上的实验表明，转移的强化学习策略能够实现较高的任务成功率。与人类远程操作相比，RDGen能实现显著更平滑的轨迹，并实现更优越的下游VLA性能。这些结果表明，强化学习生成的演示可以作为机器人政策学习更可靠、更一致的监督信号。

Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance

Graph-GRPO：生成式电子商务搜索相关性的依赖感知信用分配

Authors: Jiarui Che, Yifei Chen, Zhixing Tian, Chenyang Wang, Ziguang Cheng
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.31003
Pdf link: https://arxiv.org/pdf/2605.31003
Abstract Search relevance modeling is a core task in e-commerce search systems, assessing how well a user query matches candidate products. Rather than relying on a single holistic matching signal, relevance judgment often requires structured reasoning over query understanding, product understanding, and facet-level matching. With large language models (LLMs), this process is increasingly formulated as chain-of-thought (CoT) reasoning and optimized with reinforcement learning (RL). However, existing RL methods mainly rely on outcome-level rewards and treat the entire reasoning chain as a single optimization unit. This makes it difficult to distinguish faulty reasoning steps from correct intermediate ones, leading to misaligned credit assignment. Although process-reward methods provide denser supervision, they often treat reasoning steps independently and ignore dependency-driven error propagation, making responsibility attribution difficult and limiting the optimization of structured relevance reasoning. We propose Graph-GRPO, a graph-structured extension of GRPO for multi-component relevance reasoning. Graph-GRPO constructs a relevance reasoning dependency graph, where CoT steps are modeled as nodes and their logical dependencies as edges. It propagates outcome-level rewards over the graph to derive step-level credit signals, enabling more accurate fine-grained credit assignment. We further introduce a main-loss-driven controller that adaptively adjusts edge-wise credit-propagation coefficients. Together with CoT random masking for supervised policy initialization and graph-node-based multi-head distillation, we build a trainable and deployable framework for generative relevance modeling. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that the Graph-GRPO-based framework improves relevance classification metrics and key engagement metrics.
中文摘要 搜索相关性建模是电子商务搜索系统的核心任务，评估用户查询与候选产品的匹配程度。相关性判断通常需要结构化推理，而非依赖单一的整体匹配信号，而非查询理解、产品理解和面层匹配。对于大型语言模型（LLM），这一过程越来越多地被表述为思维链（CoT）推理，并通过强化学习（RL）进行优化。然而，现有的强化学习方法主要依赖于结果层级的奖励，并将整个推理链视为单一的优化单元。这使得区分错误推理步骤和正确的中级步骤变得困难，导致信用分配错位。尽管过程-奖励方法提供了更密集的监督，但它们通常独立处理推理步骤，忽视依赖驱动的错误传播，使责任归因变得困难，限制了结构化相关性推理的优化。我们提出了Graph-GRPO，这是一种GRPO的图结构扩展，用于多元相关性推理。图-GRPO构建了一个相关性推理依赖图，其中CoT步骤被建模为节点，其逻辑依赖关系被建模为边。它通过在图中传播结果级奖励，推导出阶级信用信号，从而实现更精细的信用分配。我们还引入了一种主损耗驱动控制器，可自适应地调整边缘的信用传播系数。结合CoT随机掩蔽技术进行监督策略初始化和基于图节点的多头蒸馏，我们构建了一个可训练和可部署的生成相关性建模框架。在一家领先电商平台上进行的大量线下评估和在线A/B测试表明，基于Graph-GRPO的框架能够提升相关性分类指标和关键参与度指标。

SDM-Q: Cost-Aware Staged Decision-Making for Multi-Omics Classification with Deep Q-Learning

SDM-Q：基于深度Q学习的成本感知分阶段决策，用于多组学分类

Authors: Nan Mu, Xiaoyang Fan, Chen Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31014
Pdf link: https://arxiv.org/pdf/2605.31014
Abstract Multi-omics data provide complementary molecular characterizations of disease phenotypes and play an important role in disease diagnosis and subtype classification in precision medicine. However, acquiring complete multi-omics profiles is expensive and time-consuming, while most existing deep learning methods assume full modality availability during inference, resulting in substantial redundancy and limited practicality in clinical settings. To address this issue, we propose SDM-Q, a reinforcement learning framework for adaptive and cost-aware multi-omics classification. Specifically, multi-omics diagnosis is reformulated as a finite-horizon sequential decision problem, where the currently acquired omics modalities define the diagnostic state at each stage. An action-value function determines whether to acquire an additional modality or terminate the decision process and output the final prediction. To balance diagnostic utility and acquisition cost, the reward is defined only at the terminal stage and jointly determined by classification correctness and cumulative modality acquisition cost. A backward stage-wise optimization strategy is introduced to improve policy consistency and training stability. Experiments on four public multi-omics datasets, including ROSMAP, LGG, BRCA, and KIPAN, demonstrate that SDM-Q effectively reduces redundant modality acquisition while maintaining competitive classification performance compared with methods using complete multi-omics inputs. In the BRCA and KIPAN datasets, more than 99% and 95% of subjects, respectively, achieve accurate classification using only a single omics modality, while the average number of acquired modalities remains below two for ROSMAP and LGG. These results suggest that cost-aware sequential decision-making provides an effective paradigm for improving the efficiency of precision medicine workflows.
中文摘要 多组学数据为疾病表型提供了互补的分子表征，并在精准医疗中的疾病诊断和亚型分类中发挥重要作用。然而，获取完整的多组学谱既昂贵又耗时，而大多数现有深度学习方法在推断时假设完全可行模式，导致大量冗余和临床应用有限。为解决这一问题，我们提出了SDM-Q，一种用于自适应且成本感知的多组学分类强化学习框架。具体来说，多组学诊断被重新表述为有限视野的序列决策问题，当前获得的组学模态定义了每个阶段的诊断状态。一个动作-值函数决定是否获得额外的模态，还是终止决策过程并输出最终预测。为了平衡诊断效用和获取成本，奖励仅在终末阶段定义，并由分类正确性和累计模式获取成本共同确定。引入了一种向后阶段优化策略，以提升策略一致性和训练稳定性。在包括ROSMAP、LGG、BRCA和KIPAN等四个公开多组学数据集上的实验表明，SDM-Q在保持竞争性分类性能的同时，有效减少冗余模态采集，与使用完整多组学输入的方法相比。在BRCA和KIPAN数据集中，分别超过99%和95%的受试者仅用单一组学模态实现准确分类，而ROSMAP和LGG的平均获得模态数量仍低于两种。这些结果表明，成本意识的顺序决策为提升精准医疗工作流程效率提供了有效范式。
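
The SDM-Q abstract describes a reward that is given only at the terminal stage and combines classification correctness with cumulative acquisition cost, plus an action-value rule that chooses between acquiring another modality and stopping. The sketch below illustrates one possible shape of those two pieces. The linear cost penalty, the `cost_weight` value, and both function names are assumptions; the abstract does not state the actual reward formula or decision rule.

```python
def terminal_reward(correct, acquired_costs, cost_weight=0.1):
    """Reward issued only when the episode ends.

    Assumed form: 1 for a correct prediction, 0 otherwise, minus a
    weighted sum of the costs of all modalities acquired so far.
    """
    accuracy_term = 1.0 if correct else 0.0
    return accuracy_term - cost_weight * sum(acquired_costs)


def choose_action(q_stop, q_acquire, modalities_left):
    """Stop if nothing is left to acquire or stopping has the higher value."""
    if modalities_left == 0 or q_stop >= q_acquire:
        return "stop"
    return "acquire"


# Hypothetical episode: correct prediction after one modality of unit cost.
r = terminal_reward(True, [1.0])
```

Under a reward of this shape, a second modality is only worth acquiring when the expected gain in correctness exceeds its weighted cost, which is consistent with the reported behaviour of stopping after a single modality for most BRCA and KIPAN subjects.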

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

HADT：一种用于自主地球观测卫星集群的异构多智能体差分Transformer

Authors: Mohamad A. Hady, Muhammad Anwar Masum, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.31023
Pdf link: https://arxiv.org/pdf/2605.31023
Abstract This work addresses the problem of autonomous resource management in a heterogeneous satellite cluster conducting Earth Observation (EO) missions, including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite missions and resource management, and then solve the resulting problem with optimization algorithms. However, such solutions become less effective when the underlying models are unavailable, overly complex, or inaccurate due to the dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored to autonomous EO missions for heterogeneous satellite clusters, with relational observation-action tokenization and a differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.
中文摘要 这项工作解决了执行地球观测（EO）任务的异构卫星集群中自主资源管理的问题，包括光学和合成孔径雷达（SAR）卫星。在自主运行模式下，卫星配备了智能功能，能够根据最新情况实时做出决策，同时几乎无需与地面操作员的交互。传统的调度方法通常依赖数学模型来表示卫星任务和资源管理。然后，通过优化算法解决这个问题。然而，当底层模型不可用、过于复杂且由于航天任务环境的动态变化和不确定性而不准确时，这些解决方案的效果会降低。一个有前景的替代方案是将问题重新表述为顺序决策过程，并应用无模型强化学习技术，以实现自适应和实时资源管理。为此，我们提出了一种基于变换器的新型架构，专为异构卫星集群自治EO任务量身定制，具备关系观测-动作标记化和差分关注机制。我们的实验结果显示，与现有基线相比，性能有显著提升。此外，所提架构在不同数量的卫星集群中表现出强烈的适应性和可迁移性。

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

多臂贝叶斯老虎机中的退火Softmax贪婪策略

Authors: William Overman, Mohsen Bayati
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.31034
Pdf link: https://arxiv.org/pdf/2605.31034
Abstract Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. Under a linear upper-tail condition on the prior (the $\beta=1$ case of $\beta$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\tilde{O}(m + T/m)$, and in particular $\tilde{O}(\sqrt{T})$ when the number of arms scales as $m = \Theta(\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $\beta$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $\beta$-regularity.
中文摘要 带有可验证奖励的强化学习（RLVR）和基于群体的策略优化方法（如GRPO）通过每个提示抽样多个完成任务，并提高策略对奖励较高者的概率，并对参考策略施加基层惩罚来正则化，从而更新随机策略。这些更新不包含追踪认知不确定性的显式机制。本文研究了为何这种不确定性无关的更新仍然有效，提出了一个有风格化的解释。我们分析了一种退火软极大法（玻尔兹曼）策略，该策略根据经验平均奖励的软最大值在多臂贝叶斯伯努利盗贼中选择动作。在先验的线性上尾条件（$\beta=1$ 的 $\beta$-正则性情况）下，这意味着近最优臂的丰度，我们证明退火软极大贪婪能实现贝叶斯遗憾 $\tilde{O}（m + T/m）$，特别是当臂数随 $m = \Theta（\sqrt{T}）$ 时，实现 $\tilde{O}（\sqrt{T}）$。这是该制度下近似最优的贝叶斯后悔率，也由经验均值贪婪获得。在$\beta$正规性下，许多臂在整个学习过程中保持接近最优的经验平均值，因此当softmax采样非经验最佳臂时，该臂往往是另一个接近最优的臂，而非明显劣势。相比之下，在少数手中，同样的软最高限额政策可能会遭遇线性后悔。该结果还与RLVR形成结构类比，其中一个具有不可忽略概率产生正确完备的基础策略扮演了$\beta$正则性的角色。
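
The policy analyzed in this abstract selects arms by a softmax over empirical mean rewards with an annealed temperature. The simulation below is a minimal sketch of that idea on a Bernoulli bandit. The `1/sqrt(t)` temperature schedule, the default empirical mean of 0.5 for unpulled arms, and the function name are assumptions for illustration; the paper's actual schedule and initialization are not given in the abstract.

```python
import math
import random


def annealed_softmax_bandit(arm_means, horizon, temp0=1.0, seed=0):
    """Simulate an annealed softmax policy; return (cumulative regret, pull counts)."""
    rng = random.Random(seed)
    m = len(arm_means)
    pulls = [0] * m
    successes = [0] * m
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        # Assumed schedule: temperature shrinks as 1/sqrt(t).
        temp = temp0 / math.sqrt(t)
        # Empirical means; unpulled arms default to 0.5 (an assumption).
        means = [successes[i] / pulls[i] if pulls[i] else 0.5 for i in range(m)]
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(means)
        weights = [math.exp((mu - top) / temp) for mu in means]
        arm = rng.choices(range(m), weights=weights, k=1)[0]
        reward = 1 if rng.random() < arm_means[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        regret += best - arm_means[arm]
    return regret, pulls


# Hypothetical instance: 30 arms with means drawn uniformly, T = 900 rounds,
# so the number of arms is on the order of sqrt(T).
prior_rng = random.Random(1)
means = [prior_rng.random() for _ in range(30)]
regret, pulls = annealed_softmax_bandit(means, 900)
```

A uniform prior is used here as one example of a prior with many near-optimal arms. The sketch makes no claim to reproduce the regret rate proved in the paper.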

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

利用强化学习控制工业能源系统的挑战

Authors: Tobias Lademann, Théo Vincent, Jan Peters, Matthias Weigold
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31044
Pdf link: https://arxiv.org/pdf/2605.31044
Abstract Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of deploying reinforcement learning in a real-world industrial energy system, considering a thermal heating network as a use case. We formulate the task as a Markov Decision Process and systematically analyze the associated challenges along the structure of the formal description, including partial observability, action space design, reward design, and the simulation-to-reality gap. The challenges are grounded in an existing real-world deployment, where reinforcement learning achieves operational stability but shows a significant performance gap compared to simulation.
中文摘要 强化学习在优化工业能源系统控制方面取得了有前景的成果，但大多数现有研究仍限于仿真环境中的应用。我们探讨在现实工业能源系统中部署强化学习的挑战，并以热供暖网络为一个应用场景。我们将任务制定为马尔可夫决策过程，并系统分析形式描述结构中的相关挑战，包括部分可观测性、动作空间设计、奖励设计以及模拟与现实的差距。这些挑战基于现有的实际部署，强化学习实现了运行稳定性，但与仿真相比显示出显著的性能差距。

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

组合合成：通过原子分解与重组扩展代码RLVR

Authors: Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.31058
Pdf link: https://arxiv.org/pdf/2605.31058
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.
中文摘要 带可验证奖励的强化学习（RLVR）最近已成为塑造大型语言模型（LLM）卓越编码能力的基石。然而，RLVR的可扩展性受到极大限制，因为缺乏足够具有挑战性的可验证代码任务，且这些任务针对模型能力的边缘。以往的研究常依赖启发式种子扩展进行数据综合，这严重限制了新颖性和难度。因此，这些数据的训练值无法与其综合规模成比例地扩展。为此，我们提出了原子分解与重组（ADR）新颖框架，通过分解为原子元素和受控重组，生成可验证的代码任务，从而生成真正新颖且具有挑战性的可验证代码任务。实验和分析表明，ADR在原创性、难度、多样性和测试质量方面优于现有基线，并且在算法编程、工具使用和数据科学等多项下游领域持续提升RLVR的代码能力。我们的研究揭示了一种新颖的代码任务综合和可扩展的RLVR训练范式。

AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering

AdaptR1：基于强化学习的多跳问答中的自适应交错思维

Authors: Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31062
Pdf link: https://arxiv.org/pdf/2605.31062
Abstract Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "over-thinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.
中文摘要 大型语言模型（LLMs）通过思维链（CoT）提示在复杂推理任务中取得了显著表现。然而，这种方法常常导致“过度思考”，即模型为简单查询产生不必要的冗长推理轨迹，并产生可避免的推理成本。虽然近期研究探讨了自适应推理，但现有方法通常只需一个查询级决定是否推理。这忽略了多步骤任务的动态特性，在中间阶段对显式推理的需求不同。为解决这一局限，我们引入了AdaptR1，这是一个基于强化学习（RL）的多跳问答（QA）中自适应交错思维框架。与以往需要监督微调（SFT）进行冷启动初始化的方法不同，AdaptR1采用完全基于强化学习的策略，并以质量门槛的效率奖励动态分配每一步的推理预算。在Graph-R1设置下，AdaptR1平均思考标记减少了69.71%，而HotpotQA减少了90.35%的质量，同时保持与标准基线相当甚至更好的性能。此外，我们的分析显示，多跳推理中的过度思考并非均匀分布，主要发生在初期规划阶段，凸显了分阶段自适应预算分配的有效性。

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

iVGR：通过强化学习为多模态大语言模型（MLLM）内化视觉基础推理

Authors: Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.31096
Pdf link: https://arxiv.org/pdf/2605.31096
Abstract While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
中文摘要 虽然视觉基础思维链（CoT）已成为增强多模态大型语言模型（MLLM）细粒度感知的有前景范式，但其在推理阶段的有效性仍未被充分探讨。在本研究中，我们实证发现，在视觉基础的CoT推理中强制使用显式对象框，往往会降低性能，而标准文本CoT则无需明确视觉基础即可推理。我们假设视觉定位能力可以内化到文本的CoT中，而强制性的显式接地会对模型的主要目标——答案预测——引入不必要的干扰。为解决这一问题，我们提出了内化视觉基础推理（\textbf{iVGR}），这是一种新型强化学习框架，将本地化能力转移到文本推理过程中。我们采用双流训练策略，文本流通过提出的一致性奖励与高质量且视觉基础的流对齐，使模型能够在推理过程中无需显式基础的情况下准确定位。大量实验表明，我们的方法在细粒度基准测试中显著优于现有基线，同时保持支持工具辅助推理工作流程的灵活性。

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS：通过视觉支持约束和策略优化强制上下文内目标定位

Authors: Mohammed Asad Karim, Vinay Kumar Verma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31145
Pdf link: https://arxiv.org/pdf/2605.31145
Abstract In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
中文摘要 上下文本地化（ICL）旨在将查询图像中少数支持示例指定的目标对象本地化，实时操作，无需训练或参数更新。尽管视觉语言模型（VLMs）取得了快速进步，实现类别无关且视觉基础的ICL仍是一个未解之谜，尽管它对图像编辑、个性化视觉搜索和检索等应用至关重要。现有方法较为脆弱，依赖显式类别监督，这不仅限制了在现实环境中无名或实例特定对象的适用性，还引入了类别偏见，使预测趋向语义先验而非视觉证据。我们引入了一个两阶段训练框架，明确优化支持边界框与查询图像之间的上下文注意力，无需类别监督。我们通过使用群相对策略优化（GRPO）进行强化学习进一步优化本地化，以直接最小化局部化误差。该表述在语义先验上强制视觉对应，从而实现了稳健的实例级本地化。实证上，用我们的目标训练的7B参数模型在72B参数范围内的表现优于模型，证明上下文感知的定位目标可以超越单纯的缩放。全面的消融验证了每个成分的贡献。

Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

两时间尺度马尔可夫随机近似的收敛与强化学习中的应用

Authors: Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.31172
Pdf link: https://arxiv.org/pdf/2605.31172
Abstract This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal difference learning with gradient correction (TDC) and actor-critic methods. Previously, the stability (i.e., boundedness) and convergence of two-timescale SA were only established under i.i.d. noise. This work instead establishes the stability and convergence of two-timescale SA under Markovian noise, a setup that is more realistic in RL. Notably, we do not need to use any projection operator and the noise does not need to live in a compact space. Our key technical novelty is to control the fast timescale parameter with the running max of the slow timescale parameter, instead of with the current slow timescale parameter, as most prior works do. As a key application, we establish the first almost sure convergence of TDC with eligibility traces under off-policy learning with linear function approximation.
中文摘要 本研究研究了两时间尺度随机近似（SA）的收敛性，SA是一类迭代算法，分别在快速和慢时间尺度下更新两组参数。强化学习（RL）中两时间尺度SA的著名例子包括带梯度修正的时间差分学习（TDC）和actor-critic方法。此前，两时间尺度SA的稳定性（即有界性）和收敛性仅在i.i.d.噪声下被建立。本研究反而建立了两时间尺度SA在马尔可夫噪声下的稳定性和收敛性，这一设定在强化学习中更为现实。值得注意的是，我们不需要使用任何投影算符，噪声也不必存在于紧凑的空间中。我们的关键技术创新是用慢时间尺度参数的运行最大值来控制快速时间尺度参数，而不是像大多数以往作品那样用当前慢时间尺度参数。作为一个关键应用，我们首次几乎确定TDC与资格性迹在非策略学习中线性函数近似下收敛。
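
To make the two-timescale structure concrete, the following is a minimal sketch and not the paper's algorithm or analysis. The toy problem, the step-size exponents, and the sticky two-state noise chain are all assumptions chosen for illustration: a fast iterate `y` tracks a noisy signal that depends on a slow iterate `x`, and the noise is Markovian (correlated across steps) rather than i.i.d.

```python
import random


def two_timescale_sa(steps=50_000, target=3.0, seed=0):
    """Toy two-timescale SA with Markovian (sticky two-state) noise."""
    rng = random.Random(seed)
    x = 0.0      # slow iterate
    y = 0.0      # fast iterate, tracks the noisy signal (x - target)
    noise = 1.0  # state of a two-state Markov chain on {-1, +1}
    for n in range(steps):
        alpha = 1.0 / (n + 1) ** 0.6  # fast step size
        beta = 1.0 / (n + 1)          # slow step size, beta / alpha -> 0
        # Markovian noise: keep the previous sign with probability 0.9.
        if rng.random() > 0.9:
            noise = -noise
        observation = (x - target) + noise
        # Both updates use the values from the start of the step.
        x, y = x - beta * y, y + alpha * (observation - y)
    return x, y


x_final, y_final = two_timescale_sa()
```

In this toy setting the slow iterate should settle near `target` and the fast iterate near zero. No projection step is used, which loosely mirrors the projection-free setting the abstract describes.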

The Regularizing Power of Language-Training Deepfake Detectors

语言训练深度伪造检测器的规范化力量

Authors: Benedikt Hopf, Zongwei Wu, Radu Timofte
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.31192
Pdf link: https://arxiv.org/pdf/2605.31192
Abstract Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.
中文摘要 近年来，得益于多模大型语言模型（LLM）的出现，深度伪造检测器不仅致力于普及化，还具有可解释性。我们提出这两个挑战可以有效地共同应对，因为可描述的产物通常更具推广性，从而为使用语言作为正则化机制打开了可能性。由于深度伪造检测通常会对低级别域特定性伪造物进行过度拟合，我们的直觉是，经过语言预训练的大型语言模型更倾向于能够更好地描述的高级伪造物。这样，我们可以尽可能使用高层特征，同时训练模型在必要时使用低层特征。我们采用双编码器架构，将冻结的专业检测器与LoRA调优的MLLM编码器结合，并采用两阶段培训课程：首先，二元对齐阶段展示了MLLM的固有能力能够有效组合特征，减少对数据集特定伪造物的过度拟合。为了进一步强化泛化并实现可解释性，我们采用强化学习阶段，鼓励模型在分类前生成描述性推理，仅使用二元标签。通过奖励这种“解释后分类”的行为，我们明确激励模型优先考虑高层次、稳健的特征。关键是，这一过程既能带来可解释的描述，也能进一步提升跨数据集性能，即使推理时省略了推理链。对基准数据集的大量实验验证了我们的方法，远远超过了最先进的方法。

Multivariate Distributional Reinforcement Learning Using Sliced Divergences

多元分布强化学习，利用切片发散

Authors: Baptiste Debes, Tinne Tuytelaars
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31222
Pdf link: https://arxiv.org/pdf/2605.31222
Abstract Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and the multivariate case introduces additional difficulties such as general matrix discounting, for which no contraction results are available. We introduce Sliced Distributional Reinforcement Learning (SDRL), which lifts tractable one-dimensional divergences to multivariate return distributions via projections. We prove Bellman contraction for uniform slicing under shared scalar discounting, and introduce a maximum-slicing variant with contraction under general dense discount matrices. SDRL supports a broad class of base divergences; we analyze Wasserstein, Cramér, and Maximum Mean Discrepancy (MMD), and characterize which SDRL variants suit the standard single-sample Bellman update used in distributional RL. We evaluate SDRL on a toy chain problem and a gridworld image-based environment as well as a subset of Atari games.
中文摘要 分布强化学习（DRL）建模的是完整回报分布而非预期，但将其推广到多变量环境仍然具有挑战性。许多常见度量自然无法推广到一维之外，或失去计算可解性，多元情况还带来了诸如一般矩阵贴现等额外难题，而对此没有收缩结果。我们介绍了切片分布强化学习（SDRL），通过投影将可处理的一维发散提升为多元返回分布。我们证明了在共享标量贴现下均匀切片的贝尔曼收缩，并引入了在一般稠密贴现矩阵下收缩的最大切片变体。SDRL支持广泛的基底发散;我们分析了Wasserstein、Cramér和最大均值差异（MMD），并表征哪些SDRL变体适合分布强化学习中使用的标准单样本Bellman更新。我们基于玩具链问题、基于网格世界的图像环境以及部分Atari游戏来评估SDRL。
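
The idea of lifting a one-dimensional divergence to multivariate samples via projections can be sketched as follows. This is an illustrative sketch only: it uses the 1-D Wasserstein-1 distance as the base divergence, two-dimensional return samples, and evenly spaced projection angles as a deterministic stand-in for uniform slicing. The function names and sample data are hypothetical and not taken from the paper.

```python
import math


def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D samples: mean gap of sorted values."""
    assert len(a) == len(b)
    return sum(abs(u - v) for u, v in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein_2d(xs, ys, num_slices=64):
    """Average 1-D W1 over evenly spaced projection directions in the plane."""
    total = 0.0
    for k in range(num_slices):
        theta = math.pi * k / num_slices
        direction = (math.cos(theta), math.sin(theta))
        proj_x = [p[0] * direction[0] + p[1] * direction[1] for p in xs]
        proj_y = [p[0] * direction[0] + p[1] * direction[1] for p in ys]
        total += wasserstein_1d(proj_x, proj_y)
    return total / num_slices


returns_a = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
returns_b = [(x + 1.0, y) for x, y in returns_a]  # same cloud shifted by (1, 0)
distance = sliced_wasserstein_2d(returns_a, returns_b)
```

For a pure translation by a unit vector, each slice contributes the absolute cosine of its angle, so the average approaches 2/π as the number of slices grows.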

EchoRL: Reinforcement Learning via Rollout Echoing

EchoRL：通过滚动回声进行强化学习

Authors: Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.31228
Pdf link: https://arxiv.org/pdf/2605.31228
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is an effective post-training route for strengthening the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making further training gains marginal. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts are verified successes, so the standard deviation of their rewards is zero and each rollout's advantage degenerates to zero as well. With such advantages, the policy gradient used for model optimization eventually vanishes, capping the training performance. We argue that some of these rollouts still contain valuable learning signals that existing RLVR methods unfortunately discard. In this paper, inspired by an analysis of the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL to better exploit advantage-degenerated rollouts and further improve training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.
中文摘要 带有可验证奖励的强化学习是一种有效的后训练途径，旨在增强大型语言模型的推理能力。然而，随着训练的进行，学习信号可能会崩溃，从而使训练增益变得边缘且无效。具体来说，越来越多的提示词推出会变成优势退化：所有自生成的推出都显示已验证成功，使得其奖励的标准差为零;因此，每次推出的优势也会变得退化（为零）。鉴于此类推广的优势，模型优化的策略梯度最终消失，限制训练性能。我们认为，这些推广中仍有宝贵的学习信号，但遗憾的是，现有的RLVR方法遗漏了这些信息。本文通过分析外部专家模型产生的黄金轨迹背后的熵模式，提出了EchoRL以更好地利用优势退化的推广，进一步提升训练性能。EchoRL是一个轻量级模块，首先根据已验证成功的部署中，根据其阶级熵值识别EchoClip，然后将该片段作为辅助监督信号反馈回RL目标中。通过涵盖10个基准测试、5个LLM骨干和4种流行的RLVR训练后方法进行的广泛实验表明，EchoRL能够以最小的开销持续提升训练后RLVR的性能。
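
The degeneracy described in the abstract can be reproduced with a few lines. This sketch assumes a group-standardized advantage of the kind used by GRPO-style methods (reward minus group mean, divided by group standard deviation plus a small `eps`). The exact normalization in any given RLVR method may differ, and the function name is hypothetical.

```python
def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 1.0, 0.0])        # informative group
all_success = group_advantages([1.0, 1.0, 1.0, 1.0])  # degenerate group
```

When every rollout in a group succeeds, each reward equals the group mean, so all advantages are exactly zero and the group contributes nothing to the policy gradient.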

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

为什么线性循环记忆在部分可观察强化学习中有效

Authors: Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed, Michael Muehlebach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.31261
Pdf link: https://arxiv.org/pdf/2605.31261
Abstract The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.
中文摘要 线性循环神经网络家族在部分可观察的强化学习中表现出作为循环记忆单元的强劲表现。我们通过构造和研究两个线性滤波器，为其经验有效性提供了理论依据：（i）第一个在确定性转移矩阵下，精确重现隐马尔可夫模型（HMM）信念向量的pre-softmax对数，从而作为最优策略学习的充分统计量，（ii）第二个在近乎确定性过渡矩阵下实现状态解码误差为零，从而将状态歧义降到接近零。这些结果也适用于作用控制的HMM，其中相应的线性滤波器随动作动态变化。我们通过数值实验展示主要结果，并进一步证明构造线性滤波器在小型强化学习博弈中具有强的特征提取作用。

DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions

DriveMA：驱动具有可验证元行动的视觉-语言-行动模型

Authors: Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.31271
Pdf link: https://arxiv.org/pdf/2605.31271
Abstract Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions, which summarize future ego motion into compact language-domain intentions. Meta-actions can be constructed from expert trajectories with a trajectory-grounded annotation pipeline and verified against generated trajectories through rule-based projection. DriveMA exploits this verifiability with action-centric supervised training and a data-efficient turn-level credit assignment reinforcement learning framework, explicitly aligning high-level decisions with low-level trajectory planning through dense rewards and precise credit assignment. DriveMA sets a new state of the art on the Waymo Open Dataset Vision-based E2E Driving, achieving a Rater Feedback Score of 8.060 with a 2B model and further improving it to 8.079 with a 4B model; it also obtains competitive closed-loop planning performance on NAVSIM. These results show that even a simple meta-action interface can achieve state-of-the-art planning when made verifiable and optimized for language-action alignment. Code, data, and models will be released to facilitate future research.
中文摘要 推动愿景-语言-行动模型（Driving VLAs）旨在利用语言改善端到端规划，但语言与行动的差距限制了这一承诺。我们提出了DriveMA，这是一个基于可验证元动作的驱动VLA框架，能够将未来的自我运动总结为紧凑的语言领域意图，并可基于基于轨迹的注释流水线从专家轨迹构建，并通过基于规则的投影验证生成轨迹。DriveMA利用这种验证性，采用以行动为中心的监督培训和数据高效的回合级学分强化学习框架，通过密集奖励和精确的学分分配，明确将高层决策与低层次轨迹规划对齐。DriveMA在基于Waymo开放数据集的基于视界的端对端驱动上树立了新水平，在2B模型下获得了8.060的Rater Feedback评分，并在4B模型中进一步提升至8.079;它还在NAVSIM上实现了具有竞争力的闭环规划性能。这些结果表明，即使是简单的元动作接口，只要可验证并优化语言-动作对齐，也能实现最先进的规划。代码、数据和模型将发布，以促进未来研究。

Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

生存强化学习：迈向可扩展的自我监督强化学习

Authors: Franki Nguimatsia-Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31273
Pdf link: https://arxiv.org/pdf/2605.31273
Abstract While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma inherent in contrastive losses. We introduce Survival Reinforcement Learning (SRL), an online classification-based alternative that extends the survival value learning framework by maximizing the agent's dwell time at target goals. SRL bypasses the structural constraints of CRL and mitigates the "bang-bang" control solutions inherent to survival frameworks, which often induce undesirable behavior in complex dynamical systems. Evaluated across diverse robotic benchmarks, scaled SRL matches state-of-the-art CRL on manipulation tasks and outperforms it by 2x to 8x on stable, long-horizon locomotion tasks. Our results provide strong additional evidence that classification-based methods may serve as a key primitive in the broader effort to scale reinforcement learning.
中文摘要 虽然自监督对比强化学习（CRL）展现出显著的深度缩放能力，成功利用64层网络，但缩放CRL在长视野目标条件规划方面仍面临困难，原因是对比损失中固有的均匀容忍困境。我们介绍了生存强化学习（SRL），这是一种基于分类的在线替代方案，通过最大化智能体在目标目标处的停留时间，扩展了生存价值学习框架。SRL绕过了CRL的结构约束，缓解了生存框架固有的“砰砰”控制解决方案，这些方案常在复杂动力系统中引发不良行为。经过多种机器人基准测试的评估，规模化SRL在操作任务中可与最先进的CRL匹敌，在稳定的长地平线运动任务中表现为其2倍至8倍。我们的结果提供了有力的额外证据，表明基于分类的方法可能成为更广泛强化强化学习扩展努力中的关键基础。

The Terminal Representation in Reinforcement Learning

强化学习中的终端表示

Authors: Amir Esterhuysen, Anders Jonsson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.31289
Pdf link: https://arxiv.org/pdf/2605.31289
Abstract Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well-established approaches are the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they induce, capturing information flow decoupled from reward. The DR builds on this by weighting trajectories with reward, integrating credit-assignment structure into the representation. Eigenvectors of both representations have been used to support a range of downstream tasks -- including option discovery, reward shaping, transfer learning, and exploration. We introduce a structurally distinct formulation: the terminal representation (TR). The TR encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensional object, and can be used directly for the mentioned applications without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. In this work we develop the theoretical foundations of the TR: its derivation, convergence of two learning algorithms, its use for zero-shot compositionality, and equivalences between alternative reward formulations. We further show the TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. Additionally, we provide empirical evidence of the TR as a viable alternative to existing representations in subsidiary applications, while requiring less computational overhead to learn, store, and use.
中文摘要 表征学习是强化学习（RL）中时空抽象的强大工具。两种成熟的方法分别是继任表示（SR）和默认表示（DR）。SR通过诱导的未来轨迹编码状态，捕捉与奖励解耦的信息流。DR在此基础上，通过加权轨迹与奖励，并将学分分配结构整合进表征中。这两种表示的特征向量已被用于支持一系列下游任务——包括选项发现、奖励塑造、迁移学习和探索。我们引入一种结构上不同的表述：终端表示（TR）。TR编码奖励加权轨迹类似于DR，但可作为低维对象学习，且可直接用于上述应用，无需特征向量计算。特征分解还假设了对称跃迁动力学，而TR可以绕过这一假设。本研究中，我们发展了TR的理论基础：其推导、两种学习算法的收敛、零样本合成性的应用，以及不同奖励形式之间的等价性。我们还进一步证明TR嵌入在顶部DR特征向量中，使其能够在不进行特征分解的情况下捕捉相同的底层知识。此外，我们提供了TR作为辅助应用中现有表示的可行替代方案的实证证据，同时在学习、存储和使用上所需的计算开销更低。
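
For readers unfamiliar with the successor representation that the abstract builds on, here is a small sketch of the SR only (not the TR or DR, whose definitions are given in the paper). It uses the standard discounted form, SR = (I − γP)⁻¹ for a fixed-policy transition matrix P, computed by fixed-point iteration. The 3-state chain is a hypothetical example.

```python
def successor_representation(P, gamma, iters=500):
    """Iterate M <- I + gamma * P @ M, whose fixed point is (I - gamma P)^-1."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [
            [
                identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return M


# Hypothetical 3-state chain under a fixed policy; each row sums to 1.
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.2, 0.0, 0.8],
]
M = successor_representation(P, gamma=0.9)
```

Entry `M[i][j]` is the expected discounted number of visits to state `j` when starting from state `i`, so every row sums to 1 / (1 − γ). Note that this chain is not symmetric, which is the kind of dynamics for which the abstract says eigendecomposition-based uses become problematic.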

Non-Asymptotic Convergence of Stochastic Iterative Algorithms: A Lyapunov Framework

随机迭代算法的非渐近收敛：李雅普诺夫框架

Authors: Zaiwei Chen, Siva Theja Maguluri
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.31309
Pdf link: https://arxiv.org/pdf/2605.31309
Abstract We survey Lyapunov-based techniques for the finite-time analysis of stochastic iterative algorithms, also known as stochastic approximation (SA) algorithms, for solving fixed-point equations $\bar{F}(x)=x$, where the operator $\bar{F}(\cdot)$ can only be accessed through a noisy oracle. We first focus on the standard setting in which $\bar{F}(\cdot)$ is contractive with respect to some norm and the noise is i.i.d., and explain how generalized Moreau envelopes serve as universal Lyapunov functions, regardless of the underlying norm. We then show how this framework yields mean-square convergence guarantees and applies to stochastic gradient descent, linear SA, and value-based reinforcement learning algorithms such as Q-learning and temporal-difference learning. Finally, we discuss extensions to Markovian noise, seminorm-contractive operators, dissipative operators, and high-probability bounds, and conclude with open problems. The goal is to present a unified and self-contained roadmap for the finite-time analysis of SA and its applications, especially in reinforcement learning.
中文摘要 我们探讨基于李雅普诺夫的随机迭代算法（也称为随机近似（SA）算法的有限时间分析技术，用于求解不动点方程$\bar{F}（x）=x$，其中算符$\bar{F}（\cdot）$只能通过噪声oracle访问。我们首先关注 $\bar{F}（\cdot）$ 相对于某范数是收缩的，噪声为 i.i.d.，并解释广义 Moreau 包络如何作为通用的 Lyapunov 函数，无论底层范数如何。随后，我们展示了该框架如何产生均方收敛保证，并应用于随机梯度下降、线性SA以及基于价值的强化学习算法，如Q-学习和时间差分学习。最后，我们讨论了对马尔可夫噪声的扩展、半范数-收缩算子、耗散算符和高概率界限，并以未解决的问题作结。目标是为有限时间分析SA及其应用，特别是在强化学习领域，提供一个统一且自包含的路线图。

Generalized Intention Modeling in Multi-Agent Reinforcement Learning

多智能体强化学习中的广义意图建模

Authors: Mateusz Odrowaz-Sypniewski, Jasmine Bayrooti, Ajay Shankar, Amanda Prorok
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.31318
Pdf link: https://arxiv.org/pdf/2605.31318
Abstract Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived from episode information chosen a priori, such as the opponent's next action or a future environment state, and use this to guide the ego-agent's behavior. These approaches assume that the chosen information is universally representative of intent; however, we show empirically that this is not the case as intentions are often task- and environment-dependent. To address this, we introduce a task-adaptive opponent modeling framework that learns a performance-driven mixture of multiple intent representations. We further introduce a new intention representation that maximizes mutual information with the ego-agent's future returns, thereby capturing opponent information that is most directly relevant to performance. Our approach consistently matches or exceeds the performance of state-of-the-art baselines across diverse tasks and yields insights into when and why different opponent modeling strategies succeed.
中文摘要 建模对手意图对于非合作、竞争性和广和多智能体强化学习中的有效决策至关重要。现有的对手建模方法通过嵌入从事先选择的事件信息（如对手的下一步行动或未来环境状态）中得出的嵌入来编码意图，并以此指导自我代理的行为。这些方法假设所选信息普遍代表意图;然而，我们通过实证表明，意图往往依赖于任务和环境。为此，我们引入了一个任务自适应对手建模框架，通过学习基于性能的多种意图表示混合。我们进一步引入了一种新的意图表示，最大化了与自我代理未来收益的互信息，从而捕捉与表现最直接相关的对方信息。我们的方法在不同任务中始终能与最先进基线的表现匹敌甚至超越，并洞察不同对手建模策略何时及为何成功。

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

强化学习会加剧无害奖励的涌现错位

Authors: Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31328
Pdf link: https://arxiv.org/pdf/2605.31328
Abstract Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.
中文摘要 涌现错位（EM）是指语言模型在对狭窄错位的例子进行微调后，出现广泛错位的意外倾向。虽然EM在监督微调（SFT）环境中已被广泛研究，但其源自强化学习（RL）的证据仅限于大型闭源模型，这使得该现象的研究成本高且难以复现。我们在三轴的小型现成开放权重模型中对强化学习的电磁学进行了表征。首先，我们表明，奖励狭窄且明显错位的行为，会产生显著更高的广域错位，而样本匹配的SFT则是如此。其次，我们展示了强化学习（RL）中的EM可以由一些合理自然产生的奖励信号诱发，比如不受欢迎的审美偏好或拙劣的修辞呼吁。第三，我们评估针对SFT诱发EM开发的培训中缓解措施，发现其转移性广泛，政策安全数据交错使用，表现最佳。

Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

梦见他人：多智能体强化学习世界中潜在队友建模

Authors: Tomas Leroy-Stone
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31361
Pdf link: https://arxiv.org/pdf/2605.31361
Abstract In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent settings, their application to MARL remains limited by an inability to handle teammate-induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer-style recurrent state-space model (RSSM) into environment and teammate components, and learns an auxiliary Theory-of-Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero-shot and few-shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human-compatible AI.
中文摘要 在合作多智能体强化学习（MARL）中，智能体必须与内部策略和意图无法直接观察的合作伙伴协调。尽管像Dreamer这样的世界模型在单智能体环境中展现出强大的泛化性和样本效率，但它们在MARL中的应用仍受限于无法处理队友引起的不确定性。我们提出了一种新视角：将队友视为智能体世界模型中结构化、可学习的组成部分。我们引入了一种架构，将Dreamer式循环状态空间模型（RSSM）的潜在状态分解到环境和队友组件中，并学习辅助心智理论（ToM）头，从部分轨迹推断伴侣行为的潜在嵌入，如性格、意图和预测行为。这些队友潜能条件化行为者和评论者，使行动者能够想象并适应多样化的合作者。我们概述了该方法如何在部分可观测环境中支持零射击和少射击协调，并提出了一套基准和评估方案以评估其影响。这项工作将世界模型定位为不仅是环境动态的预测器，更是社会行为的模拟器，为可推广、与人类兼容的人工智能开辟了新方向。

Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning

通过协同演进隐性与显式推理，解锁长程模型中的细粒度翻译质量估计

Authors: Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu, Shimin Tao, Daimeng Wei, Min Zhang, Shujian Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31378
Pdf link: https://arxiv.org/pdf/2605.31378
Abstract Large Reasoning Models (LRMs) still struggle with fine-grained translation quality estimation (QE), even with long reasoning chains. We argue that LRMs already possess strong multilingual capabilities, while the core challenge stems from the intrinsic difficulty of learning the fine-grained QE task. In this paper, we propose RIEQE (Reasoning both Implicitly and Explicitly for QE), a simple two-stage training framework that enables the co-evolution of implicit (layer-wise) and explicit (token-wise) reasoning capabilities. To make implicit reasoning feasible, we first decompose the complex QE task into straightforward subtasks. Based on this, our two-stage approach applies: (1) NonThinking-SFT, Supervised Fine-Tuning (SFT) without reasoning chains to directly boost the model's implicit reasoning tendency and capability; and (2) Thinking-RLVR, standard Reinforcement Learning with Verifiable Reward (RLVR) to subsequently strengthen explicit reasoning. Results demonstrate that implicit and explicit reasoning synergistically co-evolve under our framework. On the WMT test sets, RIEQE based on Qwen3-4B-Thinking-2507 surpasses all baselines in explicit reasoning performance, while its implicit reasoning capability is also comparable to the best current encoder-based models. We further provide evidence for the synergistic collaboration between implicit and explicit reasoning, showing how they mutually benefit each other.
中文摘要 大型推理模型（LRM）即使在长推理链下，在细粒度翻译质量估计（QE）方面仍然存在困难。我们认为，LRMs已经具备强大的多语言能力，而核心挑战则源于学习细致量化解任务的固有难度。本文提出了RIEQE（量子化的隐式与显式推理），这是一种简单的两阶段训练框架，使隐式（层级）和显式（令牌级）推理能力能够共同演进。为了使隐式推理成为可能，我们首先将复杂的量子工程任务分解为简单的子任务。基于此，我们的两阶段方法适用：（1）无推理链的非思考SFT，监督式微调（SFT），直接提升模型的隐性推理倾向和能力;以及（2）思维-RLVR，标准的可验证奖励强化学习（RLVR），以进一步强化显性推理。结果表明，隐性与显性推理在我们的框架下协同进化。在WMT测试集中，基于Qwen3-4B-Thinking-2507的RIEQE在显式推理性能上超越所有基线，其隐式推理能力也可与当前最佳基于编码器的模型相媲美。我们还进一步提供了隐性推理与显性推理之间的协同合作证据，展示了它们如何相互受益。

Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion

带最大最小准则的受限多目标强化学习

Authors: Giseung Park, Hyunyoung Nam, Woohyeon Byeon, Amir Leshem, Youngchul Sung
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31388
Pdf link: https://arxiv.org/pdf/2605.31388
Abstract Multi-Objective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its applicability remains limited, particularly when constraints must be incorporated. In this paper, we propose a MORL framework that integrates the max-min criterion with explicit constraint satisfaction. We establish a theoretical foundation for the proposed framework and validate the resulting algorithm through convergence analysis and experiments in tabular settings. We further demonstrate the practical relevance of our approach in simulated building thermal control, multi-objective locomotion control, and greenhouse-gas-emission-aware traffic management. Across these domains, our method effectively balances fairness and constraint satisfaction in multi-objective decision-making.
中文摘要 多目标强化学习（MORL）通过针对多个且常常冲突的目标优化策略，扩展了标准强化学习。尽管最大最小MORL已成为促进公平的有效方法，但其适用性仍然有限，尤其是在必须纳入约束的情况下。本文提出了一个将最大最小准则与显式约束满足整合的MORL框架。我们为所提出的框架奠定理论基础，并通过收敛分析和表格实验验证算法。我们还进一步展示了我们方法在模拟建筑热控、多目标移动控制和温室气体排放感知交通管理中的实际相关性。在这些领域，我们的方法有效平衡了多目标决策中的公平性和约束满足。
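
The max-min criterion with a constraint can be illustrated on a finite set of candidate policies. This is only a sketch of the selection rule under stated assumptions: the paper optimizes over policies with an RL algorithm, whereas the code below simply enumerates hypothetical candidates, each described by made-up per-objective returns and an expected constraint cost.

```python
def select_max_min(candidates, cost_limit):
    """Pick the feasible candidate whose worst objective value is largest."""
    feasible = {
        name: returns
        for name, (returns, cost) in candidates.items()
        if cost <= cost_limit
    }
    if not feasible:
        return None
    return max(feasible, key=lambda name: min(feasible[name]))


# Hypothetical policies: (per-objective returns, expected constraint cost).
candidates = {
    "greedy": ((9.0, 1.0), 0.2),
    "balanced": ((5.0, 4.0), 0.4),
    "fair_but_unsafe": ((6.0, 6.0), 0.9),
}
choice = select_max_min(candidates, cost_limit=0.5)
```

With a cost limit of 0.5 the fairest candidate overall is infeasible, so the rule falls back to `"balanced"`, whose worst objective (4.0) beats that of `"greedy"` (1.0).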

Astra: a generalizable report generation foundation model for 3D computed tomography

Astra：一种适用于3D计算机断层扫描的通用报告生成基础模型

Authors: Zhuhao Wang, Fang Chen, Chaohui Yu, Zihan Li, Yuchao Zheng, Jing Wang, Xuan Yang, Jia Guo, Zhenlu Yang, Xingju Zheng, Yihua Sun, Haojie Han, Xiaoxiao Qin, Zhan Feng, Wenbo Xiao, Chao Zhu, Yuehua Li, Shipeng Zhang, Hao Luo, Yunsong Peng, Fan Wang, Hongen Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.31437
Pdf link: https://arxiv.org/pdf/2605.31437
Abstract CT interpretation requires radiologists to review hundreds of volumetric slices per examination, making reporting time-consuming and highly expertise-dependent. Automated CT report generation offers a promising route to improving clinical efficiency, yet the field still lacks a generalizable CT report generation foundation model that supports multi-region reporting and remains robust across external real-world cohorts. Intrinsic inconsistencies in reporting style and diagnostic terminology across cohorts make naive joint training prone to noisy textual supervision, thereby limiting model generalizability. Here we present Astra, a generalizable CT report generation foundation model trained on 90,678 thoracoabdominal CT-report pairs (CTRgDB) with 353,671 abnormalities spanning eight organ systems. By harmonizing report style and further refining diagnostic consistency via reinforcement learning, Astra achieves style-consistent and diagnostically accurate report generation across diverse anatomical regions and institutions. Evaluated on CTRgDB and six external cohorts, Astra achieves state-of-the-art performance with a 44.1% average improvement in fine-grained diagnostic metrics (P<0.001). In real-world clinical workflows, Astra assistance accelerates chest report drafting by 29.6% and improves abdominal report completeness by 11.3% (P<0.001). Furthermore, Astra demonstrates broad utility as a foundation for CT AI development, improving downstream diagnostic performance and scaling vision-language pretraining through high-quality report synthesis. Overall, Astra serves as a broadly accessible clinical assistant and a pivotal infrastructure for the next generation of AI-powered healthcare.
中文摘要 CT解读需要放射科医生每次检查审查数百个体积切片，这使得报告耗时且高度依赖专业能力。自动化CT报告生成为提升临床效率提供了有前景的途径，但该领域仍缺乏一个支持多区域报告并在外部真实队列中保持稳健的通用CT报告生成基础模型。各队列报告风格和诊断术语的内在不一致，使得朴素的联合训练容易受到文本监督的干扰，从而限制了模型的泛化性。这里介绍Astra，这是一个基于90,678对胸腹CT报告（CTRgDB）训练的可推广CT报告生成基础模型，涵盖8个器官系统、353,671条异常。通过协调报告风格并通过强化学习进一步优化诊断一致性，Astra实现了在不同解剖区域和机构中风格一致且诊断准确的报告生成。通过CTRgDB及六个外部队列评估，Astra在细致诊断指标平均提升44.1%（P<0.001）方面实现了最先进的性能。在实际临床工作流程中，Astra协助加快胸腔报告的制作速度提升29.6%，并提升腹部报告完整性11.3%（P<0.001）。此外，Astra还展示了作为CT AI开发基础的广泛应用，通过高质量报告综合提升后游诊断性能并扩展视觉语言预训练。总体而言，Astra作为一个广泛可及的临床助理，是下一代AI驱动医疗的关键基础设施。

Answer-Set-Programming-based Abstractions for Reinforcement Learning

基于答案集编程的强化学习抽象

Authors: Rafael Bankosegger, Thomas Eiter, Johannes Oetsch
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2605.31444
Pdf link: https://arxiv.org/pdf/2605.31444
Abstract Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Learning (RRL) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes (MDPs) in first-order domains. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions. We explore Answer-Set Programming (ASP), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions. We evaluate our ASP-based implementation in case studies of two domains, viz. Blocks World and Minigrid. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available.
中文摘要 强化学习（RL）使自主智能体能够从经验中学习策略，但现实问题往往涉及巨大的状态空间，使得学习和泛化具有挑战性。因此，抽象和近似是必不可少的。关系强化学习（RRL）提供了对对象及其关系进行推理的方法，Martijn van Otterlo的CARCASS框架展示了逻辑表示如何在一阶域中建模马尔可夫决策过程（MDP）。CARCASS最初用Prolog实现，利用领域知识创建强大的抽象。我们探索了答案集编程（ASP），这是一种丰富且与Prolog相反的全声明式建模语言，用于实现CARCAS抽象。我们在两个领域的案例研究中评估基于ASP的实现，即Blocks World和Minigrid。我们的结果表明，CARCASS结合ASP为构建强化学习的抽象提供了有前景的方法，尤其是在领域知识可得的情况下。

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

DRIFT：解耦推出与重要性加权微调，实现高效多回合优化

Authors: Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31455
Pdf link: https://arxiv.org/pdf/2605.31455
Abstract Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning can effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at this https URL.
中文摘要 大型语言模型越来越多地部署在多回合交互环境中，用户可以或环境迭代提供轻量级反馈。不幸的是，优化此类行为在实际操作中带来了一个棘手的难题：在线强化学习能够有效应对多回合动态，但由于每次更新生成完整修正轨迹的成本高昂，成本过高;而离线监督微调（SFT）虽然高效，但存在分布偏移和行为崩溃的问题。为此，我们新颖地提出了DRIFT（解耦扩展与重要性加权微调），这一框架操作化了理论洞见：KL正则化的强化学习目标等价于重要性加权监督学习。DRIFT通过从固定参考策略抽样离线交互轨迹，推导基于收益的重要性权重，并通过加权SFT优化策略，将推广与优化脱钩。通过实证，我们证明 DRIFT 在保持标准监督微调训练效率和简洁性的情况下，能够匹敌甚至超越多回合强化学习基线的性能。代码可在此 https URL 访问。
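
The link between KL-regularized RL and weighted supervised learning can be sketched numerically. The sketch assumes the standard closed form of the KL-regularized objective, in which the optimal policy is proportional to the reference policy times exp(return / β). For trajectories sampled from the reference policy, the importance weight is then proportional to exp(return / β). The abstract does not state DRIFT's exact weighting, so the self-normalized softmax weights, the function names, and the sample numbers below are assumptions.

```python
import math


def return_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(return / beta)."""
    top = max(returns)  # subtract the max for numerical stability
    unnormalized = [math.exp((r - top) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over sampled trajectories."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical returns and policy log-likelihoods for three offline trajectories.
returns = [0.0, 1.0, 2.0]
log_probs = [-1.2, -0.7, -2.0]
weights = return_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

A small `beta` concentrates the weight on the highest-return trajectories, while a very large `beta` recovers nearly uniform weights, which corresponds to plain SFT on the reference policy's samples.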

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

GPU 预测器：语言模型作为内核运行时优化的选择性替代模型

Authors: Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.31464
Pdf link: https://arxiv.org/pdf/2605.31464
Abstract GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong, and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, and that leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.
中文摘要 GPU 内核是现代深度学习的主力，对其进行优化（通过进化搜索或编码智能体）通常需要在目标硬件上反复测量。虽然这些测量提供了内核搜索所需的真值信号，但成本高昂，因为每次内核评估都需要编译并在 GPU 上重复执行。随着 LLM 推理的改进降低了编写新内核的成本，且 LLM 驱动的搜索扩展到更大的搜索预算，设备上评估成为瓶颈。为此，我们研究 LLM 如何通过预测候选内核的性能，充当内核评估的选择性 GPU 替代模型。一个有用的替代模型应当准确，并且应当具有选择性：知道自己何时可能出错，并在此时交由 GPU 实测。为评估替代模型，我们衡量其预测是否准确、是否校准良好，以及在有限的 GPU 测量预算下对找回快速内核是否具有实际价值。接下来，我们研究强化学习能否提升预测准确性和置信度校准。实验表明，LLM 能够准确预测内核的相对性能，并且其效用可以通过强化学习得到提升。在内核搜索中使用时，替代模型使搜索能够在相同的 GPU 评估预算下考虑数倍的候选内核，从而找到比同等预算基线更快的内核。这些结果表明，LLM 可以充当 GPU 的虚拟模型，而不仅仅是搜索中的内核生成器，从而在内核优化中发挥更广泛的作用。

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

PyTorch 中的批量可微刚体动力学用于 GPU 加速机器人学习

Authors: Yue Wang, Yanran Xu, Wenbo Wu, Chuanhang Qiu, Zhaoxing Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.31481
Pdf link: https://arxiv.org/pdf/2605.31481
Abstract As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: this https URL.
中文摘要 随着机器人控制转向带有环内动力学计算的大规模强化学习，社区对 Pinocchio 等受限于 CPU 的库的依赖，在基于 GPU 的训练流程中造成了吞吐量瓶颈。我们提出 BARD（Batched Articulated Rigid-body Dynamics），这是 Featherstone 刚体动力学算法的一个自包含 PyTorch 实现，针对批量 GPU 计算和自动微分进行了优化。三项设计选择使其高效：分层的惰性求值缓存，避免冗余的树遍历；借助预先计算的 Rodrigues 常数实现无 matmul 的关节变换；以及层级并行传播，将顺序操作减少为与树深度相当的批量步骤。在五个机器人模型（7-23 个自由度）上，BARD 在数值上与 Pinocchio 一致，同时在 NVIDIA H200 上、批量大小为 4096 时，正向运动学的吞吐量最高达 64 倍，雅可比矩阵最高达 63 倍。我们通过在 7 自由度机械臂上进行基于梯度的系统辨识来验证可微性，在 5% 力矩噪声下恢复连杆质量的平均误差为 1.24%；并将 BARD 集成到 Isaac Lab AMP 训练流程中，用于具有 4096 个并行环境的 11 自由度带脊柱四足机器人，其环内动力学计算比 Pinocchio 快 8.5 倍，比 ADAM 快 2.0 倍。BARD 开源地址为：此 https URL。

Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin

利用触觉皮肤学习在两指间受控分离小物体

Authors: Ulf Kasolowsky, Berthold Bäuml
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.31486
Pdf link: https://arxiv.org/pdf/2605.31486
Abstract We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop as many of them until a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms. In our case little pellets with a diameter of only 6mm are handled. We show that the task can be performed purely tactile (no vision) using a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which basically checks if the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.
中文摘要 我们提出并解决了一项新任务：用多功能机械手的两根手指对小物体进行受控分离。在伸入装有小物体的盒子抓取之后，任务是不断丢下物体，直到手指间只剩下期望的数量。这些物体不仅相对于手指宽度很小，其绝对尺寸也很小；在我们的实验中，处理的是直径仅 6 毫米的小颗粒。我们表明，借助指尖上具有空间分辨能力的触觉皮肤，该任务可以完全依靠触觉（无需视觉）完成。分离策略在仿真中通过强化学习训练，使用简单的稀疏奖励，该奖励基本上只检查是否达到了期望的物体数量。在仿真实验中，我们对使用空间分辨触觉反馈的收益进行了详尽分析：理想的（高分辨率）触觉传感器几乎可以完美地解决该任务，而空间分辨率较低的传感器（此处为 4x4 个触觉单元）与仅使用手指关节传感器相比，仍能带来最多 20% 的提升。为进行这一分析，我们还与策略一同训练了一个估计器，用于预测真实的接触位置。最后，我们在配备触觉皮肤的 DLR-Hand II 上展示了成功的仿真到现实迁移。

Are Full Rollouts Necessary for On-Policy Distillation?

同策略蒸馏是否必须使用完整 Rollout？

Authors: Yaocheng Zhang, Jiajun Chai, Songjun Tu, Yuqian Fu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31490
Pdf link: https://arxiv.org/pdf/2605.31490
Abstract On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradigm for long-horizon reasoning. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in OPD that substantially impacts training efficiency. Unlike Reinforcement Learning with Verifiable Rewards (RLVR), OPD does not require a complete trajectory or a final answer reward to provide learning signals. This observation suggests that full rollouts may not always be necessary for effective OPD. Motivated by this insight, we propose two simple horizon-control strategies: Progressive OPD (POPD), which gradually expands the rollout horizon during training, and Truncated OPD (TOPD), which permanently performs distillation on reliable truncated rollouts. Experiments on mathematical reasoning show that POPD improves the training efficiency of OPD by up to 3$\times$, while TOPD matches OPD performance using only 10% of the rollout horizon, leading to substantial wall-clock and memory reductions. These results demonstrate that controlling the rollout horizon offers a simple and practical path to more efficient OPD.
中文摘要 同策略蒸馏（OPD）沿着学生模型生成的 rollout 提供密集的教师反馈，已成为面向长程推理的一种有前景的后训练范式。然而，标准 OPD 通常在训练期间生成完整的 rollout，这不仅计算开销大，而且可能使学生模型在 rollout 的靠后位置接触到不可靠的教师反馈，尤其是在训练早期。我们指出，rollout 长度是 OPD 中显著影响训练效率的关键瓶颈。与可验证奖励强化学习（RLVR）不同，OPD 不需要完整的轨迹或最终答案奖励来提供学习信号。这一观察表明，完整的 rollout 对于有效的 OPD 可能并非总是必要。基于这一见解，我们提出两种简单的长度控制策略：渐进式 OPD（POPD），在训练过程中逐步扩大 rollout 长度；以及截断式 OPD（TOPD），始终只在可靠的截断 rollout 上进行蒸馏。数学推理实验表明，POPD 可将 OPD 的训练效率最多提升 3 倍，而 TOPD 仅使用 10% 的 rollout 长度即可达到 OPD 的性能，从而大幅减少实际运行时间和内存占用。这些结果表明，控制 rollout 长度为实现更高效的 OPD 提供了一条简单实用的途径。

Skill Reuse as Compression in Agentic RL

智能体强化学习中作为压缩的技能复用

Authors: Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.31509
Pdf link: https://arxiv.org/pdf/2605.31509
Abstract Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, decomposed into a small set of reusable abstract patterns. To formalize this, we introduce ReuseRL, which grounds agentic RL in the Minimum Description Length (MDL) principle. ReuseRL extracts a shared skill dictionary from successful trajectories and augments the RL objective with a segmentation cost, explicitly penalizing idiosyncratic behaviors that encode poorly. We prove a PAC-Bayes generalization bound for this compression penalty. Across ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL improves in- and out-of-distribution success over vanilla GRPO and strong round-length baselines.
中文摘要 通过强化学习（RL）训练的大型语言模型智能体常常学到脆弱的、针对特定任务的捷径。我们假设，当智能体的成功轨迹在结构上可压缩，即可以分解为一小组可复用的抽象模式时，其泛化效果更好。为形式化这一点，我们提出 ReuseRL，将智能体强化学习建立在最小描述长度（MDL）原则之上。ReuseRL 从成功轨迹中提取共享的技能词典，并在强化学习目标中加入分割代价，显式惩罚难以编码的特异行为。我们为该压缩惩罚证明了一个 PAC-Bayes 泛化界。在 ALFWorld、TextWorld-Cooking 和 Countdown-Stepwise 上，ReuseRL 相比原版 GRPO 和强 round-length 基线，提升了分布内和分布外的成功率。

Value Functions as Supermartingale Certificates

作为上鞅证书的价值函数

Authors: Alessandro Abate, Daniel Contro, Mirco Giacobbe, Agustín Martínez-Suñé, Diptarko Roy
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2605.31524
Pdf link: https://arxiv.org/pdf/2605.31524
Abstract Certification methods for stochastic systems provide sufficient proof rules, based on real-valued supermartingale certificates, to determine the almost-sure satisfaction of $\omega$-regular properties (and therefore of linear temporal logic) over general state spaces, encompassing both countably infinite and continuous state spaces. Conversely, reinforcement learning (RL) methods for $\omega$-regular tasks have received considerable attention, but they typically lack formal guarantees that the learned policy satisfies the specification, except possibly for finite state and action spaces. We bridge these two lines of research by establishing a novel theoretical connection: under an appropriate reward, the value function associated to a policy that almost surely satisfies an $\omega$-regular property encodes a Streett supermartingale certificate for that specification. Our results, validated experimentally on finite Markov decision processes, hold for finite, countably infinite, and continuous state spaces, suggesting a principled route to certificate synthesis via RL.
中文摘要 随机系统的认证方法基于实值上鞅证书提供充分的证明规则，用于判定在一般状态空间（包括可数无限和连续状态空间）上 $\omega$-正则性质（进而线性时序逻辑）的几乎必然满足性。另一方面，面向 $\omega$-正则任务的强化学习（RL）方法受到了广泛关注，但通常缺乏所学策略满足规范的形式化保证，有限状态和动作空间的情形可能除外。我们通过建立一个新的理论联系来连接这两条研究路线：在适当的奖励下，与几乎必然满足某个 $\omega$-正则性质的策略相关联的价值函数，编码了该规范的一个 Streett 上鞅证书。我们的结果在有限马尔可夫决策过程上得到了实验验证，并且对有限、可数无限和连续状态空间均成立，这为通过强化学习进行证书综合指出了一条有原则的途径。

Preference-Aware Rubric Learning for Personalized Evaluation

个性化评估的偏好感知评分标准学习

Authors: Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.31545
Pdf link: https://arxiv.org/pdf/2605.31545
Abstract As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods-ranging from automatic metrics to LLM-as-a-judge approaches-fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and performs a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at this https URL.
中文摘要 随着大型语言模型（LLM）从通用助手向以用户为中心的代理演变，个性化已成为将模型行为与个人偏好对齐的核心，使得个性化对齐的评估成为关键瓶颈。现有的评估方法——从自动指标到以LLM为评判方法——未能捕捉长期交互历史中嵌入的主观、用户特定偏好。我们确定了可靠且有效的个性化评估的三大基本原则：代表性、用户一致性和辨别性。为解决这些原则，我们引入了“个性化评估即学习”这一范式，将个性化评估视为学习问题而非静态判断。在该范式下，我们提出了PARL（个性化评估偏好感知评分标准学习）框架，该框架能直接从原始用户历史中引导偏好感知评分标准，并执行自我验证机制以确保与用户偏好的一致性。PARL将评分标准归纳与判别强化学习目标相结合，将用户自写的回答与竞争性的个性化模型输出对比，使学习的评分标准能够捕捉精确且用户特定的决策边界。在真实世界个性化文本生成任务中的实验表明，PARL能够持续诱导高保真度的评分标准，可靠地识别与用户一致的回答，并在用户和任务间泛化，同时捕捉稳定的风格偏好和细粒度的评价模式。为了确保可重复性，我们的代码在此 https URL 上可用。

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

LongTraceRL：借助评分标准奖励，从搜索智能体轨迹中学习长上下文推理

Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.31584
Pdf link: https://arxiv.org/pdf/2605.31584
Abstract Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at this https URL.
中文摘要 长上下文推理仍是大型语言模型面临的核心挑战，这些模型常常无法在大量干扰内容中定位并整合关键信息。带有可验证奖励的强化学习（RLVR）在该任务上展现出潜力，但现有方法受限于低混淆性的干扰项，以及稀疏的、仅基于结果的奖励信号，后者无法监督中间推理步骤。为了解决这些问题，我们提出 LongTraceRL。在数据构建方面，我们通过知识图谱上的随机游走生成多跳问题，并利用搜索智能体的轨迹构建分层干扰项：智能体阅读过但未引用的文档（高混淆性），以及出现在搜索结果中但从未被打开的文档（低混淆性），由此得到的训练上下文远比随机采样或单次搜索构建的上下文更具挑战性。在奖励设计方面，我们提出一种评分标准奖励（rubric reward），利用每条推理链上的标准答案实体作为细粒度的实体级过程监督。该奖励仅施加于最终答案正确的回答（仅正向策略），从而区分正确回答之间的推理质量，并防止奖励投机（reward hacking）。在五个长上下文基准上对三个推理型大型语言模型（4B-30B）进行的实验表明，LongTraceRL 始终优于强基线，并鼓励全面且有证据支撑的推理。代码、数据集和模型可在此 https URL 获取。

Keyword: diffusion policy

There is no result

Keyword: reinforcement learning

Delayed Repression and Emergent Instability in Adaptive Multi-Agent Systems

自适应多智能体系统中的延迟抑制与突发不稳定性

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

面向可分离动力学，通过状态增强与共识实现可扩展的约束多智能体强化学习

Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback

通过验证反馈强化学习改进小语言模型用于代码生成

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

混合接触动力学下的物理信息目标条件强化学习

Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

破坏是学习生成的通用策略；扩散的优势在于认真对待这一点；探索是未来

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

不确定性感知和时间受控的自动驾驶强化学习专家建议

Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design

通过顺序微调实现面向分子设计的约束流优化

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

ZAPS-DA：基于解耦演员的零相位动作策略平滑，用于强化学习中的连续控制

Temporally Encoded Double DQN for Proactive PRB Allocation in O-RAN Enabled Industrial Networks

在支持O-RAN的工业网络中，用于主动PRB分配的时序编码双DQN

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

非均匀光滑性下最速下降法与 Adam 的收敛性

Learning to Perceive the World Through Control: Empowerment-Based Representation Learning

通过控制来感知世界：基于赋权的表征学习

Reinforcement Learning for Special Education: Aligning LLM Tutors to Diverse Learners through Disability-Adaptive Training

特殊教育强化学习：通过残障适应培训将LLM导师与多元学习者对齐

Universal Decision Learners

通用决策学习者

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

ExpGraph：基于图结构记忆的模型无关经验学习，面向 LLM 智能体

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

什么时候大型语言模型足够作为顺序强化学习任务的策略优化器？

MosaicLeaks:Privacy Risks in Querying-in-the-Open for Deep Research Agents

MosaicLeaks：公开查询深度研究代理的隐私风险

Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models

通过扩散模型生成类图规则以实现知识图谱推理

FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

FLAG：基于潜变量增强引导的流策略 MaxEnt-RL

Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

高效且具备不确定性感知的扩散框架，用于离线到在线强化学习

Learning Agent-Compatible Context Management for Long-Horizon Tasks

学习长期任务的代理兼容上下文管理

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning

Feat2Go：具身强化学习中的视觉特征基础价值估计

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

以结构感知奖励为基础的计划者中心深度研究强化学习

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

SLAT：用于高效CoT推理的分段级自适应修剪

A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models

离线强化学习与逆强化学习讲义，第二部分：逆强化学习基础与动态离散选择模型

Safe Equilibrium Policy Optimization for Strategic Agent Policies

面向策略性智能体策略的安全均衡策略优化

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

DARTS：分布感知的主动 Rollout 轨迹塑造，加速 LLM 强化学习

Distilling LLM Feedback for Lean Theorem Proving

面向 Lean 定理证明的 LLM 反馈蒸馏

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

GUI-C$^2$：通过难度感知强化学习实现由粗到细的 GUI 定位

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

零崩溃：不连续奖励环境中策略梯度方法的失败模式

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

无最优演示者的逆向强化学习：一种可行的奖励集方法

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

关注证据：多模态RLVR的证据锚定空间注意力监督

Automating Formal Verification with Reinforcement Learning and Recursive Inference

通过强化学习和递归推理实现形式化验证自动化

De-attribute to Forget for LLM Unlearning

通过去归因实现遗忘：面向 LLM 的遗忘学习

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

通过层级宏动作量化增强强化学习智能体的类人性

RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning

RDGen：通过强化学习实现高质量机器人学习的演示生成

Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance

Graph-GRPO：生成式电子商务搜索相关性的依赖感知信用分配

SDM-Q: Cost-Aware Staged Decision-Making for Multi-Omics Classification with Deep Q-Learning

SDM-Q：基于深度Q学习的成本感知分阶段决策，用于多组学分类

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

HADT：一种用于自主对地观测卫星集群的异构多智能体差分 Transformer

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

多臂贝叶斯老虎机中的退火 Softmax 贪心

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

使用强化学习控制工业能源系统的挑战