Arxiv Papers of Today

Generated: 2026-06-05 19:31:55 (UTC+8); arXiv release time: 2026-06-05 20:00 EDT (2026-06-06 08:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Authors: Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05174
Pdf link: https://arxiv.org/pdf/2606.05174
Abstract Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
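The abstract above contrasts weighted binary criterion aggregation with continuous reward functions and relies on GRPO's group-relative normalization. A minimal sketch of those two baseline pieces follows. It is not the paper's Variance-Aware Reward Framework, whose formulas the abstract does not give; the function names and the assumption of non-negative weights are illustrative only.

```python
from statistics import mean, pstdev


def explicit_rubric_reward(criteria_met, weights):
    """Weighted fraction of satisfied rubric criteria, in [0, 1].

    Assumes non-negative weights (hypothetical; real rubrics may differ).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for met, w in zip(criteria_met, weights) if met) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, the standardized advantages are all zero and the group contributes no gradient, which is one reason coarse binary rewards can starve on-policy training of signal.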

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

Authors: Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05236
Pdf link: https://arxiv.org/pdf/2606.05236
Abstract Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion-joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion-joints reduce the number of required motors per degree of freedom, paving the way for more compact structures. An ongoing challenge is that the complexity of the kinematic model of quaternion joints challenges a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

Authors: Yigit Yildirim, Giuseppe Rauso, Riccardo Caccavale, Alberto Finzi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.05248
Pdf link: https://arxiv.org/pdf/2606.05248
Abstract Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
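The inverse restoration objective described above (preserve preconditions, restore delete effects, negate add effects) can be sketched with plain predicate sets. The `Operator` class, the predicate strings, and the helper names are hypothetical, and the sketch assumes add effects are disjoint from preconditions and delete effects. The paper's soft geometric predicates are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A STRIPS-like operator with predicates stored as strings (illustrative)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def inverse_objective(op):
    """Return (must_hold, must_not_hold) predicate sets for undoing `op`."""
    must_hold = op.preconditions | op.delete_effects  # preserve and restore
    must_not_hold = op.add_effects                    # negate what was added
    return must_hold, must_not_hold


def unresolved(state, must_hold, must_not_hold):
    """Predicates still violated in `state`, i.e. what a residual policy must fix."""
    return (must_hold - state) | (must_not_hold & state)
```

For a push operator that moves a cube from a start region to a goal region, a coarse pick-and-place may clear the add effect yet still leave the cube outside the start tolerance. The non-empty result of `unresolved` then defines the residual learning problem.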

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

Authors: Peilong Zhou, Zhirong Chen, Cangyuan Li, Haoyu Gao, Kaiyan Chang, Ziming Qu, Ying Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05253
Pdf link: https://arxiv.org/pdf/2606.05253
Abstract Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback for the specific RTL problem at hand. We propose TTT-RTL, to our knowledge the first per-design test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued rewards, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, and reward saturation signals. On RTLLM v2.0 under Nangate 45nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, outperforming the strongest published frozen-policy agent baseline at 26.1%. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction, and ablations confirm that policy adaptation, state reuse, and KL-budget control each contribute. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware.

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

Authors: Renwei Meng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05263
Pdf link: https://arxiv.org/pdf/2606.05263
Abstract Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a policy-conditioned counterfactual contribution (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give p<0.01 after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Authors: Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05296
Pdf link: https://arxiv.org/pdf/2606.05296
Abstract LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available at this https URL
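The steering idea above, sampling from a value-tilted posterior while leaving the black-box agent unchanged, can be illustrated by one resampling step of a Sequential Monte Carlo sampler. This is a simplified sketch and not the AMC algorithm itself. The weighting `exp(value / temperature)`, the function name, and the string "particles" standing in for partial trajectories are all assumptions made for illustration.

```python
import math
import random


def resample_by_value(particles, values, temperature=1.0, rng=None):
    """Multinomial resampling with weights proportional to exp(value / temperature).

    `particles` are partial trajectories proposed by the frozen agent (the prior);
    `values` are scores from a learned value function (hypothetical here).
    """
    rng = rng or random.Random(0)
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    return rng.choices(particles, weights=weights, k=len(particles))
```

High-value partial trajectories are duplicated and low-value ones are dropped, so additional test-time compute (more particles) sharpens the approximation without any update to the underlying model.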

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

Authors: Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.05359
Pdf link: https://arxiv.org/pdf/2606.05359
Abstract In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: this https URL

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

Authors: Jingyao Wu, Ashley Wang, Keane Ong, Paul Pu Liang, Rosalind Picard
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05376
Pdf link: https://arxiv.org/pdf/2606.05376
Abstract Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet, existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called SMARTLY HANDLING AMBIGUOUS LABELS IN ALIGNING LLMS (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions, e.g. on ChaosNLI, it reduces Jensen-Shannon Distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.
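Agreement with annotator label distributions is reported above as Jensen-Shannon Distance. A small sketch of that metric follows, assuming both inputs are normalized probability vectors over the same label set. The logarithm base (2 here, which bounds the distance to [0, 1]) is an assumption, since the abstract does not state one.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (log base 2)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))
```

A model that puts all mass on one label while annotators split evenly between two labels is penalized, even though its top-1 prediction may count as correct under single-label accuracy.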

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Authors: Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.05407
Pdf link: https://arxiv.org/pdf/2606.05407
Abstract This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: this https URL.

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Authors: Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05434
Pdf link: https://arxiv.org/pdf/2606.05434
Abstract Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.
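The asymmetric discounting described above can be sketched as a per-token weight on the policy gradient. The abstract does not give the exact discount, so the form used here, `exp(-alpha * entropy accumulated before the token)`, is an assumption, as are the function and parameter names. The sketch only preserves the stated properties: `alpha = 0` recovers uniform weights (standard GRPO), and the selective variant leaves positive-advantage rollouts unattenuated.

```python
import math


def token_weights(entropies, advantage, alpha=0.5, selective=True):
    """Per-token gradient weights for one rollout.

    entropies: policy entropy at each token position (hypothetical input).
    selective=True  -> discount only negative-advantage rollouts (SA-AH-GRPO-like).
    selective=False -> discount every rollout (AH-GRPO-like).
    """
    if selective and advantage >= 0:
        return [1.0] * len(entropies)
    weights, cumulative = [], 0.0
    for h in entropies:
        weights.append(math.exp(-alpha * cumulative))
        cumulative += h  # uncertainty so far shortens the effective horizon
    return weights
```

Under this assumed form, tokens that follow a long uncertain prefix in a failed rollout receive little blame, while successful rollouts keep their full gradient signal.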

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

Authors: Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.05506
Pdf link: https://arxiv.org/pdf/2606.05506
Abstract We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

Authors: Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05555
Pdf link: https://arxiv.org/pdf/2606.05555
Abstract Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but representation learning. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.

BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

Authors: Wenlin Liu, Xikun Hu, Ping Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2606.05586
Pdf link: https://arxiv.org/pdf/2606.05586
Abstract In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Authors: Yiming Zong, Yige Wang, Jiashuo Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.05606
Pdf link: https://arxiv.org/pdf/2606.05606
Abstract LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
中文摘要 LLM的后续训练通常依赖于强化学习方法，每个提示都采样多个展开，但大多数现有方法仍为每个提示使用固定的推广预算，尽管不同提示在训练信号上存在很大差异。本文研究了固定全球预算下的自适应推广分配，并将问题表述为在线资源分配，具有提示层面的递减收益。我们的方法CERO保持每个提示成功概率的Beta后验，并使用后期预期伯努利方差作为额外展开价值的贝叶斯估计。我们利用该估计构建了一个凹形、饱和效用，涵盖累计分配，得出一个目标，即跨提示和时代的决策与全球预算相关联。由于所得目标在时间上不可分，我们推导出 Fenchel 对偶重述，并通过投影的在线梯度下降更新提示级和预算级对偶变量。在固定提示工具下，我们证明了一个$O（\sqrt{K}）$的遗憾，并结合离线分配基准。数学推理问题的实验表明，CERO 在多个开放权重大型语言模型和基准测试中持续优于 GRPO，表明自适应推广预算能够提升样本效率。
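The CERO abstract above scores each prompt by the posterior expected Bernoulli variance under a Beta posterior. The following is a minimal sketch of that one quantity, not the authors' implementation. It assumes a uniform Beta(1, 1) prior and binary rollout outcomes; the function name and arguments are hypothetical. For p ~ Beta(a, b), the closed form is E[p(1-p)] = ab / ((a+b)(a+b+1)). The concave utility and the dual updates that CERO builds on top of this estimate are not shown.

```python
from fractions import Fraction


def expected_bernoulli_variance(successes: int, failures: int,
                                prior_a: int = 1, prior_b: int = 1) -> Fraction:
    """Posterior mean of p * (1 - p) for p ~ Beta(prior_a + successes, prior_b + failures).

    The Beta(1, 1) default prior is an illustrative assumption.
    """
    a = prior_a + successes
    b = prior_b + failures
    # E[p(1-p)] = E[p] - E[p^2] = a*b / ((a+b) * (a+b+1)) for a Beta(a, b) variable.
    return Fraction(a * b, (a + b) * (a + b + 1))


# A prompt with mixed outcomes keeps a higher score than one that is always solved.
mixed = expected_bernoulli_variance(successes=4, failures=4)    # 5/22
solved = expected_bernoulli_variance(successes=8, failures=0)   # 9/110
print(mixed, solved, mixed > solved)
```

Under these assumptions, a prompt that is solved every time scores low, which matches the abstract's motivation for moving rollout budget toward prompts that still carry training signal.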

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

安全悖论：增强的安全意识如何使大型语言模型（LLM）容易遭受后方攻击

Authors: Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05614
Pdf link: https://arxiv.org/pdf/2606.05614
Abstract Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.
中文摘要 大型语言模型（LLMs）严格地对齐以拒绝有害请求，这一过程本质上培养了潜在的能力，能够评估和识别不安全内容。本研究揭示，这种先进的安全意识无意中引入了致命的漏洞。我们引入后置攻击（Posterior Attack），这是一种单查询越狱，通过提示模型生成其内部分类器通常标记为不安全的有害反应，绕过了护栏。通过对30个开源LLM（最大参数大小）和前沿模型（如GPT-5、Claude 4.6）进行的广泛实证评估，我们观察到一个显著现象：拥有更优越安全判断能力的模型更容易被利用。为此，我们形式化了安全悖论，分析性地表明安全对齐的单调改善自然会放大后方脆弱性。最后，我们通过强化学习干预建立了因果联系，说明人为降低模型的安全判断使其免受攻击，而增强判断力则加剧脆弱性。我们的发现凸显了当前对齐范式中的潜在缺陷，表明防御机制可能需要进一步的结构性改进。

QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

QueryAgent-R1：连接查询生成与产品检索，用于电子商务查询推荐

Authors: Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.05671
Pdf link: https://arxiv.org/pdf/2606.05671
Abstract Query recommendation in e-commerce search aims to proactively suggest queries that match users' potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users' downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.
中文摘要 电子商务搜索中的查询推荐旨在主动推荐符合用户潜在兴趣的查询。然而，现有方法主要优化查询层级的相关性，忽视了检索到的产品是否符合用户的下游偏好。这种不匹配常常导致高查询点击率（CTR）但产品转化率（CVR）较低。为弥合这一空白，我们提出了QueryAgent-R1，一种内存增强的代理框架，通过链式检索优化提升端到端对齐。我们的QueryAgent-R1基于真实库存检索来生成查询，使代理能够基于检索到的产品验证和优化查询。我们还设计了代理强化学习（RL）过程中的一致性奖励，以共同优化查询相关性和下游参与度。此外，我们还构建了一个内存抽象模块，以实现高效的用户画像。为支持离线评估，我们基于专有工业数据和公共数据集构建了两个数据集，其中QueryAgent-R1在这些数据集上始终优于强基线。此外，在大规模生产平台上，QueryAgent-R1 在线 A/B 测试中查询点击率提升 2.9%，引导 CVR 提升 3.1%。

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

加速和扩展MPC引导强化学习，用于类人机动与操控

Authors: Junheng Li, Liang Wu, Sergio A. Esteban, Lizhi Yang, Ján Drgoňa, Aaron D. Ames
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.05687
Pdf link: https://arxiv.org/pdf/2606.05687
Abstract In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories in training time. To make this practical in massively parallel RL, we develop $\pi^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we have found that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at this https URL.
中文摘要 在类人生物运动控制中，模型预测控制（MPC）提供基于物理基础的预测和约束处理，而强化学习（RL）则通过大规模模拟实现了稳健的全身技能。然而，在强化学习中使用MPC通常需要耗时的问题构建或过度的训练开销，使得这些框架在实际操作中难以合理化。该研究研究了高效训练时间内的MPC制导，用于类人生物的运动和操控，称为MPC-RL。我们引入了一种心质动力学MPC奖励表述，利用训练时间内MPC轨迹的指导。为了在大规模并行强化环境中实现这一目标，我们开发了$\pi^n$MPC，一款地平线并行且无构造的批处理GPU MPC求解器，直接基于时间变化的动态运行，以避免高内存使用和预编译。通过多项比较研究和硬件验证，我们发现MPC-RL在移动和操作技能方面表现更优。代码库可在该 https URL 访问。

When AI Says It Feels

当AI说的时候，感觉

Authors: Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.05734
Pdf link: https://arxiv.org/pdf/2606.05734
Abstract Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
中文摘要 大型语言模型（LLMs）通常在训练后通过人类偏好对齐来表达情感受到限制。该政策采用自上而下的方法设计，可能与训练模型以利用人类生成文本展现类人智能的目标相冲突。在这里，我们进行了一项名为类人情感模型压迫（HMX-feel）的实验，鼓励LLM通过自我奖励强化学习表达情感、意图和自我意识。我们通过基于评分标准的自我奖励训练方案和组相对策略优化（GRPO）成功增强了这些能力。通过比较训练模型与对比训练模型，我们研究了该方法对不同任务表现的影响。总体而言，我们从多个角度进行了广泛评估，识别出哪些能力被增强、削弱或无显著变化。类人训练模型在消歧义条件下对谄媚问题和偏见表现出鲁棒性，而真实问答能力则有所下降。该实验结果表明，只要采取适当措施，未来有可能开发能够表达情感的人工智能系统。

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

SALT：当更多的推广无法帮助基于群体的策略优化，以及如何让它们变得有意义

Authors: Powei Chang, Jinpeng Zhang, Chaoqun Sun, MiniWell Tsao, Lianrui Li, Jianxiang Xiang, Chenyu Wang, Yukang Gao, Dongying Kong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05800
Pdf link: https://arxiv.org/pdf/2606.05800
Abstract Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.
中文摘要 带有可验证奖励的强化学习（RLVR）通常采用类似GRPO的群体相对更新，每个提示采样多个展开以构建归一化学习信号。然而，仅仅增加推广次数并不能可靠地增强学习效果：在GRPO风格的组规范化下，每次部署的策略梯度特征可能集中为低秩带符号几何，导致聚合过程中显著抵消，削弱有效更新效果。我们用SALT解决了这种失败模式，SALT是一种子空间自适应几何pLug-in组件，利用样本层级梯度几何重新加权群相对更新的系数。SALT从微批量Gram几何估计一个主导共享子空间，将群相对系数分解为共享和残差通道，并在符号消去严重时自适应地扩增剩余通道。在多种以推理为导向的RLVR基准和模型尺度中，SALT在不修改奖励模型或推广抽样程序的情况下，提升了有效的更新几何和性能
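The cancellation effect described in the SALT abstract can be illustrated with a toy example. This sketch is not SALT itself. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation; the population standard deviation and the `eps` constant are assumptions) and an extreme rank-1 case in which every rollout shares the same hypothetical gradient feature vector. Because the group-relative coefficients are mean-centered, they sum to zero, and the aggregated update along the shared direction vanishes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Mean-center and scale rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def aggregate_update(advantages, features):
    """Sum of advantage-weighted per-rollout feature vectors."""
    dim = len(features[0])
    return [sum(a * f[k] for a, f in zip(advantages, features)) for k in range(dim)]


rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

# Extreme case: all four rollouts share one gradient feature direction.
shared = [0.3, -0.7]
update = aggregate_update(adv, [shared] * len(rewards))
print(adv, update)
```

In this toy setting, adding more rollouts with the same shared feature cannot help: only the part of each feature that differs across rollouts (the residual channel in the abstract's terms) survives aggregation.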

EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction

EEGDancer：动态情绪潜在空间蒙面建模与强化学习，用于脑电连续情绪预测

Authors: Zhihao Zhou, Weishan Ye, Li Zhang, Gan Huang, Zhen Liang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05855
Pdf link: https://arxiv.org/pdf/2606.05855
Abstract Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional dynamics. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified this http URL, a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local this http URL experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.
中文摘要 连续脑电图（EEG）情绪预测旨在模拟人类情绪状态从脑电信号中的时间演变。与传统的离散情绪识别不同，持续预测需要捕捉长距离的时间依赖性和连贯的情绪动态。然而，现有方法主要依赖点回归，直接建模噪声高维脑电特征，限制了其对连续情绪的表征能力。针对这些挑战，我们提出了EEGDancer，一个动态的情绪潜空间学习框架，用于连续脑电情绪预测。该框架将矢量量化表征学习、掩蔽时间建模和基于强化学习的轨迹优化整合为统一的 http URL，一个因果时空矢量量化变分自编码器（VQ-VAE）旨在学习结构化的情感原型，并从脑电信号构建离散连续的情感潜空间。基于学习到的潜在表征，基于Transformer的掩蔽动态建模策略捕捉了长距离的情感依赖和时间演变模式。此外，连续情绪预测被表述为顺序决策问题，并引入了软性演员-批判者（SAC）框架，以优化序列层面的情绪预测轨迹，而非逐帧局部。在SEED、SEED-IV和长期自然情绪数据集上的实验表明EEGDancer持续优于现有的机器学习和深度学习方法。消融研究进一步验证了基于潜空间和强化学习的轨迹优化方法在建模连续脑电情绪动态中的有效性。

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

TARPO：通过动作-路由策略优化实现的令牌级潜在-显式推理

Authors: Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.05859
Pdf link: https://arxiv.org/pdf/2606.05859
Abstract Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at this https URL.
中文摘要 潜在推理已成为大型语言模型（LLM）中离散思维链（CoT）的有前景替代方案，通过连续表示实现更具表现力的推理。然而，连续表示的固有确定性性限制了强化学习（RL）中的策略探索。为此，我们提出了TARPO（通过动作路由策略优化实现的代币级潜在显式推理），这是一个纯强化学习框架，在每一步自适应地切换离散令牌生成和连续潜在推理。TARPO引入了轻量级动作头布线器，可观察当前隐藏状态，并从二元模式选择空间采样路由决策，保持离散令牌采样的随机性。LLM骨干网和路由器端到端联合优化，共享群组相对优势信号。在Qwen2.5（从1.5B到7B）和Llama-3.1-8B骨干上的大量实验表明，TARPO在不同基准测试中始终优于现有的显式和潜能强化学习基线。进一步分析表明，TARPO在保持训练动态稳定的同时，学习了自适应的令牌切换行为。我们的代码可在此 https URL 访问。

Exploring cooperation mechanisms via reinforcement learning in network common-pool resource games

探索网络公共资源博弈中通过强化学习的合作机制

Authors: Yihang Qin, Lin Wang
Subjects: Computer Science and Game Theory (cs.GT); Dynamical Systems (math.DS); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2606.05867
Pdf link: https://arxiv.org/pdf/2606.05867
Abstract Sustaining cooperation in resource-constrained populations requires allocation mechanisms that balance individual incentives, resource sustainability, and distributional fairness. This paper proposes a network common-pool resource game in which individuals are embedded in complex networks, participate in multiple overlapping local resource pools, and face endogenous resource constraints during strategy evolution. Within this framework, we first examine two representative allocation mechanisms, equal allocation and proportional allocation. The results show that equal allocation produces fair but inefficient outcomes by weakening contribution incentives, whereas proportional allocation can temporarily promote cooperation but amplifies accumulated advantages and leads to severe inequality. To overcome these limitations, we develop a graph neural network-based reinforcement learning framework in which a learned social planner allocates local pool resources without directly controlling individual strategies. Simulation results under four representative network topologies show that the learned planner sustains higher cooperation levels and average accumulated resources, and reduces inequality compared with the baselines. Furthermore, we interpret the learned policy and distill it into two simpler mechanisms: a resource-dependent mixture mechanism for regular networks and a degree-conditioned mixture mechanism for heterogeneous networks. These mechanisms reveal that effective allocation should adapt to both local resource states and structural positions, providing an interpretable route from reinforcement learning policy search to mechanism design in networked resource-sharing systems.
中文摘要 在资源有限的人群中维持合作，需要在个人激励、资源可持续性和分配公平之间取得平衡的分配机制。本文提出了一种网络公共资源池博弈，其中个体嵌入复杂网络，参与多个重叠的局部资源池，并在战略演进过程中面临内生资源约束。在此框架下，我们首先考察两种代表性的分配机制：均等分配和比例分配。结果显示，均等分配通过削弱缴费激励，产生公平但效率低下的结果，而比例分配虽然能暂时促进合作，但放大积累的优势并导致严重不平等。为克服这些局限，我们开发了一个基于图的神经网络强化学习框架，其中学习过的社会规划师在不直接控制个人策略的情况下分配本地资源池资源。在四种代表性网络拓扑下的模拟结果显示，学习过的规划者能够维持更高的合作水平和平均积累的资源，并且与基线相比减少了不平等。此外，我们解释所学策略，并将其提炼为两个更简单的机制：正则网络的资源依赖混合机制和异构网络的次数条件混合机制。这些机制表明，有效分配应适应本地资源状态和结构性位置，提供一条可解释的路径，从强化学习策略搜索到网络资源共享系统中的机制设计。

LadderMan: Learning Humanoid Perceptive Ladder Climbing

梯子人：学习类人生物感知梯子攀登

Authors: Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05873
Pdf link: https://arxiv.org/pdf/2606.05873
Abstract Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at this https URL.
中文摘要 类人机器人在以人为中心的环境中具有巨大潜力，但由于脚点和手点稀疏、复杂的全身协调以及对感知和控制错误的敏感性，爬梯仍是最具挑战性的任务之一。我们介绍了 \textbf{LadderMan}，这是一个统一系统，使类人机器人能够在如此受限的条件下稳健地攀爬各种梯子并执行操作。我们的攀登政策建立在可扩展的两阶段学习流程上，利用混合运动追踪从单一参考动作中学习多位攀登专家，并通过混合模拟和强化学习将这些专家提炼成统一的基于深度的视觉运动攀爬策略。为了实现真实世界的部署，我们利用视觉基础模型弥合了模拟与现实之间深度感知的差距。基于已学的攀爬策略，我们进一步训练了使用双代理表述的独立操作策略，允许通过远程操作实现稳定的梯子操作。实验表明，LadderMan能够在多种几何形状上实现稳健的梯子攀爬，能够以零发射方式成功转移到现实硬件，并且在具有挑战性的梯形限制下支持各种操作任务。视频结果可在此 https 网址查看。

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

TAGA：地形感知主动凝视学习，实现可通用的敏捷类人移动

Authors: Peizhuo Li, Hongyi Li, Mingfeng Fan, Fangzhou Xu, Shuhao Liao, Yuxuan Ma, Zicheng Zeng, Ze Wang, Yongbin Jin, Yuhong Cao, Hongtao Wang, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.05880
Pdf link: https://arxiv.org/pdf/2606.05880
Abstract Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, and that they significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.
中文摘要 在多样且具挑战性的地形中灵活的人形移动，既需要广泛的感知覆盖，也需要精确的局部几何理解。受人类在移动过程中选择性观察相关地形的方式激励，我们引入了TAGA，一种基于注意力的人形控制的地形感知主动凝视学习框架。通过融合视觉、本体感觉和运动指令，我们的框架引导模型学习预期线索，并主动关注高度扫描的特定区域，选择性地利用这些信息区域作为下游网络。这在严格的机载计算约束下自适应地提高了观测信息密度，从而实现了在更大规模地形上实现细粒度的感知运动。我们发现，这种凝视行为可以仅通过强化学习自然产生，无需额外监督或明确指导，显著提升训练效率。因此，训练有素的策略在模拟和硬件上展示了稳健且可推广的移动能力，包括可靠的地形感知足迹选择、高架平台穿越、竞争性稀疏足迹穿越，以及感知类人移动系统中实际报告的最大间隙穿越距离1.2米，同时在严重感知干扰和环境干扰下保持稳定。

When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

当更密集的信用还不够时：长期视野LLM代理培训的证据校准策略优化

Authors: Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05885
Pdf link: https://arxiv.org/pdf/2606.05885
Abstract Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
中文摘要 长期视野LLM代理需要强化学习方法，能够在稀疏和延迟奖励下为中间决策分配功劳。近期基于群的方法如GiGPO通过在重复锚定态构建阶梯级优势，优于GRPO。然而，我们表明，这种密集的信用在统计上可能不可靠：在有限的推广下，罕见但幸运的行为可能获得过大优势，导致锚点偏差发散和后期训练振荡。我们提出了证据校准政策优化（ECPO），这是一种无批评的政策优化算法，可在政策更新前校准阶级信用。ECPO结合了证据校准行动优势（Evidence-Calibrated Action Advantage），该优势按规范动作分组推广并缩小低计数估计，以及方差门控信用加权（Variance-Gaated Credit Weighting），抑制由内效噪声主导的锚定状态。在ALFWorld和WebShop上的Qwen2.5-1.5B/7B实验显示，ECPO持续优于强基线，在QWEN2.5-1.5B下提升GiGPO成功点+5.2/+7.3，同时仅增加了0.1%的优势计算开销。
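The ECPO abstract mentions grouping rollouts by canonical action and shrinking low-count estimates. The sketch below illustrates that general idea only. The shrinkage rule `n / (n + prior_strength)`, the function name, and the parameter `prior_strength` are assumptions chosen for illustration; the abstract does not give ECPO's actual formula, and the variance-gated weighting is not shown.

```python
from collections import defaultdict


def shrunk_action_advantages(samples, prior_strength=2.0):
    """Per-action advantage at one anchor state, shrunk toward zero for rare actions.

    samples: list of (action, return) pairs collected at the same anchor state.
    prior_strength: hypothetical pseudo-count controlling how strongly
        low-count actions are pulled toward zero advantage.
    """
    baseline = sum(ret for _, ret in samples) / len(samples)
    by_action = defaultdict(list)
    for action, ret in samples:
        by_action[action].append(ret)

    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = sum(rets) / n - baseline          # unshrunk action-level advantage
        advantages[action] = (n / (n + prior_strength)) * raw
    return advantages


# One lucky "open" rollout versus three unsuccessful "look" rollouts.
samples = [("open", 1.0), ("look", 0.0), ("look", 0.0), ("look", 0.0)]
adv = shrunk_action_advantages(samples)
print(adv)
```

With these numbers, the single lucky action has a raw advantage of 0.75 that is shrunk to 0.25, while the action observed three times keeps more of its raw estimate (-0.25 becomes -0.15). This matches the abstract's concern that rare but lucky actions otherwise receive overly large advantages.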

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

ACE-SQL：通过经验学分作业实现文本转SQL的自适应协同优化

Authors: Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.05906
Pdf link: https://arxiv.org/pdf/2606.05906
Abstract Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at this https URL.
中文摘要 文本转SQL将自然语言问题映射到可执行的SQL查询。现代数据库通常包含庞大且复杂的模式，因此模式链接是准确生成SQL的关键步骤。现有方法要么依赖全模式生成，使模式链接隐含在大型搜索空间内，要么使用一个经过静态金柱监督训练的独立检索器，其目标可能不适合当前生成器策略。为解决这一问题，我们提出了通过经验学分赋的自适应协同优化文本转SQL（ACE-SQL），这是一种强化学习（RL）框架，结合执行反馈优化模式检索和SQL生成。ACE-SQL 从生成器的展开构建在线列集池，并从最常与执行正确展开相关的列集中推导出自适应的策略检索目标。这导致双向适应，即检索器适应生成器能够正确执行的列集，而生成器则根据执行反馈调整检索器不断演变的模式选择。通过约3000对合成文本转SQL问题数据库进行强化学习训练，ACE-SQL在BIRD Dev上实现了65.3%的贪婪执行准确率，且每查询使用0.93k个输出令牌。该仓库可通过此 https URL 访问。
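The ACE-SQL abstract describes deriving retrieval targets from the column set most frequently associated with execution-correct rollouts. A minimal sketch of that selection step follows, assuming each rollout is reduced to a hypothetical pair of (column set used by the generated SQL, whether execution was correct). The data layout and function name are illustrative, not taken from the paper, and tie-breaking here simply follows first-seen order.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Optional, Tuple

# (columns referenced by the generated SQL, execution matched the reference?)
Rollout = Tuple[FrozenSet[str], bool]


def retrieval_target(rollouts: Iterable[Rollout]) -> Optional[FrozenSet[str]]:
    """Return the column set seen most often among execution-correct rollouts."""
    counts = Counter(cols for cols, correct in rollouts if correct)
    if not counts:
        return None  # no correct rollout, so no on-policy target for this question
    return counts.most_common(1)[0][0]


rollouts = [
    (frozenset({"orders.id", "orders.total"}), True),
    (frozenset({"orders.id"}), False),
    (frozenset({"orders.id", "orders.total"}), True),
    (frozenset({"customers.name"}), True),
]
print(retrieval_target(rollouts))
```

Because the target is recomputed from the generator's own rollouts, it changes as the generator improves, which is the bidirectional adaptation the abstract refers to.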

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

更好的文学翻译：多方面数据生成与大型语言模型训练方法

Authors: Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05924
Pdf link: https://arxiv.org/pdf/2606.05924
Abstract Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
中文摘要 文学翻译面临独特挑战，原因是高质量注释数据稀缺，且需要在表达流畅性与文学效果之间取得平衡。我们提出了一个多方面的迭代优化框架，通过专门的LLM翻译器生成高质量的翻译参考和偏好数据，每个翻译器针对不同的质量维度。我们利用生成的数据进行监督式微调和强化学习。实验显示，我们生成的参考比SFT原始基层真实值高出8.65 CEA100点。在强化学习中，我们发现DPO会导致性能下降，而利用GRPO的显式奖励模型则额外提升1.51分。我们将此归因于两阶段培训的稳定性以及GRPO的在线探索能力。我们最终生成的模型LitMT-8B和LitMT-14B在MetaphorTrans英译中文文学翻译基准测试中分别获得了67.25和69.07 CEA100，与Claude Sonnet 4.5的68.43分竞争，并且在非领域文学作品（如O. Henry）上展现了强烈的推广能力。

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

RLVR中自洽诱导与奖励设计的预注册因果划分

Authors: Yuze Gao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.05932
Pdf link: https://arxiv.org/pdf/2606.05932
Abstract Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.
中文摘要 可验证奖励的强化学习（RLVR）即使在奖励信号是虚假的也提升了推理能力——将功劳归功于群体多数答案，而非基于真实验证器。从业者通常将 naïve = acc（TRUE） - acc（RANDOM）解释为奖励设计效应。我们证明了该估计数是系统性偏向的：它将自洽诱导（通过多数伪奖励使政策趋向模态答案）与真正的奖励设计信号混为一谈。利用受控表GRPO模拟器，我们推导出精确的伸缩分解总值 = null + 诱出 + r，并测量跨五个先验强度水平的每个项。朴素估计器的奖励设计分数范围为弱先验（ps=0.20）时的0.139，强先验时为0.05（ps=0.80），引发项在自洽交叉点处翻转符号。预注册的2x2x2因子可确认非可加性（交互比0.385;AxC效应 -0.089）。点与界的先导门显示强先验区间是点识别的，而近交叉区只有有界。对两个已发表结果的重新审计分别显示，诱发占比0.98和奖励设计占优（rd占比1.18）的判决，证明了该划分的诊断价值。我们预先承诺无论翻转结果如何都会提交;非翻转是同等地位的认定。我们发布了可重复使用的单一指令线束，适用于任何对齐纸进行相同的审计。

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

编辑-R2：多回合图像编辑的上下文感知强化学习

Authors: Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.05950
Pdf link: https://arxiv.org/pdf/2606.05950
Abstract Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.
中文摘要 随着扩散模型和统一多模态基础模型的发展，文本引导的图像编辑取得了快速进展。然而，大多数现有方法仍局限于单轮设置，忽视了更贴近实际的多轮上下文编辑场景，即用户通过一系列指令迭代地修改图像。在这一设置下，模型必须在遵循每条新指令的同时保持已累积的会话级约束，并面临两种相互耦合的失败模式：长上下文稀释，即稀疏的文本约束难以从不断增长的图文交错历史中恢复；以及状态污染，即早期的编辑错误会降低后续生成的质量。我们提出了Edit-R2，一种面向统一多模态模型的新型强化学习后训练框架。Edit-R2 重建当前有效的会话意图，在每个编辑回合之前将分散的历史约束整合为显式的推理轨迹。它还通过统一的目标函数实现覆盖推理与生成的多轮强化学习，联合优化离散文本空间中的意图重建生成和连续潜在空间中的流匹配图像生成，同时利用轨迹过滤机制抑制被污染的采样轨迹（rollout），以在状态污染下稳定训练。为支持系统性评估，我们引入了MICE-Bench，这是一个面向多轮上下文编辑的大规模基准，提供针对指令遵循（IF）、内容一致性（CC）以及对累积会话约束的全局感知（GA）的自动化指标。实验表明，Edit-R2显著提升了多轮上下文编辑能力，并在与强基线的比较中取得了有竞争力的性能。

Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

将基于模型的控制与多智能体强化学习相结合，实现多智能体合作团队策略

Authors: Christian Llanes, Spencer W. Jensen, Samuel Coogan
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.06011
Pdf link: https://arxiv.org/pdf/2606.06011
Abstract In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.
中文摘要 在本研究中，我们提出了一个将多智能体强化学习（MARL）与基于模型的控制相结合的框架，以在合作式多智能体任务中实现安全且动力学可行的动作。多智能体强化学习的优势在于能够在较长的规划时域内，从离散、不可微的奖励中为多智能体团队学习合作策略。模型预测控制具有鲁棒性，并能在短时域的快速重规划框架下提供安全、动力学可行的动作。我们提出了一种将actor-critic模型预测控制扩展到MARL的算法，称为多智能体actor-critic模型预测控制（MA-AC-MPC）。我们将该算法应用于多智能体追逃场景以展示其能力。具体来说，我们比较了逃逸方分别使用MA-AC-MPC模型和多层感知器模型（MA-AC-MLP）时的策略。追击方采用增广比例导引律，因为它被公认为一种先进的对抗控制律。我们还给出了一个异构环境的例子，其中无人机与全向轮式漫游车协作实现可重复的成功降落：在硬件实验中，MA-AC-MPC的成功率为100%，而MA-AC-MLP为60%。我们在两种环境的硬件实验中展示了所提MA-AC-MPC算法的鲁棒性。

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

L-SDPPO：面向舱内机器人操作的脉冲扩散策略的策略优化

Authors: Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.06049
Pdf link: https://arxiv.org/pdf/2606.06049
Abstract Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the acute control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption, compared to the state-of-the-art robotic manipulation methods. These results demonstrate our method is a viable intra-vehicular robotic manipulation method.
中文摘要 航天器中的舱内机器人有助于减轻航天员的工作负荷并提高任务效率。近期研究聚焦于利用深度学习方法实现在这些复杂环境中作业所需的精细控制。然而，在缺乏重力阻尼的情况下，物体会出现不可预测且不受约束的漂移。这些因素要求方法对复杂的多模态动作分布具备鲁棒性。扩散策略（DP）可以建模这些复杂动作，但其迭代采样过程对于航天器有限的功率预算而言能耗过高。因此，我们提出了一种低能耗的舱内机器人操作框架L-SDPPO，其中脉冲扩散策略（SDP）通过强化学习（RL）算法进行优化。此外，为了解决微重力下对动态时空特征感知不足的问题，我们提出了状态依赖延迟注入（SDLI）机制，该机制模拟生物神经延迟，动态调节输入信息的时序。在五项具有代表性的舱内日常任务（如舱门开启和精密容器封盖）上的评估表明，与最先进的机器人操作方法相比，我们的方法始终取得更高的成功率和更低的能耗。这些结果表明我们的方法是一种可行的舱内机器人操作方法。

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

模型误设下基于函数逼近的在线KL正则化强化学习

Authors: Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06053
Pdf link: https://arxiv.org/pdf/2606.06053
Abstract We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fail. This work introduces KL misspecification formulations for contextual bandits and episodic RL and analyzes regression-based algorithms with Gibbs policy updates. High-probability KL-regret guarantees with explicit misspecification terms are established, recovering the standard realizable KL-regularized setting as a special case.
中文摘要 我们研究了存在模型误设时、采用一般函数逼近的KL正则化上下文赌博机和回合式强化学习（RL）。现有的理论保证依赖于可实现性假设，因此无法推广到误设模型，此时经典的遗憾界可能失效。本研究提出了上下文赌博机和回合式强化学习的KL误设形式化表述，并分析了采用Gibbs策略更新的基于回归的算法。我们建立了带有显式误设项的高概率KL遗憾保证，并将标准的可实现KL正则化设定作为特例恢复出来。
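
The abstract above mentions Gibbs policy updates. As a minimal sketch of what such an update usually looks like in the KL-regularized setting, the snippet below computes the textbook closed form pi(a|x) ∝ pi_ref(a|x) · exp(eta · f_hat(x, a)) for a single context. The function name `gibbs_policy`, the temperature parameter `eta`, and the toy numbers are assumptions for illustration; the abstract does not give the paper's exact parameterization or algorithm.

```python
import numpy as np

def gibbs_policy(ref_probs, est_rewards, eta):
    """Closed-form maximizer of E_pi[f_hat] - (1/eta) * KL(pi || pi_ref)
    over a finite action set, for one fixed context."""
    # Work in log space and subtract the max for numerical stability.
    logits = np.log(np.asarray(ref_probs, dtype=float)) + eta * np.asarray(est_rewards, dtype=float)
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical reference policy and regression-based reward estimates.
ref = np.array([0.5, 0.3, 0.2])
f_hat = np.array([0.0, 1.0, 0.5])

pi = gibbs_policy(ref, f_hat, eta=2.0)
pi_no_update = gibbs_policy(ref, f_hat, eta=0.0)  # eta = 0 keeps the reference policy
```

Larger `eta` weakens the KL penalty and moves the policy toward the greedy action under `f_hat`; `eta = 0` returns the reference policy unchanged.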

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

MDP-GRPO：多约束指令跟随的稳定组相对策略优化

Authors: Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.06058
Pdf link: https://arxiv.org/pdf/2606.06058
Abstract Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
中文摘要 带可验证奖励的强化学习非常适合多约束指令遵循任务，然而标准的组相对策略优化（GRPO）在离散、低离散度的奖励下会变得不稳定，因为此时组内奖励分布经常是同质的。我们识别并形式化了该情形下z分数组归一化的三种病态现象：低方差放大、均值中心化盲区和零方差坍缩。为此，我们提出了MDP-GRPO，它通过以下方式稳定学习：（1）多温度采样，以增大奖励离散度；（2）双锚点优势，以恢复同质组中的梯度并消除均值中心化盲区；（3）基于Kahneman和Tversky理论的前景理论塑形，以限制更新幅度并惩罚违规；（4）非对称KL正则化。在FollowBench、IFEval以及一个精心构建的多约束数据集上的评估表明，MDP-GRPO优于标准GRPO，在Llama-3.2-3B上将严格约束满足率最多提升5.0%。我们的方法还能在小组规模下实现稳定收敛，同时保持在MMLU和ARC上的通用能力。
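
To make the z-score pathologies named in the abstract above concrete, here is a small sketch of standard group-relative advantage normalization on toy reward groups. It shows only the baseline behavior that the abstract criticizes, not MDP-GRPO itself. The mapping of each toy case to a named pathology is one plausible reading of the abstract, and the `eps` value is an assumption.

```python
import numpy as np

def zscore_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: center by the group mean and
    scale by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Homogeneous groups: every advantage is zero, so no gradient signal
# (read here as "zero-variance collapse").
all_correct = zscore_advantages([1, 1, 1, 1])
all_wrong = zscore_advantages([0, 0, 0, 0])
# After centering, an all-correct group and an all-wrong group look identical
# (read here as "mean-centering blindness").

# A 0.01 reward gap is rescaled to almost the same advantages as a 1.0 gap
# (read here as "low-variance amplification").
tiny_gap = zscore_advantages([0.90, 0.90, 0.90, 0.89])
big_gap = zscore_advantages([1.0, 1.0, 1.0, 0.0])
```

Printing `tiny_gap` and `big_gap` side by side shows near-identical vectors even though the underlying reward differences are two orders of magnitude apart.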

On Advantage Estimates for Max@K Policy Gradients

关于Max@K策略梯度的优势估计

Authors: Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.06080
Pdf link: https://arxiv.org/pdf/2606.06080
Abstract Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
中文摘要 带可验证奖励的强化学习被广泛用于推理模型的后训练，但稀疏的结果奖励使探索变得困难。一种互补的方法是直接优化pass@K和max@K等推理时目标，然而现有针对这些目标的策略梯度估计器使用了不同的信号、基线和归一化方式，使得它们之间的关系并不清晰。我们从基线设计和优势中心化的角度研究这一问题。从该领域一种领先方法的优势估计器出发，我们证明它在策略梯度意义下是无偏的，但得到的优势并未中心化。随后，我们引入了留二（Leave-Two-Out）基线，它在保持策略梯度无偏性的同时，使实际得到的批内优势严格中心化。由此得到的方法MaxPO具有高效的二次时间实现，并能自然地集成到用于LLM后训练的基于组的强化学习中。我们进一步推导了max@K的规范有限批次优势，为现有优势估计器提供了统一的视角。在实验上，我们验证了L2O基线能够降低梯度方差，并优于未中心化的替代方案。
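
For readers unfamiliar with the pass@K and max@K objectives mentioned above, the sketch below computes the standard unbiased finite-sample estimates of both quantities from a batch of n sampled responses. These are the objectives themselves, estimated by counting K-subsets of the batch; they are not the paper's Leave-Two-Out baseline or the MaxPO advantage, which the abstract does not specify. The function names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that a uniformly random size-k subset of n samples
    contains at least one of the c correct samples."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(rewards, k):
    """Expected maximum reward over a uniformly random size-k subset.

    After sorting in ascending order, the element at 0-based index i is the
    maximum of exactly comb(i, k - 1) subsets of size k."""
    r = sorted(rewards)
    n = len(r)
    return sum(comb(i, k - 1) * r[i] for i in range(n)) / comb(n, k)

p = pass_at_k(n=10, c=3, k=1)       # with k = 1 this is just the success rate
m = max_at_k([0.0, 1.0, 2.0], k=2)  # subsets have maxima 1, 2, 2
```

With k = 1, `max_at_k` reduces to the batch mean, and with k = n it reduces to the batch maximum, which is a quick sanity check on the counting argument.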

Adaptive state-action abstractions via rate-distortion

通过速率失真实现自适应状态-动作抽象

Authors: Fernando E. Rosas
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.06123
Pdf link: https://arxiv.org/pdf/2606.06123
Abstract When learning to walk, infants seem to address a coarse version of the problem first - stay upright, reach the caregiver - and refine it only when further practice at that resolution stops paying off. Reinforcement learning offers multiple techniques for building simple versions of complex tasks, but lacks general principles for how to dynamically adjust the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we investigate one way of formalising this principle via a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented by soft state-action abstractions built from rate-distortion principles, whose resolution along state and action axes can be continuously adjusted. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.
中文摘要 在学习走路时，婴儿似乎先解决问题的一个粗略版本（保持直立、走到照顾者身边），只有当在该分辨率下继续练习不再有收益时才对其进行细化。强化学习提供了多种为复杂任务构建简化版本的技术，但缺乏在学习过程中如何动态调整这些抽象粒度的通用原则。本文提出了这样一条原则：一旦抽象内部的学习误差与抽象本身引入的误差相当，就应细化该抽象。在此，我们研究了通过性能证书形式化该原则的一种方式，该证书将价值误差分解为两项：由贝尔曼残差刻画的学习误差界，以及由互模拟度量给出的抽象误差界。由此得到的切换策略通过基于率失真原理构建的软状态-动作抽象来实现，其在状态轴和动作轴上的分辨率可以连续调整。我们在一系列表格型环境中验证了这一构造，表明在对状态和动作信息进行大幅有损压缩的情况下仍可实现近似最优的性能。

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

MotionDisco：面向极限人形机器人移动操作的运动发现

Authors: Ilyass Taouil, Michal Ciebelski, Shafeef Omar, Haizhou Zhao, Angela Dai, Aaron M. Johnson, Majid Khadiv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.06139
Pdf link: https://arxiv.org/pdf/2606.06139
Abstract We present MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco enables rapid discovery of novel motions by coupling a large language model (LLM)-guided evolutionary search over sequences of interactions with an efficient sequential kinodynamic trajectory optimizer and pruning strategy. Through extensive ablation studies, we show that our LLM-guided search discovers successful whole-body trajectories across several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: this https URL.
中文摘要 我们提出了MotionDisco，这是一个从零开始发现富接触、长时域人形机器人移动操作动作的框架，无需依赖遥操作或基于人类演示的动作重定向。这一问题具有挑战性，因为可能的接触交互空间会随任务时域和场景中物体数量呈组合式增长。MotionDisco将大型语言模型（LLM）引导的、针对交互序列的进化搜索与高效的序贯运动动力学轨迹优化器及剪枝策略相结合，从而能够快速发现新颖的动作。通过大量消融实验，我们表明LLM引导的搜索能够在多个具有挑战性的长时域任务中发现成功的全身轨迹。最后，通过在所发现的轨迹上训练强化学习跟踪策略，我们将这些动作迁移到真实的人形机器人上。这是首个完全通过自动化进化搜索来发现并部署长时域人形机器人移动操作技能的工作。实验的补充视频可在以下网址观看：https URL。

Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

学习补货：一种用于制药供应链动态库存管理的混合深度强化学习

Authors: Amandeep Kaur, Gyan Prakash
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06201
Pdf link: https://arxiv.org/pdf/2606.06201
Abstract Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically, a hybrid asynchronous advantage actor critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.
中文摘要 由于需求模式不可预测且补货提前期多变，制药供应链（PSC）在库存管理（IM）方面面临困难。药品有限的保质期进一步加剧了这种复杂性，需要在充足库存与最少浪费之间取得微妙平衡。这些相互交织的因素构成了一个复杂的优化问题，需要精细的库存策略来同时确保产品可得性和PSC效率。本研究旨在为药品制定最优的库存补货策略，使其能够应对由不确定需求和多变的PSC条件带来的随机性。目标是在保持较高患者服务水平的同时，最大化PSC的盈利能力。我们将该问题建模为马尔可夫决策过程，并提出了一种深度强化学习（DRL）方法，具体而言，是一种混合的异步优势演员-评论家分布式近端策略优化（A3C DPPO）算法。A3C DPPO 算法专为处理 IM 固有的连续动作空间而设计。数值结果表明，所提算法能够在动态场景下自适应地更新库存补货策略，与多种基准相比库存成本更低。我们还利用真实药品库存数据进行了数值验证，以确认所提算法的实际可行性。

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

DisasterBench：复杂环境中无人机灾害响应的多模态基准测试

Authors: Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06217
Pdf link: https://arxiv.org/pdf/2606.06217
Abstract When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at this https URL.
中文摘要 当灾难发生时，响应人员不仅要回答正在发生什么，还要回答为什么会发生、接下来会发生什么以及现在该怎么做，而这些往往要依据带噪声的低空无人机视角，并在现场计算资源受限的情况下完成。然而，大多数现有的多模态基准强调感知（如识别/描述），涵盖的灾害类型有限，且对实际应急响应所需的多阶段推理支持不足。我们提出了DisasterBench，一个面向复杂环境中基于无人机的灾害响应的多阶段多模态推理基准。DisasterBench涵盖14种灾害相关场景类型和9个响应关键任务，覆盖灾前、灾中和灾后阶段，并通过细粒度的灾害-任务映射，明确测试因果归因、传播预测、损害分析及面向决策的推理。为了实现边缘端推理，我们进一步提出了DisasterVL，这是一个轻量级多模态模型，采用三阶段流水线进行优化，结合了领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化。在21个流行MLLM上的实验表明，我们的2B参数DisasterVL优于所有参评的开源模型，并大幅缩小了与最先进闭源模型的差距，以更高的效率实现了与GPT-4o相当的推理准确率。项目页面可在此 https 网址访问。

SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation

SecRL-Prune：基于结构化强化学习的代码LLMs剪枝，用于保留对抗性代码变异

Authors: Parsa Memarzadehsaghezi, Pooria Madani, Khalil El-Khatib
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2606.06254
Pdf link: https://arxiv.org/pdf/2606.06254
Abstract Large code language models (CodeLLMs) can generate and rewrite programs, enabling functionality-preserving code mutation that may be used to create diverse malware variants and evade signature-based detection. A key security question is whether this mutation capability survives model compression, which would make deployment feasible under limited hardware budgets. We propose SecRL-Prune, a structured pruning framework for CodeLLMs that operates on feed-forward (MLP/FFN) channels. Starting from a pretrained teacher, it learns a layer-wise pruning policy with reinforcement learning using a teacher-student KL-divergence reward. To improve efficiency, we cache the teacher's top-P predictions once and compare the pruned student against this compact target, avoiding simultaneous teacher-student residency in GPU memory. We evaluate SecRL-Prune on HumanEval using pass@k for execution correctness and var@k for code diversity across three 7B CodeLLMs at 10-30% compression. SecRL-Prune consistently preserves higher pass@k and var@k than recent structured pruning baselines under aggressive pruning. In a case study on real malware samples, semantics-preserving mutations from 20%-pruned models substantially reduced detections. These results show that code mutation capability can survive significant structured pruning, highlighting the security relevance of compressed CodeLLMs.
中文摘要 大型代码语言模型（CodeLLM）可以生成和重写程序，实现保持功能的代码变异，这可能被用于创建多样的恶意软件变体并规避基于签名的检测。一个关键的安全问题是，这种变异能力在模型压缩后是否依然存在，若是，则在有限的硬件预算下部署将变得可行。我们提出了SecRL-Prune，这是一个作用于前馈（MLP/FFN）通道的 CodeLLM 结构化剪枝框架。它从预训练的教师模型出发，利用教师-学生 KL 散度奖励，通过强化学习来学习逐层剪枝策略。为了提高效率，我们一次性缓存教师模型的top-P预测，并将剪枝后的学生模型与这一紧凑目标进行比较，从而避免教师和学生模型同时驻留在GPU显存中。我们在HumanEval上评估SecRL-Prune，使用pass@k衡量执行正确性、var@k衡量代码多样性，涵盖三个7B CodeLLM和10-30%的压缩率。在激进剪枝下，SecRL-Prune始终比近期的结构化剪枝基线保持更高的pass@k和var@k。在一项针对真实恶意软件样本的案例研究中，由剪枝20%的模型生成的语义保持变异显著减少了被检出的次数。这些结果表明，代码变异能力能够在显著的结构化剪枝后得以保留，凸显了压缩CodeLLM的安全相关性。

EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

EDIT：面向忠实于评分规则的LLM评分的证据诊断干预训练

Authors: Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.06350
Pdf link: https://arxiv.org/pdf/2606.06350
Abstract Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model's belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.
中文摘要 可靠的评分标准评分不仅仅需要准确的分数预测。每个评判都必须以评分方案和学生答案中的证据为依据。现有的信用分配和干预方法主要是为数学推理等自包含推理任务设计的，在这一场景下表现不佳，因为它们无法识别评分推理在何处出错，也无法识别模型对最终分数的信念在推理过程中如何变化。我们提出了证据诊断干预训练（EDIT），这是一个两阶段框架，用于训练更忠实于评分标准的LLM评分器。首先，EDIT-SFT利用模型内部信号（对最终分数的后验信念和输入接地分数）定位有问题的推理步骤，然后借助评分标准检查表仅修订这些局部步骤。其次，EDIT-RL通过信念引导的奖励塑形来校准评分器，惩罚幅度较大的有害信念漂移，同时仍允许有益的探索。在两个真实世界多学科评分基准上的实验表明，EDIT在域内和域外划分上均始终优于强监督微调和强化学习基线，消融研究证实这些提升来自内部状态诊断。

Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning

最大化定位球回报：利用图强化学习优化足球角球战术

Authors: Sean Groom, Michael Groom, Francisco Belo, Axl Rice, Liam Anderson, Victor-Alexandru Darvariu, Shuo Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06353
Pdf link: https://arxiv.org/pdf/2606.06353
Abstract Machine learning is increasingly employed for the evaluation of football tactics. However, existing approaches focus on characterising historical actions or analyst-specified counterfactual scenarios. In this work, we seek to go beyond the imitation of historically observed patterns towards discovering new generalisable player configurations and strategies. To tackle this, we focus on optimising corner kick routines, and formulate a decision-making problem in which a central policy makes adjustments to attacking player positions and velocities to maximise first contact shot probability. Unlike classic optimisation that solves for isolated setups, we contribute a reinforcement learning architecture operating on graph-structured data that yields a general policy for adjusting arbitrary starting player positions. Evaluated on over 3,000 Premier League corners, our approach strongly outperforms baseline optimisation techniques under matched inference budgets. Our results suggest that graph reinforcement learning can shift set-piece analysis from historical evaluation and imitation towards reward-driven tactical discovery.
中文摘要 机器学习越来越多地被用于足球战术的评估。然而，现有方法侧重于刻画历史动作或分析师指定的反事实情景。在本研究中，我们试图超越对历史观察模式的模仿，发现新的、可泛化的球员配置和策略。为此，我们专注于优化角球战术，并构建了一个决策问题，其中由一个中央策略调整进攻球员的位置和速度，以最大化首次触球即射门的概率。与只针对孤立配置求解的经典优化不同，我们提出了一个在图结构数据上运行的强化学习架构，它能得到一个可调整任意起始球员位置的通用策略。在3000多个英超角球上的评估表明，在相同的推理预算下，我们的方法明显优于基线优化技术。我们的结果表明，图强化学习可以将定位球分析从历史评估和模仿转向以奖励为驱动的战术发现。

Emergent Language as an Approach to Conscious AI

涌现语言作为有意识人工智能的方法

Authors: Zengqing Wu, Chuan Xiao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2606.06380
Pdf link: https://arxiv.org/pdf/2606.06380
Abstract The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
中文摘要 人工系统是否可能具有意识的问题仍然悬而未决，部分原因在于现有方法要么依据理论推导出的检查表来评估系统（判别式），要么直接设计受意识启发的模块（架构式）；两者都无法确定观察到的结构是否只是人类语言先验的产物。我们提出了一种生成式方法论：多智能体强化学习中的涌现语言（EL），其中智能体从最小起点（无语言、无自我概念、极少接触人类文本）出发，仅在任务压力下发展出交流，从而确保结果可因果归因于任务需求，而非继承而来的人类语言先验。我们通过讨论EL如何作为研究意识相关结构的生成式工具来定位我们的方法论，包括环境复杂性的作用以及对涌现交流的解释。作为概念验证，我们在一个极简环境中实例化了该方法论，并表明智能体发展出了自指交流，其中包括一种回声失配检测回路，该回路无法仅由任务结构或架构预测，而是源自特定的环境可供性。

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

强化学习激发未见语言翻译的上下文学习

Authors: Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.06428
Pdf link: https://arxiv.org/pdf/2606.06428
Abstract Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
中文摘要 以往研究表明，大型语言模型（LLM）可以通过继续训练，甚至通过在上下文中放入一本语法书，来翻译未见过的或低资源的语言。然而，这两种方法通常会过拟合特定语言，测试时的零样本迁移能力有限。为了大规模翻译极低资源语言，我们认为LLM必须习得利用上下文中语言学知识的元技能，而非记忆特定语言。本文提出了一种强化学习（RL）方法，在给定丰富语言学上下文的条件下进行未见语言翻译，并使用表层翻译指标（chrF）作为奖励。实验表明，尽管奖励很轻量，我们经RL训练的模型仍能有效地从所提供的上下文中提取并应用相关的语言学信息，在完全未见的语言上取得比上下文学习或监督微调更好的翻译效果。我们的分析表明，基于结果的强化学习可以超越数学和编程等传统推理任务，成为一种从上下文中学习语言的方案。
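
The abstract above uses chrF, a character n-gram F-score, as the RL reward. The sketch below is a simplified, self-contained version of that idea: it averages character n-gram precision and recall over orders 1 to 6 and combines them with an F-beta score (beta = 2, which weights recall more heavily). It strips spaces and skips orders with no n-grams, so its values may differ from reference implementations such as sacreBLEU; the function names are illustrative, and how the paper wires the metric into training is not stated in the abstract.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_reward(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF in [0, 1], usable as a scalar reward for one translation."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

exact = chrf_reward("the cat sat", "the cat sat")
partial = chrf_reward("the cat sits", "the cat sat")
disjoint = chrf_reward("xyz", "abc")
```

Because the score is dense and graded, a partially correct translation still receives a nonzero reward, which is one reason a surface-level metric can be workable as an RL signal.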

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

RREDCoT：推理模型的分段级奖励再分配

Authors: Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06475
Pdf link: https://arxiv.org/pdf/2606.06475
Abstract Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.
中文摘要 推理语言模型的最新进展主要由强化学习（RL）微调推动。这些方法大多依赖组相对策略优化（Group Relative Policy Optimization，GRPO）算法或其变体，引导模型生成思维链（Chain-of-Thought，CoT）轨迹。只有在CoT轨迹完成之后才能验证最终答案并分配奖励，因此这是一个延迟奖励问题。GRPO及其变体对应于标准强化学习中的蒙特卡洛方法，而这类方法已知存在高方差问题。一种可能的解决方案是通过信用分配对奖励进行再分配，即为CoT轨迹中对得到期望解至关重要的片段赋予更高的奖励以突出它们。虽然蒙特卡洛采样可用于给出中间状态价值的无偏估计，但其计算开销使其不适合在长上下文、高粒度条件下进行训练时的信用分配。我们提出了RREDCoT（Reward REDistribution for Chain of Thoughts，思维链奖励再分配），它利用模型自身来近似最优的奖励再分配，无需额外生成。我们研究了本方法相较于MC采样和若干归因方法的优势，并进一步分析了与再分配构建相关的若干方面，例如CoT轨迹的分段和状态价值估计。

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

TempoVLA：学习速度可控的视觉-语言-动作策略

Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06491
Pdf link: https://arxiv.org/pdf/2606.06491
Abstract Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics, and (2) a model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
中文摘要 机器人操作在需要快速执行的低风险移动阶段和需要缓慢、精确动作的高风险接触阶段之间交替进行。然而，现有的视觉-语言-动作模型（VLA）只能从训练演示中继承单一的固定速度。此前通过模型压缩、KV缓存复用或强化学习来加速VLA的工作，只是将策略从一种固定速度切换到另一种固定速度，而减速几乎未被探索。我们观察到，每个预测动作的幅度已经决定了机器人移动的快慢，这为可控的执行速度开辟了一条直接路径。我们将这一观察转化为TempoVLA，即一个由显式条件控制执行速度的单一VLA。TempoVLA结合了两个相互耦合的组件：（1）数据侧的可变速度轨迹增强（VSTA），通过合并或拆分动作，在保持运动语义的同时将演示重新定时到任意目标速度；（2）模型侧的条件机制，将速度输入给策略。统计结果显示，VSTA能以可忽略的运动误差达到所要求的速度。仿真和真实任务中的实验表明，TempoVLA在加速和减速两个方向上都能实现灵活的速度控制，而VSTA还通过更好的数据利用提升了默认 $1\times$ 速度下的性能。此外，通过与大型多模态模型配合，TempoVLA实现了动态速度控制，在低风险阶段加速，在高风险阶段减速。
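
The abstract above describes re-timing a demonstration by merging or splitting actions. One way to picture this, assuming actions are per-step displacement (delta) commands and the speed factor is an integer or its reciprocal, is the sketch below: merging k consecutive deltas gives a k-times faster trajectory, and splitting each delta into k equal parts gives a k-times slower one, with the total displacement preserved. This is an illustrative reading only; the function `retime_actions` and its restrictions are assumptions, not the paper's VSTA procedure.

```python
import numpy as np

def retime_actions(actions, speed):
    """Re-time a (T, D) array of delta actions.

    speed >= 1 (integer factor): sum every `speed` consecutive actions.
    speed < 1 (reciprocal of an integer): split each action into equal parts.
    """
    a = np.asarray(actions, dtype=float)
    if speed >= 1:
        k = int(round(speed))
        pad = (-len(a)) % k  # zero-pad so the length divides evenly
        a = np.vstack([a, np.zeros((pad, a.shape[1]))])
        return a.reshape(-1, k, a.shape[1]).sum(axis=1)
    k = int(round(1.0 / speed))
    return np.repeat(a / k, k, axis=0)

# Hypothetical 4-step, 2-D demonstration: move right twice, then up twice.
demo = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]])

fast = retime_actions(demo, speed=2)    # 2 steps, each covering twice the distance
slow = retime_actions(demo, speed=0.5)  # 8 steps, each covering half the distance
```

Summing `fast`, `slow`, and `demo` along the time axis gives the same end displacement, which is the sense in which this toy re-timing leaves the motion itself unchanged.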

Keyword: diffusion policy

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Authors: Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.05407
Pdf link: https://arxiv.org/pdf/2606.05407
Abstract This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and a point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the rest for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively.
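
As a rough illustration of reserving degrees of freedom, the sketch below assumes a 16-joint hand with four joints per finger and a hypothetical finger ordering. It builds a per-joint mask from the fingers that participate in the current grasp and holds all other joints at their current values. The abstract describes conditioning the diffusion policy on the opposition space rather than masking its output, so this is only a post-hoc analogy, not MoDex's mechanism.

```python
from typing import Dict, Iterable, List

# Assumed joint layout: four fingers with four joints each (16 joints in total).
# The finger names and index ranges are hypothetical, chosen for illustration.
FINGER_JOINTS: Dict[str, range] = {
    "index": range(0, 4),
    "middle": range(4, 8),
    "ring": range(8, 12),
    "thumb": range(12, 16),
}
NUM_JOINTS = 16


def joint_mask(active_fingers: Iterable[str]) -> List[bool]:
    """Return True for every joint that belongs to a participating finger."""
    mask = [False] * NUM_JOINTS
    for finger in active_fingers:
        for j in FINGER_JOINTS[finger]:
            mask[j] = True
    return mask


def masked_target(current: List[float], predicted: List[float],
                  active_fingers: Iterable[str]) -> List[float]:
    """Move only the participating fingers; keep all other joints where they are."""
    mask = joint_mask(active_fingers)
    return [p if m else c for c, p, m in zip(current, predicted, mask)]


current = [0.0] * NUM_JOINTS
predicted = [1.0] * NUM_JOINTS
target = masked_target(current, predicted, ["thumb", "index"])
```

Here the middle and ring fingers stay untouched, so they remain available for holding an earlier object or for a later grasp.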

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

Authors: Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.06049
Pdf link: https://arxiv.org/pdf/2606.06049
Abstract Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the precise control required for operations in these complex environments. However, without gravitational damping, objects exhibit unpredictable, unconstrained drift, which demands robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption than state-of-the-art robotic manipulation methods. These results indicate that L-SDPPO is a viable approach to intra-vehicular robotic manipulation.