Arxiv Papers of Today

Generated: 2026-06-29 20:36:03 (UTC+8); arXiv release time: 2026-06-29 20:00 EDT (2026-06-30 08:00 UTC+8)

28 relevant papers today

Keyword: reinforcement learning

OverFlowLight: Real-Time Gridlock Prevention and Traffic Signal Optimization for Urban Intersections

OverFlowLight：城市路口的实时交通堵塞预防与交通信号优化

Authors: Mingyuan Li, Boyang Huang, Tianqi Jiang, Chenpu Li, Chunyu Liu, Yang Li, Ruimin Li, Qiang Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27381
Pdf link: https://arxiv.org/pdf/2606.27381
Abstract Queue overflow, a severe consequence of urban traffic congestion, occurs when vehicle queues exceed intersection capacity, obstructing upstream traffic and triggering cascading gridlocks. Prevailing traffic signal control (TSC) algorithms, primarily optimized for throughput, often fail to address overflow during peak hours, exacerbating congestion and creating safety hazards. We propose OverFlowLight, a real-time framework designed to preemptively resolve overflow and enhance overall TSC performance. It first introduces a mechanism to accurately detect overflow in real-time by leveraging multi-modal sensing from cameras and radars. Upon detection, it dynamically generates and inserts dedicated overflow phases into the signal cycle to clear the blocking queues. This is orchestrated by a hybrid control design that combines rapid rule-based overflow intervention with controller back ends such as reinforcement learning (RL) for longer-horizon efficiency. We conducted extensive real-world deployments of OverFlowLight across 43 intersections in three major cities. The framework demonstrates seamless integration with existing RL-based TSC agents, highlighting its modularity and practical applicability. Empirical results show that OverFlowLight reduces overflow incidents by 60.4% and increases network throughput by 18.2% compared to deployed baselines. Furthermore, it substantially diminishes the need for manual intervention common with expert-tuned signal plans. This work presents the first practical, scalable, and data-driven framework for actively preventing traffic gridlock, offering a crucial component for building resilient and efficient urban transportation systems. Our demonstration videos, codes and datasets are available at the anonymous URL, this https URL.
中文摘要 排队溢出是城市交通拥堵的严重后果，当车辆排队超过路口容量时，阻碍上游交通并引发连锁堵塞。主流的交通信号控制（TSC）算法主要针对吞吐量优化，常常在高峰时段未能解决溢出问题，加剧拥堵并带来安全隐患。我们提出了OverFlowLight，一个实时框架，旨在预先解决溢出并提升整体TSC性能。它首先引入了一种通过利用摄像头和雷达多模态传感技术，实时准确检测溢出的机制。检测到后，它会动态生成并插入专用溢出相位，以清除阻断队列。这通过混合控制设计协调，结合了基于规则的快速溢出干预与控制器后端（如强化学习，RL）以实现更长的视野效率。我们在三个主要城市的43个路口进行了广泛的真实部署OverFlowLight。该框架展示了与现有基于强化学习的TSC代理的无缝集成，凸显了其模块化性和实用性。实证结果显示，OverFlowLight 相比部署基线减少了 60.4% 的溢出事件，并提高了 18.2% 的网络吞吐量。此外，它大大减少了专家调校信号计划中常见的人工干预需求。这项工作提出了首个实用、可扩展且数据驱动的框架，用于积极预防交通堵塞，为构建有韧性和高效的城市交通系统提供了关键组成部分。我们的演示视频、代码和数据集可在匿名网址 https URL 上获取。
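The hybrid control design in the abstract above (a fast rule-based overflow intervention layered over a longer-horizon controller such as RL) can be pictured with a minimal Python sketch. All names here are assumptions made for illustration, not the paper's interface: the occupancy threshold, the `overflow_clear:` phase label, and the `rl_policy` callable are hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical threshold: fraction of a link's storage capacity that is occupied.
OVERFLOW_THRESHOLD = 0.9


def detect_overflow(occupancy: Dict[str, float],
                    threshold: float = OVERFLOW_THRESHOLD) -> Optional[str]:
    """Return the most occupied link at or above the threshold, or None."""
    blocked = {link: occ for link, occ in occupancy.items() if occ >= threshold}
    if not blocked:
        return None
    return max(blocked, key=blocked.get)


def choose_phase(occupancy: Dict[str, float],
                 rl_policy: Callable[[Dict[str, float]], str]) -> str:
    """Rule-based override first; otherwise defer to the learned controller."""
    link = detect_overflow(occupancy)
    if link is not None:
        # Insert a dedicated phase that clears the blocking queue.
        return f"overflow_clear:{link}"
    return rl_policy(occupancy)
```

In this sketch the rule layer only preempts the learned controller while an overflow is detected, which is one simple way to keep the two components modular.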

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

支持受限的强化学习使得无需真实经验也能实现真实政策改进

Authors: Raymond Yu, William Huey, Mustafa Mukadam, Anusha Nagabandi, Abhishek Gupta
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27475
Pdf link: https://arxiv.org/pdf/2606.27475
Abstract Robots trained on real world data tend to be imprecise, slow, and brittle to perturbations. Improving these policies with reinforcement learning (RL) is an appealing alternative, but this process often requires expensive training in the real world. Performing policy improvement in simulation instead provides a far cheaper alternative, but unconstrained RL in simulation can exploit contact and dynamics mismatches, resulting in unsafe behaviors that do not transfer to hardware. Common forms of regularization can furthermore limit improvement by overconstraining to an imperfect behavior prior. In this work, we propose Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that constrains RL in simulation to the support of a generative policy pretrained on real data. We instantiate this constraint through flow steering, restricting SCORE to actions the base policy can already produce, which ensures transferable behaviors while maximizing policy improvement. Improving a policy with SCORE requires minimal effort: it learns from sparse rewards, avoids distillation, and leaves the base policy untouched. Across eight real-world dexterous multi-fingered robotic manipulation tasks, SCORE improves average success rate from 37.8% to 89.9%, compared to 59.5% for the best baseline, and reaches success in 36.8% fewer steps than the base policy. Ultimately, through extensive experiments and ablations, we show that simulation can substantially improve real-world manipulation policies when policy optimization is appropriately constrained, introducing a new paradigm for real-to-sim-to-real policy improvement. Videos and code are available at this https URL.
中文摘要 基于真实世界数据训练的机器人往往不精确、缓慢且易受扰动影响。通过强化学习（RL）改进这些策略是一个有吸引力的替代方案，但这一过程通常需要在现实世界中进行昂贵的培训。在仿真中进行策略改进则提供了更低成本的替代方案，但无约束的RL模拟可以利用接触和动力学不匹配，导致不安全的行为，这些行为无法转移到硬件上。常见的正则化形式还可能通过过度限制先验行为来限制改善。在本研究中，我们提出了支持约束域外强化（SCORE），这是一种从现实到模拟再到现实的框架，它将模拟中的强化学习限制在基于真实数据的生成策略上。我们通过流引导实现这一约束，将SCORE限制在基础策略已能产生的动作上，确保行为可转移，同时最大化策略改进。用SCORE改进策略所需的努力最小：它从稀疏的奖励中学习，避免提炼，并且保持基础策略不变。在八个真实世界的灵巧多指机器人操作任务中，SCORE将平均成功率从37.8%提升到89.9%，而最佳基线为59.5%，并且比基础策略少了36.8%的步骤。最终，通过广泛的实验和消融，我们证明了当策略优化受到适当约束时，仿真能够显著改善现实世界的操作策略，为真实到模拟到真实的策略改进引入了新的范式。视频和代码可在此 https URL 下载。

Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning

内化未来：世界模型规划的统一代理训练范式

Authors: Xuan Zhang, Zhijian Zhou, Lingfeng Qiao, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27483
Pdf link: https://arxiv.org/pdf/2606.27483
Abstract Large language model (LLM) agents have demonstrated strong capability in sequential decision-making, yet they remain fundamentally reactive in long-horizon tasks. Unlike humans who employ "what-if" reasoning to evaluate potential plans before commitment, standard agents lack an internal world model to simulate future outcomes. Therefore, we propose to internalize future-aware planning by training a single autoregressive model to verbalize both a prospective state rollout and a plan-conditioned success estimate, a textual analogue of the Q-value. Crucially, we identify a format-capability gap: simply fine-tuning agents on look-ahead traces during post-training leads to superficial mimicry of foresight without genuine predictive grounding. To bridge this gap, we introduce a three-stage training paradigm: (i) World Model Agentic Mid-Training (WM-AMT) to inject latent predictive capabilities into the policy; (ii) Format-Eliciting SFT (FE-SFT) to structure this injected capability; and (iii) Foresight-Conditioned Reinforcement Learning (FC-RL) to refine the calibration and utility of the generated simulations. Evaluated on search and mathematical reasoning tasks, our approach consistently outperforms other training baselines. Our results demonstrate that effective internal world modeling in LLM agents requires a capability-first training pipeline to achieve grounded and calibrated foresight.
中文摘要 大型语言模型（LLM）代理在顺序决策方面表现出强大的能力，但在长期任务中仍然表现得很被动。与人类通过“如果”推理来评估潜在计划在承诺前进行不同，标准智能体缺乏内在世界模型来模拟未来结果。因此，我们提议通过训练单一自回归模型来内化未来意识规划，同时表达预期状态推广和计划条件的成功估计——即Q值的文本类比。关键是，我们识别出格式-能力差距：仅仅在培训后对前瞻痕迹进行微调，导致缺乏真正预测基础的表面预见模仿。为弥合这一空白，我们引入了三阶段训练范式：（i）世界模型代理中期训练（WM-AMT），将潜在预测能力注入政策中;（ii）格式诱导SFT（FE-SFT）以构建注入能力;以及（iii）前瞻性条件强化学习（FC-RL），以优化生成模拟的校准和实用性。在搜索和数学推理任务中评估，我们的方法始终优于其他训练基线。我们的结果表明，LLM代理中的有效内部世界建模需要以能力为先的培训流程，以实现扎实且校准的前瞻性。

PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration

PEBS：按评级者经验-贝叶斯收缩用于RLHF奖励模型校准

Authors: Arnav Raj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27578
Pdf link: https://arxiv.org/pdf/2606.27578
Abstract Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slopes into a single average-rater fit that does not match any individual annotator. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits per-rater affine calibrators on a held-out slice of each annotator's ratings and applies Morris-James-Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by 8.58% over the pooled population-slope baseline. The procedure replicates on PluriHarms harm ratings (Qwen-2.5 base, in-family) with a +9.66% RMSE reduction over the same population-slope baseline. PEBS is a closed-form post-hoc estimator for annotator-specific affine calibration in RLHF reward modeling; it leaves the reward base model unchanged and estimates only the rater-level map used at inference time for new ratings.
中文摘要 来自人类反馈的强化学习（RLHF）奖励模型将偏好池化成千上万的标注者，拟合一个全局仿射校准器，将系统性不同的评分量表偏移和斜率的评分者合并为一个不匹配任何单个标注者的平均评分者拟合。PEBS是一种按评分者经验贝叶斯缩减估计器：它在每个标注者的评分中保留的切片上拟合按评分者仿射校准器，并以闭合形式且未重新训练奖励模型，将Morris-James-Stein经验贝叶斯缩小应用到总体均值。在PRISM上，PEBS将用户内部未被保留的RMSE比合并人口斜率基线降低8.58%。该程序在PluriHarms伤害评级（Qwen-2.5基础，家族内）上重复出现，RMSE在相同人口坡基线下减少+9.66%。PEBS是一种闭式后置估计器，用于RLHF奖励建模中针对标注者的仿射校准;它保持奖励基础模型不变，只估计推断时用于新评级的评级者层级映射。
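To make the idea of shrinking per-rater calibrators toward the population mean concrete, here is a minimal Python sketch of a textbook normal-normal empirical-Bayes shrinkage rule applied to per-rater slope estimates. It is an assumption-laden illustration, not the paper's estimator: the method-of-moments variance estimate, the function name `shrink_toward_mean`, and the inputs (one slope and one sampling variance per rater) are all hypothetical choices.

```python
from statistics import mean, variance
from typing import List


def shrink_toward_mean(estimates: List[float],
                       sampling_vars: List[float]) -> List[float]:
    """Shrink noisy per-rater estimates toward their population mean.

    estimates:     one raw estimate per rater (e.g. a calibration slope)
    sampling_vars: the sampling variance of each raw estimate
    """
    mu = mean(estimates)
    # Method-of-moments estimate of the between-rater variance, floored at 0.
    tau2 = max(0.0, variance(estimates) - mean(sampling_vars))
    shrunk = []
    for b, v in zip(estimates, sampling_vars):
        # Weight on the rater's own estimate: near 1 when it is precise,
        # near 0 when it is noisy relative to the between-rater spread.
        weight = tau2 / (tau2 + v) if (tau2 + v) > 0 else 0.0
        shrunk.append(mu + weight * (b - mu))
    return shrunk
```

The rule is closed form and touches only the rater-level map, so under these assumptions it can be applied after the reward model has been trained, without retraining it.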

Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF

追溯优势纠正：针对延迟感知RLHF的封闭形式V-trace偏置校正

Authors: Arnav Raj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27580
Pdf link: https://arxiv.org/pdf/2606.27580
Abstract Reinforcement learning from human feedback (RLHF) in production does not always have a synchronous reward signal. Code-execution verifiers, slow judge ensembles, and queued human review can return several gradient steps after the rollout that produced them, breaking the synchronous-reward assumption underlying standard PPO. We address this gap with Retroactive Advantage Correction (RAC): each pending slow completion is queued, aged through a non-negative kernel, and reinjected as a clipped residual into the next optimiser step's advantage. We prove that under an unbiased clipped importance ratio, the cumulative RAC correction is exactly unbiased when the effective delay kernel reinjects all of its mass, and carries a bias linear in the unreinjected fraction otherwise; at the no-delay identity kernel it reduces to V-trace. On a tabular Markov decision process (MDP) proof-of-concept, RAC reduces the closed-form policy bias by up to 47.9x at the two-slow-channel configuration, beating wait-for-slow at lower wall-clock cost. RAC integrates with PPO and GRPO through a two-line reward-manager patch.
中文摘要 制作中的人工反馈强化学习（RLHF）并不总是具有同步的奖励信号。代码执行验证器、慢速裁判集合和排队人工审核可以在发布后返回多个梯度步骤，打破了标准PPO背后的同步奖励假设。我们用追溯优势修正（RAC）来弥补这一空白：每个待处理的慢完成都会被排队，经过非负核，并以截断残差的形式重新注入到下一个优化步骤的优势中。我们证明，在无偏截断重要性比下，当有效延迟核重新注入全部质量时，累积RAC修正完全无偏，否则在未重新注入分数中带有线性偏置;在无延迟恒核处，它简化为V-迹。在表格式马尔可夫决策过程（MDP）概念验证中，RAC在双慢信道配置下将封闭式策略偏差减少了最多47.9倍，远胜慢速等待，且成本更低。RAC通过两线奖励管理器补丁与PPO和GRPO集成。

Learning to Throw: Agile and Accurate Cable-Suspended Payload Delivery with a Quadrotor

学习投掷：灵活且精准地用四旋翼悬挂电缆投放有效载荷

Authors: Yifan Zhai, Elia Raimondi, Yunfan Ren, Ismail Geles, Yannick Armati, Jiaxu Xing, Davide Scaramuzza
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27603
Pdf link: https://arxiv.org/pdf/2606.27603
Abstract Quadrotors offer the agility needed to rapidly transport suspended payloads during time-critical applications, including search-and-rescue and medical delivery. While suspended-payload transport and traversal for these missions are well studied, the highly dynamic targeted release of the payload remains comparatively underexplored. State-of-the-art approaches typically rely on model-based trajectory optimization and tracking; however, these methods often yield sub-optimal performance due to conservative feasibility constraints, tracking errors, and the inherent difficulty of analytically modeling flexible rope dynamics. To overcome these limitations, we propose a hybrid simulation framework that couples a high-fidelity analytical quadrotor model with a physics solver for complex rope and payload interactions. By exchanging forces between the two domains at every step, we obtain a physically accurate simulation of the suspended-payload system. Leveraging this environment, we train a deep reinforcement learning (RL) policy that executes agile, accurate payload throws to designated targets. Deployed zero-shot on hardware, our RL policy pushes the boundary of the agility-accuracy trade-off, outperforming the model-based baseline by reducing the landing error by up to 50% and the throw duration by up to 30%. Ablation studies confirm that the coupled simulation is the key enabler of these gains. We further show that the same pipeline trains a policy driven by visual observations rather than an explicit state estimate, achieving accuracy comparable to that of the state-based policy. To accelerate future research in dynamic aerial manipulation, we open-source the simulator to the community upon acceptance.
中文摘要 四旋翼具备在时间关键应用中快速运输悬挂有效载荷所需的敏捷性，包括搜救和医疗配送。虽然这些任务的悬浮有效载荷运输和穿越已被深入研究，但高度动态的目标释放有效载荷仍相对较少被充分探索。最先进的方法通常依赖基于模型的轨迹优化和跟踪;然而，由于保守的可行性约束、跟踪误差以及对柔性绳动力学进行解析建模的固有难度，这些方法常常导致性能不理想。为克服这些限制，我们提出了一种混合仿真框架，将高精度解析四旋翼模型与复杂的绳索和有效载荷相互作用物理求解器结合起来。通过在每个步骤交换两个域之间的力，我们获得了悬挂有效载荷系统的物理精确模拟。利用这一环境，我们训练深度强化学习（RL）策略，执行敏捷且准确的有效载荷投掷到指定目标。在硬件上部署零发射，我们的强化学习策略突破了敏捷性与准确性的权衡，超越基于模型的基线，降低了最多50%的着陆误差和最多30%的投掷时间。消融研究证实，耦合仿真是实现这些收益的关键因素。我们还进一步表明，同一流程训练的政策基于视觉观察而非明确的州估计，其准确性与基于州的政策相当。为了加速未来动态空中操控的研究，我们将模拟器在被接受后向社区开源。

Qwen-Image-2.0-RL Technical Report

Qwen-Image-2.0-RL 技术报告

Authors: Yixian Xu, Kaiyuan Gao, Yuxiang Chen, Yilei Chen, Zecheng Tang, Zihao Liu, Zikai Zhou, Deqing Li, Hao Meng, Kuan Cao, Jiahao Li, Jie Zhang, Liang Peng, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yi Wang, Yu Wu, Yujia Wu, Zekai Zhang, Zhendong Wang, Xiao Xu, Kun Yan, Chenfei Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27608
Pdf link: https://arxiv.org/pdf/2606.27608
Abstract We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image generation, the reward models cover alignment, aesthetics, and portrait fidelity dimensions. For image editing tasks, the reward system addresses instruction-following accuracy and face identity preservation. Building on this reward system, we develop a scalable GRPO-based RL training framework, incorporating a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, prompt curation via intra-group reward range filtering, and per-category reward weight calibration. To merge the task-specialized RL policies for T2I and editing, we propose on-policy distillation as the final training stage, which consolidates multiple teachers into a single student model through trajectory-level velocity matching. Extensive evaluation shows that Qwen-Image-2.0-RL achieves 57.84 overall score on Qwen-Image-Bench (+2.61 over the base model), Elo ratings of 1193 in text-to-image arena (+78) and 1349 in image edit arena (+93), demonstrating consistent gains in aesthetic quality, prompt adherence, and editing accuracy.
中文摘要 我们介绍Qwen-Image-2.0-RL，这是一个训练后流程，应用人类反馈强化学习（RLHF）和策略中提炼（OPD），以提升Qwen-Image-2.0扩散模型的视觉质量和指令跟随能力。为了提供可靠的奖励信号，我们通过微调基于点的评分范式和思维链推理的视觉语言模型，构建了针对任务的复合奖励模型。在文本转图像生成方面，奖励模型涵盖对齐、美学和肖像真实度维度。对于图像编辑任务，奖励系统关注指令遵循准确性和面部身份保护。基于该奖励系统，我们开发了可扩展的基于GRPO的强化学习训练框架，采用混合无分类器指导（CFG）策略以保留预训练知识，通过组内奖励范围过滤进行提示整理，并实现各类别奖励权重校准。为了整合针对T2I和编辑的任务专用强化学习策略，我们提出以策略精炼作为最终培训阶段，通过轨迹级速度匹配将多位教师整合到单一学生模型中。全面评估显示，Qwen-Image-2.0-RL 在 Qwen-Image-Bench 上整体得分为 57.84（基础模型为 +2.61），在文本转图像领域中获得 1193 分（+78）和图像编辑领域 1349 分（+93），展现出在美学质量、及时遵循和编辑准确性方面的持续提升。

Training Observable Control Policies to Expose Agent State Through Actions

训练可观察控制策略，通过动作暴露代理状态

Authors: Andres Enriquez Fernandez, John J. Bird
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.27609
Pdf link: https://arxiv.org/pdf/2606.27609
Abstract Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. Even when strong communications are absent, some information may still be available. The remainder of the relevant agent state may be reconstructed via estimation. The actions taken by an agent are a potential source of information -- as the agent interacts with the environment, these actions may be observed even in the absence of explicit communication. We investigate using actions to estimate the state of an agent, using reinforcement learning to develop policies which make the estimation problem more tractable. Policy observability is encouraged through the training reward and is analyzed using simulation of the trained agent. In an aircraft tracking problem a policy with enhanced observability is found that has minimal impact on nominal task performance.
中文摘要 物理或操作限制常常对自主智能体施加通信限制。这些限制使监控或多智能体协调变得复杂。即使缺乏强有力的沟通，一些信息仍然可能存在。相关代理状态的其余部分可以通过估计重建。智能体所采取的行为是潜在的信息来源——当智能体与环境互动时，即使没有明确的交流，这些行为也可能被观察到。我们研究使用动作估计代理状态，利用强化学习制定使估计问题更易处理的策略。通过训练奖励鼓励策略可观察性，并通过对训练代理人的模拟进行分析。在飞机跟踪问题中，发现一种可增强可观测性的策略对名义任务性能影响最小。

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

Yuvion LLM：一个具攻击性意识的大型语言模型，用于内容与人工智能安全

Authors: Ting Ma, Xiufeng Huang, Benlei Cui, Xiaowen Xu, Shikai Qiu, Ruijie Jian, Hongxing Li, Guanghui Wang, Longtao Huang, Haiwen Hong, Haolei Xu, Wenjing Jiang, Ziwen Xu, Zhaoyu Fan, Shaoxuan He, Chuxi Xiao, Yujian Li, Xinyue Chen, Chunyang Chai, Wenxuan Liu, Ziheng Wang, Dongjie Zhang, Yangfan Zhou, Libin Dong, Yupeng Cao, Xiaoqian Xia, Jing Wang, Zhe Jiang, Zhenan Ye, Guang Yang, Bin Liu, Wei Peng, Ziqiang Zhu, Meihui Lian, Kaiwen Lv Kacuila, Haidong Ding, Bingyu Zhu, Yan Wang, Hai Zhao, Xuan Jin, Wei Zhao, Pengfei Sun, Wei Wang, Huiming Zhang, Bin Li, Hui Xue
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.27632
Pdf link: https://arxiv.org/pdf/2606.27632
Abstract As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from natural inputs alone, but from strategic attempts to evade model policies and safeguards. However, existing general-purpose model development largely overlooks this adversarial nature, and often remains insufficient for realistic safety scenarios involving planning, tool use, and multi-step reasoning, causing measured safety performance to overestimate real deployment robustness. To address this gap, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training, including risk-aware supervised fine-tuning and reinforcement learning-based policy optimization, together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning in complex safety scenarios. We further introduce the Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories, covering diverse open and internal evaluations with a focus on safety, adversarial robustness, and real-world capability requirements. Across these evaluations, Yuvion LLM demonstrates clear advantages on safety-focused benchmarks and particularly strong robustness under adversarial conditions, while maintaining solid overall capability. Notably, Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.
中文摘要 随着大型语言模型在现实系统中日益广泛应用，安全失效仍可能导致有害输出和危险的滥用。我们认为安全的本质是对抗性的：许多失败不仅源于自然输入，还源于规避模式政策和保障措施的战略性尝试。然而，现有的通用模型开发大多忽视了这种对抗性质，且常常不足以适应涉及规划、工具使用和多步推理的现实安全情景，导致测量的安全性能高估了实际部署的稳健性。为弥补这一空白，我们介绍了Yuvion LLM，一个大型语言模型，旨在增强对抗性内容安全和更广泛的人工智能安全。Yuvion LLM 将对抗性鲁棒性和代理能力视为一流目标。其流程结合了对抗意识数据构建、知识增强的持续预训练和基于策略的多任务安全后培训，包括风险感知的监督微调和基于强化学习的策略优化，以及用于工具使用和安全和复杂安全场景中多步推理的安全意识智能体强化学习。我们还进一步介绍了Yuvion LLM风险评估（YLRE），这是一项涵盖四个评估类别的93个基准测试，涵盖多样的开放和内部评估，重点关注安全性、对抗性鲁棒性和实际能力需求。在这些评估中，Yuvion LLM在安全导向基准测试中展现出明显优势，尤其是在对抗条件下表现尤为强健，同时保持了稳健的整体能力。值得注意的是，Yuvion-8B在多项安全任务中优于大多数最先进的基线模型，包括GPT-5.4和Qwen3-MAX等更大型号。

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

MER-R1：通过慢快思维协同进行多模态情绪推理

Authors: Zhiyuan Han, Beier Zhu, Wenwen Tong, Chengwei Qin, Xinyi Wang, Jiayu Zhang, Jiangnan Chen, Hewei Guo, Dongchuan Ran, Lewei Lu, Xun Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27652
Pdf link: https://arxiv.org/pdf/2606.27652
Abstract We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.
中文摘要 我们发现，显式推理不一定能转化为更好的多模态情绪识别（MER）准确性，尽管它使预测更具可解性。具体来说，对于基于推理的MLLMs，快速思考通过触发直接回答往往优于经过深思熟虑后慢思考。我们的实证分析显示，快速思考通过更广泛、更自信的预测提升回忆能力，而慢思考则通过保守过滤错误类别来提升准确性。基于这些见解，我们提出了MER-R1，一种将慢快互补转化为显式优化的强化学习框架。双目标解缠将召回和精确度分离为两种优化信号，使它们能够共同优化，而非相互权衡。慢快信心校准进一步将最终的慢思考答案与快速思考的直觉对齐，强化正确的情绪，同时抑制错误的情绪。通过这种方式，MER-R1将快速思考的回忆导向直觉与慢思考的精确导向选择性相结合。我们还进一步为这种协同提供了理论依据，表明它在优化过程中减轻了方差引起的干扰。在MER-UniBench和MME-Emotion上的广泛实验表明，MER-R1实现了最先进的性能，并使推理真正受益于情感识别。
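The recall/precision split discussed in the abstract above can be illustrated with a small Python sketch that scores a predicted emotion set against a reference set and returns the two quantities as separate signals. This is only an assumed illustration: the abstract does not give the reward definitions, and the function name and the set-based scoring are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[str],
                     gold: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a predicted label set against a gold set."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# A broad, confident guess covers every gold label but includes a wrong one.
broad = precision_recall(["joy", "surprise", "anger"], ["joy", "surprise"])
# A conservative guess contains no wrong label but misses one gold label.
narrow = precision_recall(["joy"], ["joy", "surprise"])
```

Here `broad` has perfect recall and lower precision, while `narrow` has perfect precision and lower recall, which mirrors the fast-thinking versus slow-thinking pattern the abstract reports.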

Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

世界模型中的文本信念状态：严格中介下的可识别表征学习

Authors: Xiang Gao, Kaiwen Dong, Yuguang Yao, Padmaja Jonnalagedda, Kamalika Das
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.27681
Pdf link: https://arxiv.org/pdf/2606.27681
Abstract World models in partially observed environments rely on latent representations that summarize interaction history, but in many modern LLM-based architectures predictive performance fails to reflect representation quality due to history bypass, rendering the latent state unidentifiable. Strict latent state mediation, requiring predictions to depend only on the latent state and action, is a classical principle that resolves this, but enforcing it in text-based settings is an open challenge: textual latent states are discrete and non-differentiable, precluding variational training, and expressive LLM decoders readily ignore the bottleneck. We show how to make strict mediation work in the text domain. We formalize why it is necessary, showing that strict mediation makes representation quality empirically testable while history-leaky architectures break this connection. We then introduce textual latent states, which are discrete, interpretable, and variable-length, and factorized GRPO (fGRPO), a tree-structured reinforcement learning method that enforces strict mediation during training. Experiments on TextWorld and ScienceWorld show preserved one-step prediction accuracy alongside up to 57% gains in representation quality and 98% improvements in rollout performance, increasing with task complexity and horizon.
中文摘要 部分观测环境中的世界模型依赖于总结交互历史的潜在表示，但在许多现代基于大型语言模型的架构中，由于历史绕过，预测性能无法反映表示质量，使潜态无法识别。严格的潜态中介要求预测仅依赖于潜态和作用，是解决这一问题的经典原则，但在基于文本的环境中执行则是个挑战：文本潜在状态是离散且不可微的，排除了变分训练，而表达式LLM解码器则轻易忽视瓶颈。我们展示了如何在文本领域使严格调解发挥作用。我们形式化了其必要性，表明严格的中介使得表示质量可通过实证检验，而历史泄漏的架构则破坏了这一联系。随后，我们介绍了文本潜在状态，这些状态是离散的、可解释的、可变长度的，以及分解GRPO（fGRPO），这是一种树结构强化学习方法，在训练过程中强制执行严格的中介。TextWorld和ScienceWorld上的实验显示，单步预测准确度保持得以保留，同时表示质量提升高达57%，部署性能提升98%，随着任务复杂度和视野的增加而提升。

Learning to Reason with Curriculum II: Compositional Generalization

通过课程学习推理 II：作曲推广

Authors: Nived Rajaraman, Audrey Huang, Miroslav Dudik, Robert Schapire, Dylan Foster, Akshay Krishnamurthy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27721
Pdf link: https://arxiv.org/pdf/2606.27721
Abstract Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain-of-thought reasoning. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata (predicting the outcome of $T$ steps of sequential computation), a model that captures state tracking, regular language recognition, and modular arithmetic. We show that an autocurriculum-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods. (i) For a setting inspired by supervised fine-tuning (SFT) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^{\mathcal{O}(\sqrt{\log T})}$ tokens of supervision; i.e., subpolynomial in the sequence length $T$, overcoming the $\Omega(T)$ token barrier required by direct simulation. (ii) For a setting inspired by reinforcement learning with verifiable rewards (RLVR), where the learner improves a pre-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \ll T$, an exponentially weaker condition.
中文摘要 组合推广，即通过将解法结合到更简单子问题来解决复杂问题的能力，是自然智能和人工智能的基本能力，也是思维链推理背后的关键机制。然而，合成泛化的理论基础仍然不充分：何时以及为何将问题分解为部分比直接求解更有效地学习？我们通过学习模拟半自动机（预测$T$步骤序列计算的结果）这一典型问题来研究这个问题，该模型涵盖状态追踪、正规语言识别和模运算。我们证明，基于本系列第一部分的自教法方法，递归地将较长的序列分解为更短的子问题，学习求解并组合解答，能够显著提升统计复杂度，优于直接方法。（i）对于受监督微调（SFT）启发的环境，学习者在计算中间状态上获得互动反馈，课程仅能从$2^{\mathcal{O}（\sqrt{\log T}）}$代币中进行学习;即序列长度$T$的子多项式，克服了直接仿真所需的$\Omega（T）$令牌障碍。（ii）在受可验证奖励强化学习（RLVR）启发的环境中，学习者使用结果验证器改进预训练的参考模型，我们证明课程将参考模型在全序列长度$T$的覆盖率从降低为在更短块长度$B\ll T$的覆盖率，这一指数级弱条件。
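A short derivation shows why a bound of the form $2^{\mathcal{O}(\sqrt{\log T})}$ is subpolynomial in $T$. The constant $c > 0$ below is an assumed stand-in for the constant hidden in the $\mathcal{O}(\cdot)$ notation, and logarithms are taken base 2; neither is specified in the abstract.

```latex
% Rewrite the bound as a power of T whose exponent vanishes as T grows.
\[
2^{c\sqrt{\log T}}
  = 2^{\log T \cdot \frac{c}{\sqrt{\log T}}}
  = T^{\,c/\sqrt{\log T}} .
\]
% For any fixed \epsilon > 0, the exponent c/\sqrt{\log T} drops below \epsilon
% as soon as \log T > (c/\epsilon)^2, so the bound is eventually below T^\epsilon.
\[
2^{c\sqrt{\log T}} < T^{\epsilon}
  \quad \text{for all } T > 2^{(c/\epsilon)^{2}} .
\]
```

Since this holds for every $\epsilon > 0$, the quantity grows more slowly than any fixed power of $T$, and in particular more slowly than the $\Omega(T)$ barrier quoted for direct simulation.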

BashCoder-R1: Towards Robust and Explainable Bash Code Generation with Robustness-Aware Group Relative Policy Optimization

BashCoder-R1：迈向具有鲁棒性感知群相对策略优化的稳健且可解释的Bash代码生成

Authors: Lei Yu, Peng Wang, Jia Xu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Changzhi Deng, Zenghua Wang, Fengjun Zhang
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.27733
Pdf link: https://arxiv.org/pdf/2606.27733
Abstract Bash scripts are the cornerstone of system administration and DevOps automation, where code quality directly impacts system stability and security. In automated Bash script generation using Large Language Models (LLMs), two interconnected failures emerge: unauditable "black box" reasoning and critical robustness vulnerabilities in generated code. To address both, we propose BashCoder-R1, a novel framework for robust and explainable Bash script generation. Our pipeline combines: (1) Continual Pre-training (CPT) to specialize the model on Bash paradigms; (2) Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on expert-validated reasoning-and-code samples to emulate proactive risk-aware thinking; and (3) Robustness-Aware Group Relative Policy Optimization (R-GRPO), a reinforcement learning phase optimizing a weighted reward for syntax correctness, robustness (via shellcheck), and format correctness. We evaluate on BashBench, a new benchmark of 952 real-world tasks (773 single-line, 179 multi-line). BashCoder-R1 achieves SyntaxPass (100.00%/94.97%), RobustWarnRate (4.01%/16.47%), RobustPass (95.99%/79.33%), FuncRate (93.01%/93.85%), and FullRate (90.04%/73.18%) for single-line/multi-line tasks, outperforming the strongest baseline DeepSeek-V3.2 (Reasoning) by 37.82% and 20.18% in FullRate. Human evaluation on Functionality, Robustness, and Clarity further confirms BashCoder-R1 achieves the highest quality ratings.
中文摘要 Bash脚本是系统管理和DevOps自动化的基石，代码质量直接影响系统的稳定性和安全。在使用大型语言模型（LLMs）自动生成Bash脚本时，出现了两种相互关联的失败：不可审计的“黑箱”推理和生成代码中的关键鲁棒性漏洞。为应对这两点，我们提出了BashCoder-R1，一个用于稳健且可解释的Bash脚本生成的新框架。我们的流程结合了：（1）持续预训练（CPT），以专注于Bash范式的模型;（2）对专家验证的推理和代码样本进行长链思考监督微调（L-CoT SFT），以模拟主动的风险意识思维;以及（3）鲁棒性感知群相对策略优化（R-GRPO），这是一个强化学习阶段，旨在优化语法正确性、鲁棒性（通过shellcheck）和格式正确性的加权奖励。我们在BashBench上进行了评估，这是一个包含952个真实世界任务的新基准测试（773个单线任务，179个多线任务）。BashCoder-R1在单行/多行任务中实现了SyntaxPass（100.00%/94.97%）、RobustWarnRate（4.01%/16.47%）、RobustPass（95.99%/79.33%）、FuncRate（93.01%/93.85%）和FullRate（90.04%/73.18%），在FullRate中表现优于最强的基线DeepSeek-V3.2（推理）37.82%，在FullRate中优于20.18%。对功能性、鲁棒性和清晰度的人工评估进一步确认 BashCoder-R1 获得了最高质量评级。
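The reward shaping in stage (3) of the abstract above can be sketched in Python as a weighted sum of syntax, robustness, and format checks, followed by the group-relative normalization commonly used in GRPO-style training. The weights, the all-or-nothing robustness score, and the function names are assumptions for illustration; the abstract does not state the actual weighting, and the checks are passed in as precomputed values rather than obtained by running shellcheck.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights; the abstract does not give the real values.
WEIGHTS: Dict[str, float] = {"syntax": 0.4, "robust": 0.4, "format": 0.2}


def weighted_reward(syntax_ok: bool, shellcheck_warnings: int, format_ok: bool,
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine three checks on one generated script into a scalar reward."""
    # Assumed scoring: full robustness credit only when no warnings are reported.
    robust_score = 1.0 if shellcheck_warnings == 0 else 0.0
    return (weights["syntax"] * float(syntax_ok)
            + weights["robust"] * robust_score
            + weights["format"] * float(format_ok))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this normalization, a group whose samples all receive the same reward yields zero advantage for every sample, so only prompts with differing outcomes contribute a learning signal.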

ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation

ToE：一个具有动态多来源证据检索与聚合的层级且可解释的索赔验证框架

Authors: Zhaoqi Wang, Zijian Zhang, Kun Zheng, Zhen Li, Xin Li, Chunlei Li, Jiamou Liu
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2606.27736
Pdf link: https://arxiv.org/pdf/2606.27736
Abstract The rapid spread of fake news poses increasing threats to information ecosystems, especially as AI-generated misinformation under Generative Engine Optimization (GEO) poisoning allows adversarially crafted content to be systematically surfaced by retrieval systems, contaminating LLM reasoning. In this paper, we propose Tree of Evidence (ToE), a hierarchical evidence reasoning framework for automated fact-checking that models each claim as a dynamically expanding argument tree. ToE integrates a reinforcement learning-driven multi-source retrieval agent, an evidence evaluation agent, and an argument tree aggregation algorithm to iteratively decompose, retrieve, and verify claims through an explainable evidence chain. We further provide a theoretical analysis of the retrieval process, deriving a formal error bound that guarantees the learned policy converges to a neighborhood of the information-theoretically optimal policy. Experiments across multiple datasets and backbone LLMs demonstrate that ToE achieves improvements ranging from 4 to 24 percentage points over competitive baselines, with particularly pronounced gains on adversarially poisoned inputs.
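Chinese abstract (translated) The rapid spread of fake news poses a growing threat to information ecosystems, especially because AI-generated misinformation under Generative Engine Optimization (GEO) poisoning lets adversarially crafted content be systematically surfaced by retrieval systems, contaminating LLM reasoning. We propose Tree of Evidence (ToE), a hierarchical evidence reasoning framework for automated fact-checking that models each claim as a dynamically expanding argument tree. ToE integrates a reinforcement-learning-driven multi-source retrieval agent, an evidence evaluation agent, and an argument tree aggregation algorithm to iteratively decompose, retrieve, and verify claims through an explainable evidence chain. We also analyze the retrieval process theoretically, deriving a formal error bound that guarantees the learned policy converges to a neighborhood of the information-theoretically optimal policy. Experiments across multiple datasets and backbone LLMs show that ToE improves on competitive baselines by 4 to 24 percentage points, with especially pronounced gains on adversarially poisoned inputs.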

PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction

Authors: Dongxia Wu, Mingyu Li, Yuhui Zhang, Anurendra Kumar, Emma Lundberg, Serena Yeung-Levy, Emily B. Fox
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27752
Pdf link: https://arxiv.org/pdf/2606.27752
Abstract Single-cell perturbation models can reduce costly wet-lab screening by predicting how cells respond transcriptionally to interventions. While recent generative models improve population-level prediction, individual generated cells are not explicitly checked for biological consistency. We introduce PerturbCellRL, a reinforcement learning (RL) framework that post-trains a pretrained single-cell transcriptomic generator using a suite of cell-level verifiers as rewards. These verifiers define four rewards: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity. The Pathway activity verifier rewards cells whose pathway responses match known perturbation biology. We evaluate PerturbCellRL on multiple genetic and chemical perturbation benchmarks. Across these benchmarks, PerturbCellRL improves over the pretrained flow-matching generator on reward-aligned evaluation metrics and a held-out evaluation metric. Moreover, PerturbCellRL remains competitive with state-of-the-art methods on population-level metrics. Together, these results frame trustworthy single-cell prediction as verifier-guided generative alignment, moving beyond matching expression distributions toward predictions whose single-cell perturbation effects are explicitly checked for biological consistency.
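Chinese abstract (translated) Single-cell perturbation models can reduce costly wet-lab screening by predicting cells' transcriptional responses to interventions. Although recent generative models have improved population-level prediction, individual generated cells are not explicitly checked for biological consistency. We introduce PerturbCellRL, a reinforcement learning (RL) framework that post-trains a pretrained single-cell transcriptomic generator, using a suite of cell-level verifiers as rewards. The verifiers define four rewards: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity. The Pathway activity verifier rewards cells whose pathway responses match known perturbation biology. We evaluate PerturbCellRL on multiple genetic and chemical perturbation benchmarks, where it outperforms the pretrained flow-matching generator on the reward-aligned evaluation metrics and on a held-out evaluation metric. PerturbCellRL also remains competitive with state-of-the-art methods on population-level metrics. Together, these results frame trustworthy single-cell prediction as verifier-guided generative alignment, moving beyond matching expression distributions toward predictions whose single-cell perturbation effects are explicitly checked for biological consistency.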
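As a rough illustration of a cell-level verifier such as the "Pearson top-k similarity" reward named above, the sketch below correlates predicted and observed expression changes over the k genes with the largest observed change. The abstract does not define the reward, so the top-k selection rule and the function names here are assumptions.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def pearson_topk_reward(pred_delta: list[float], true_delta: list[float], k: int) -> float:
    """Correlate predicted and true expression changes on the top-k genes.

    Assumption: "top-k" means the k genes with the largest absolute true change.
    """
    order = sorted(range(len(true_delta)), key=lambda i: abs(true_delta[i]), reverse=True)
    top = order[:k]
    return pearson([pred_delta[i] for i in top], [true_delta[i] for i in top])


true_delta = [3.0, -2.0, 0.1, 0.0, 1.0]
reward_good = pearson_topk_reward([2 * d for d in true_delta], true_delta, k=3)
reward_bad = pearson_topk_reward([-d for d in true_delta], true_delta, k=3)
```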

RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance

Authors: Shiqiang Gong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27766
Pdf link: https://arxiv.org/pdf/2606.27766
Abstract Offline reinforcement learning enables policy learning from fixed datasets without additional environment interaction, making it appealing for safety-critical applications where online exploration is costly or unsafe. Diffusion-based decision-making methods have recently achieved strong performance in offline RL by modeling rich, multimodal trajectory distributions. However, existing diffusion planners are typically risk-neutral and therefore may overlook rare but catastrophic outcomes that are crucial in real-world deployment. In this work, we propose RS-Diffuser, a risk-sensitive offline diffusion planning framework that combines diffusion-based trajectory generation with distributional value critics. RS-Diffuser learns a diffusion planner over future state trajectories, a separate inverse dynamics model for action decoding, and a Monte Carlo distributional critic that estimates the full return distribution of candidate plans through quantile regression. At sampling time, we incorporate a risk-sensitive guidance signal into the denoising process, using gradients computed from tail-aware objectives such as Conditional Value at Risk to steer generation toward desired risk profiles. As a result, a single trained model can flexibly produce risk-averse, risk-neutral, or risk-seeking behaviors by changing only the inference-time risk parameter. Extensive experiments on risk-sensitive D4RL and risky robot navigation benchmarks demonstrate that RS-Diffuser achieves state-of-the-art performance, improving both overall return and worst-case robustness while reducing safety violations.
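Chinese abstract (translated) Offline reinforcement learning learns policies from fixed datasets without further environment interaction, which makes it attractive for safety-critical applications where online exploration is costly or unsafe. Diffusion-based decision-making methods have recently achieved strong offline RL performance by modeling rich, multimodal trajectory distributions. Existing diffusion planners, however, are typically risk-neutral and may overlook rare but catastrophic outcomes that matter in real-world deployment. We propose RS-Diffuser, a risk-sensitive offline diffusion planning framework that combines diffusion-based trajectory generation with distributional value critics. RS-Diffuser learns a diffusion planner over future state trajectories, a separate inverse dynamics model for action decoding, and a Monte Carlo distributional critic that estimates the full return distribution of candidate plans through quantile regression. At sampling time, a risk-sensitive guidance signal is incorporated into the denoising process, using gradients of tail-aware objectives such as Conditional Value at Risk (CVaR) to steer generation toward the desired risk profile. A single trained model can therefore produce risk-averse, risk-neutral, or risk-seeking behavior by changing only the inference-time risk parameter. Extensive experiments on risk-sensitive D4RL and risky robot navigation benchmarks show that RS-Diffuser achieves state-of-the-art performance, improving both overall return and worst-case robustness while reducing safety violations.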
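To make the tail-aware objective concrete, the sketch below approximates lower-tail CVaR from a set of equally weighted return quantiles, as a quantile-regression critic might output, and uses it to rank candidate plans. Note the simplification: the abstract describes gradient guidance inside the denoising process, whereas this example only ranks finished plans. The function names and sample values are hypothetical.

```python
import math


def cvar_from_quantiles(quantile_values: list[float], alpha: float) -> float:
    """Approximate lower-tail CVaR_alpha as the mean of the worst alpha-fraction
    of equally weighted return quantiles."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(quantile_values)
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


def pick_plan(plans: dict[str, list[float]], alpha: float) -> str:
    """Choose the plan with the best CVaR; alpha = 1.0 recovers the plain mean."""
    return max(plans, key=lambda name: cvar_from_quantiles(plans[name], alpha))


plans = {
    "safe": [4.0, 5.0, 5.0, 6.0],      # narrow return distribution
    "risky": [-10.0, 2.0, 12.0, 20.0],  # higher mean, heavy lower tail
}
risk_neutral_choice = pick_plan(plans, alpha=1.0)
risk_averse_choice = pick_plan(plans, alpha=0.5)
```

Changing only `alpha` flips the preferred plan, which mirrors the abstract's point that a single model can be steered between risk profiles at inference time.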

NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

Authors: Tianlin Pan, Lianyu Pang, Cheng Da, Huan Yang, Changqian Yu, Kun Gai, Wenhan Luo
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.27771
Pdf link: https://arxiv.org/pdf/2606.27771
Abstract Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of this drift: across three post-training methods (NFT, AWM, DPO), RL fine-tuning inflates the per-step velocity norm $\|v_\theta\|$ by $5\%$ to $15\%$ relative to the reference. A form of norm inflation has been studied in classifier-free guidance (CFG), where rescaling the velocity back to a reference norm at inference time can mitigate the resulting artifacts. However, this inference-time correction does not transfer cleanly to RL: rescaling $v_\theta$ to match $\|v_{\text{ref}}\|$ at inference time neither improves reward nor fixes the quality degradation, because the inflation is co-adapted into the model weights. Furthermore, an adjoint sensitivity analysis shows that velocity magnitude rescaling carries no coherent first-order reward signal at the batch level, indicating that suppressing norm inflation is unlikely to remove a consistently reward-carrying component. Since inference-time renormalization fails while norm suppression carries no reward cost, training-time intervention is the appropriate strategy. Together, these findings motivate NormGuard, a hinge penalty that activates only when $\|v_\theta\|$ exceeds $\|v_{\text{ref}}\|$ and composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward, with gains that amplify under few-step inference and are not explained by early stopping.
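Chinese abstract (translated) Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but it often degrades perceptual quality in ways the reward proxy does not capture. We identify a simple structural signature of this drift: across three post-training methods (NFT, AWM, DPO), RL fine-tuning inflates the per-step velocity norm $\|v_\theta\|$ by $5\%$ to $15\%$ relative to the reference. A form of norm inflation has been studied for classifier-free guidance (CFG), where rescaling the velocity back to a reference norm at inference time can reduce the resulting artifacts. That inference-time correction does not transfer cleanly to RL: rescaling $v_\theta$ to match $\|v_{\text{ref}}\|$ at inference time neither improves reward nor fixes the quality degradation, because the inflation is co-adapted into the model weights. An adjoint sensitivity analysis further shows that rescaling the velocity magnitude carries no coherent first-order reward signal at the batch level, so suppressing norm inflation is unlikely to remove a consistently reward-carrying component. Because inference-time renormalization fails while norm suppression has no reward cost, training-time intervention is the appropriate strategy. These findings motivate NormGuard, a hinge penalty that activates only when $\|v_\theta\|$ exceeds $\|v_{\text{ref}}\|$ and composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward, with gains that grow under few-step inference and are not explained by early stopping.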
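The hinge penalty described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract does not say whether the hinge is linear or squared, how norms are reduced over a batch, or what weight is used, so the linear hinge, the per-sample L2 norm, and the `weight` value below are placeholders.

```python
import math


def l2_norm(v: list[float]) -> float:
    """Euclidean norm of one velocity vector."""
    return math.sqrt(sum(x * x for x in v))


def norm_hinge_penalty(v_theta: list[float], v_ref: list[float]) -> float:
    """Zero while ||v_theta|| <= ||v_ref||; grows linearly once the norm inflates."""
    return max(0.0, l2_norm(v_theta) - l2_norm(v_ref))


def total_loss(base_loss: float, v_theta: list[float], v_ref: list[float],
               weight: float = 0.1) -> float:
    """Add the penalty to any base loss; `weight` is a hypothetical coefficient."""
    return base_loss + weight * norm_hinge_penalty(v_theta, v_ref)


v_ref = [3.0, 4.0]                                # reference velocity, norm 5
inflated = total_loss(1.0, [6.0, 8.0], v_ref)     # norm 10, penalty active
unchanged = total_loss(1.0, [0.3, 0.4], v_ref)    # norm 0.5, penalty inactive
```

Because the penalty is one-sided, it leaves the base loss untouched whenever the fine-tuned velocity is no larger than the reference.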

Booster Lab: A Data-Centric Pipeline for Learning Deployable Humanoid Locomotion Policies

Authors: Penghui Chen, Tinglong Zheng, Yufeng Zhang, Mingguo Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27813
Pdf link: https://arxiv.org/pdf/2606.27813
Abstract Humanoid robot motion learning requires not only task-oriented control policies but also physically feasible and natural behaviors that can be transferred to real robots. However, robot-feasible motion data are often scarce: raw human demonstrations may be incompatible with the robot morphology, open-source clips vary in quality, and simulation-collected robot trajectories still require feasibility checking. To address these challenges, we propose a data-centric training and deployment pipeline that integrates motion data curation, real-to-sim model adaptation, AMP-based reinforcement learning, and sim-to-real deployment. We validate the framework on the Booster T1 robot and further provide preliminary cross-platform validation on Booster K1.
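Chinese abstract (translated) Humanoid robot motion learning requires not only task-oriented control policies but also physically feasible, natural behaviors that can be transferred to real robots. Robot-feasible motion data, however, are often scarce: raw human demonstrations may be incompatible with the robot's morphology, open-source clips vary in quality, and robot trajectories collected in simulation still need feasibility checking. To address these challenges, we propose a data-centric training and deployment pipeline that integrates motion data curation, real-to-sim model adaptation, AMP-based reinforcement learning, and sim-to-real deployment. We validate the framework on the Booster T1 robot and provide preliminary cross-platform validation on the Booster K1.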

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

Authors: Qitai Tan, Zefang Zong, Yang Li, Peng Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27814
Pdf link: https://arxiv.org/pdf/2606.27814
Abstract Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling. Reinforcement learning (RL) directly optimizes environment rewards and encourages exploratory improvement toward a higher reward-defined ceiling, but sparse and delayed feedback makes early-stage learning much less efficient than OPD. In this paper, we propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity. (1) ATOD uses an annealed OPD-RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. (2) ATOD introduces Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which softly amplifies high-utility turns and improves dense supervision in long trajectories. Experiments on ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms competing post-training baselines: across the three student sizes, ATOD improves average success rate by 3.03 points over OPD and 23.62 points over GRPO, while surpassing the corresponding teacher models by 2.16 points.
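Chinese abstract (translated) Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and usually improves quickly early on, but its gains saturate once the student approaches the teacher, which caps final performance. Reinforcement learning (RL) optimizes environment rewards directly and encourages exploratory improvement toward a higher, reward-defined ceiling, but sparse and delayed feedback makes early-stage learning far less efficient than OPD. We propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity. (1) ATOD uses an annealed OPD-RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. (2) ATOD introduces Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which softly amplifies high-utility turns and improves dense supervision over long trajectories. Experiments on ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms competing post-training baselines: across the three student sizes, it improves average success rate by 3.03 points over OPD and 23.62 points over GRPO, and surpasses the corresponding teacher models by 2.16 points.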
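One simple way to picture the annealed OPD-RL schedule is a mixing weight that starts at the distillation loss and shifts toward the RL loss over training. The abstract does not give the schedule's shape, so the linear anneal, the endpoints, and the function names below are assumptions for illustration only.

```python
def opd_weight(step: int, total_steps: int,
               w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Linearly anneal the OPD weight from w_start to w_end (assumed schedule)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * frac


def combined_loss(opd_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Blend the two objectives: distillation dominates early, RL dominates late."""
    w = opd_weight(step, total_steps)
    return w * opd_loss + (1.0 - w) * rl_loss


early = combined_loss(opd_loss=2.0, rl_loss=4.0, step=0, total_steps=100)
middle = combined_loss(opd_loss=2.0, rl_loss=4.0, step=50, total_steps=100)
late = combined_loss(opd_loss=2.0, rl_loss=4.0, step=100, total_steps=100)
```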

PPO-EAL: Exact Augmented Lagrangian Proximal Policy Optimization for Safe Robotic Control

Authors: Jiatao Ding, Songqun Gao, Andrea Del Prete, Matteo Saveriano
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27861
Pdf link: https://arxiv.org/pdf/2606.27861
Abstract Reinforcement learning (RL) has emerged as a promising solution to accomplish complex robotic control tasks; however, most of the current work ignores the safety requirements. Safe RL seeks to maximize task performance while satisfying explicit physical constraints, but current algorithms struggle to learn the policy efficiently with precise constraint satisfaction. This work proposes PPO-EAL, a novel first-order constrained policy optimization framework that integrates exact augmented Lagrangian optimization into proximal policy optimization for safe robotic control. By combining clipped policy updates with exact quadratic penalty terms, PPO-EAL achieves theoretically grounded constraint enforcement without requiring impractically large penalty factors. A momentum-regulated multiplier update further improves dual-variable stability, reducing constraint oscillation and unsafe behavior while preserving task performance. We provide exactness and convergence analysis under standard stochastic approximation assumptions. Extensive validation across diverse GPU-accelerated robotic benchmarks (cart-pole balancing, cart-double-pendulum stabilization, 7-DoF Franka end-effector reaching, and quadrupedal locomotion) demonstrates superior safety precision and reward performance compared with state-of-the-art first-order safe RL baselines. Finally, we demonstrate zero-shot sim-to-real deployment in a contact-rich gear assembly task, where PPO-EAL substantially improves task success, reduces peak contact force, and enhances operational robustness. These results establish PPO-EAL as a general and practically deployable safe RL framework for diverse safety-critical robotic systems.
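Chinese abstract (translated) Reinforcement learning (RL) has become a promising way to accomplish complex robotic control tasks, yet most current work ignores safety requirements. Safe RL aims to maximize task performance while satisfying explicit physical constraints, but current algorithms struggle to learn a policy efficiently while satisfying constraints precisely. We propose PPO-EAL, a new first-order constrained policy optimization framework that integrates exact augmented Lagrangian optimization into proximal policy optimization for safe robotic control. By combining clipped policy updates with exact quadratic penalty terms, PPO-EAL achieves theoretically grounded constraint enforcement without impractically large penalty factors. A momentum-regulated multiplier update further improves dual-variable stability, reducing constraint oscillation and unsafe behavior while preserving task performance. We provide exactness and convergence analysis under standard stochastic approximation assumptions. Extensive validation on diverse GPU-accelerated robotic benchmarks (cart-pole balancing, cart-double-pendulum stabilization, 7-DoF Franka end-effector reaching, and quadrupedal locomotion) shows better safety precision and reward performance than state-of-the-art first-order safe RL baselines. Finally, we demonstrate zero-shot sim-to-real deployment on a contact-rich gear assembly task, where PPO-EAL substantially improves task success, reduces peak contact force, and improves operational robustness. These results establish PPO-EAL as a general and practically deployable safe RL framework for diverse safety-critical robotic systems.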
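For readers unfamiliar with augmented Lagrangian methods, the sketch below shows the textbook form for a single inequality constraint g <= 0, together with a momentum-smoothed multiplier step. It is not the paper's exact formulation: the abstract does not specify the "exact" penalty or the momentum rule, so `update_multiplier` and its `beta` parameter are hypothetical stand-ins.

```python
def augmented_lagrangian(task_loss: float, g: float, lam: float, rho: float) -> float:
    """Textbook augmented Lagrangian for one inequality constraint g <= 0.

    lam is the multiplier (dual variable) and rho the quadratic penalty factor.
    """
    shifted = max(0.0, lam + rho * g)
    return task_loss + (shifted ** 2 - lam ** 2) / (2.0 * rho)


def update_multiplier(lam: float, velocity: float, g: float, rho: float,
                      beta: float = 0.9) -> tuple[float, float]:
    """Hypothetical momentum-smoothed dual ascent step.

    The constraint signal is averaged before it moves the multiplier, which
    damps oscillation when g flips sign between iterations.
    """
    velocity = beta * velocity + (1.0 - beta) * g
    lam = max(0.0, lam + rho * velocity)
    return lam, velocity


# Violated constraint (g > 0) adds a penalty; a satisfied one with lam = 0 adds none.
violated = augmented_lagrangian(task_loss=1.0, g=0.5, lam=0.0, rho=2.0)
satisfied = augmented_lagrangian(task_loss=1.0, g=-1.0, lam=0.0, rho=2.0)

lam, velocity = 0.0, 0.0
for g in [0.4, 0.2, -0.1]:  # constraint values observed over three updates
    lam, velocity = update_multiplier(lam, velocity, g, rho=1.0)
```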

From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling

Authors: Fan Li, Chang Meng, Jiaqi Fu, Shuchang Liu, Tianke Zhang, Xueliang Wang, Xiaoqiang Feng, Yongqi Liu, Kaiqiao Zhan
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.27865
Pdf link: https://arxiv.org/pdf/2606.27865
Abstract Modern online platforms increasingly adopt multi-page architectures to accommodate diverse user needs. On these platforms, page navigation (the process of directing users to specific functional pages upon app entry) serves as a critical gateway that shapes users' first impressions and significantly influences subsequent engagement. To optimize this process, Kuaishou formulated the task of Personalized Landing Page Modeling (PLPM) and proposed KLAN, a reinforcement learning framework built upon Conservative Q-Learning (CQL). However, CQL-based approaches suffer from two fundamental limitations: (1) the Markov assumption fails to capture the strong non-Markovian temporal dependencies inherent in real-world user behaviors, and (2) TD learning with bootstrapping incurs severe cumulative errors and credit assignment difficulties under delayed rewards, particularly in long-horizon settings where users enter the app multiple times daily. To address these limitations, we propose GLAN (Generative Landing-page Adaptive Navigator), a sequence modeling framework built on Decision Transformer to tackle PLPM from a unified global-local perspective. Specifically, GLAN incorporates two key modules. First, we design the L-RTG module, which captures users' inter-day consumption dynamics to provide accurate global guidance for all page assignments within a day. Second, we propose the HRM module, which decomposes session-level feedback into fine-grained signals, enabling precise local supervision for each page assignment. Extensive online experiments conducted on the Kuaishou platform demonstrate the effectiveness of GLAN, which achieves improvements of +0.158% in Daily Active Users (DAU) and +0.108% in user Lifetime (LT).
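Chinese abstract (translated) Modern online platforms increasingly adopt multi-page architectures to serve diverse user needs. On these platforms, page navigation (directing users to a specific functional page when they open the app) is a critical gateway that shapes users' first impressions and strongly influences subsequent engagement. To optimize this process, Kuaishou formulated the Personalized Landing Page Modeling (PLPM) task and proposed KLAN, a reinforcement learning framework built on Conservative Q-Learning (CQL). CQL-based approaches, however, have two fundamental limitations: (1) the Markov assumption fails to capture the strong non-Markovian temporal dependencies inherent in real-world user behavior, and (2) TD learning with bootstrapping suffers severe cumulative errors and credit assignment difficulties under delayed rewards, particularly in long-horizon settings where users enter the app several times a day. To address these limitations, we propose GLAN (Generative Landing-page Adaptive Navigator), a sequence modeling framework built on Decision Transformer that tackles PLPM from a unified global-local perspective. GLAN has two key modules. The L-RTG module captures users' inter-day consumption dynamics to give accurate global guidance for all page assignments within a day. The HRM module decomposes session-level feedback into fine-grained signals, enabling precise local supervision for each page assignment. Extensive online experiments on the Kuaishou platform demonstrate GLAN's effectiveness, with improvements of +0.158% in Daily Active Users (DAU) and +0.108% in user Lifetime (LT).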
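Decision Transformer conditions each action on a return-to-go (RTG), the sum of rewards from the current step to the end of the trajectory, which avoids the bootstrapped TD targets criticized above. The sketch below computes the standard RTG only; the paper's L-RTG module is not specified in the abstract and is not reproduced here. The per-session reward values are made up for illustration.

```python
def returns_to_go(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Return-to-go at step t: rewards[t] + gamma * rewards[t+1] + ...

    Computed in one backward pass over the trajectory.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Hypothetical per-session rewards for one user's app entries in a day.
session_rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(session_rewards)
```

Each RTG value is a Monte Carlo target taken from logged data, so no value estimate feeds back into itself.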

Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding

Authors: Shuimu Chen, Yuteng Chen, Yuanshen Guan, Zebang Cheng, Zeyu Zhang, Shengqian Qin, Bin Xia, Jiaran Li, Wenming Yang, Fei Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27922
Pdf link: https://arxiv.org/pdf/2606.27922
Abstract Current multimodal reflection mechanisms for long video understanding predominantly rely on closed-loop self-reflection within internal parameters. Lacking objective external evidence, models are frequently trapped in blind confidence and often fail to correct errors. Furthermore, applying reinforcement learning to multi-stage reflection pipelines introduces severe policy coupling, which is exacerbated by a critical scarcity of dedicated training data. To address these limitations, this work proposes Reflect-R1, the first Evidence-Driven self-correction framework for long video understanding. The framework constructs a three-stage pipeline consisting of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously executing multiple temporal searches to resolve conflicts, it completely breaks the hallucination loop. To overcome policy coupling, we design a stage-decoupled reinforcement learning algorithm named SD-GRPO that independently computes advantage functions across different reasoning stages. Concurrently, we construct a dataset of 120K samples to bridge the training data gap. Extensive experiments on benchmarks such as VideoMME and LongVideoBench demonstrate that Reflect-R1 achieves state-of-the-art performance. Our method significantly improves the genuine rectification rate and enables authentic self-correction strictly grounded in objective evidence.
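Chinese abstract (translated) Current multimodal reflection mechanisms for long video understanding rely mainly on closed-loop self-reflection within the model's internal parameters. Without objective external evidence, models are often trapped in blind confidence and frequently fail to correct their errors. Applying reinforcement learning to multi-stage reflection pipelines also introduces severe policy coupling, which a critical scarcity of dedicated training data makes worse. To address these limitations, we propose Reflect-R1, the first evidence-driven self-correction framework for long video understanding. The framework builds a three-stage pipeline of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously running multiple temporal searches to resolve conflicts, it breaks the hallucination loop. To overcome policy coupling, we design SD-GRPO, a stage-decoupled reinforcement learning algorithm that computes advantage functions independently for the different reasoning stages. We also construct a dataset of 120K samples to close the training data gap. Extensive experiments on benchmarks such as VideoMME and LongVideoBench show that Reflect-R1 achieves state-of-the-art performance. Our method significantly improves the genuine rectification rate and enables authentic self-correction strictly grounded in objective evidence.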

Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing

Authors: Can Li, Ting Zhang, Junbo Zhao, Hua Huang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.27926
Pdf link: https://arxiv.org/pdf/2606.27926
Abstract Geometry problem solving has increasingly adopted the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these bottlenecks, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes. These results indicate that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, and they offer insight into how neural agents can be grounded by formal systems to achieve verifiable problem solving.
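Chinese abstract (translated) Geometry problem solving has increasingly adopted the neuro-symbolic paradigm, which combines neural intuition with symbolic rigor. Current frameworks, however, face severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers often reach a deductive impasse because their rule libraries are fixed. We propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning in a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from the current proof state and ensures soundness by filtering every proposal through symbolic verification. Evaluations on Geometry3K and PGPS9K show that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes. The results indicate that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, and they offer insight into how neural agents can be grounded by formal systems to achieve verifiable problem solving.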

Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition

Authors: Violeta Basten-Romero, Rubén Muñoz-Tafalla, Anna María Díaz-Rovira, Bertran Miquel-Oliver, Isaac Filella-Merce, Víctor Guallar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
Arxiv link: https://arxiv.org/abs/2606.27939
Pdf link: https://arxiv.org/pdf/2606.27939
Abstract Protein language models are standard priors for biological sequence generation, but steering them toward explicit distributional design targets remains largely unexplored. We study a constrained protein generation problem in which sequences must match a desired amino-acid (AA) composition profile while preserving plausible sequence statistics and diversity. The motivating application is synthetic feed protein design, where the AA composition of dietary proteins directly determines their nutritional value. We propose a two-stage pipeline in which domain-adaptive fine-tuning (FT) on an in-domain protein dataset is followed by iterative reward-weighted FT via reinforcement learning (RL) anchored against the FT model as a frozen reference. We evaluate the pipeline on two AA compositions and find that FT brings the average composition close to the target, while the subsequent RL enforces specific sequence constraints that FT alone cannot satisfy. We additionally evaluate the design choices of the proposed composition reward term against two baselines and an ablated variant, isolate the contribution of each training stage, and verify that AA composition alignment is achieved without degrading sequence quality.
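Chinese abstract (translated) Protein language models are standard priors for biological sequence generation, but steering them toward explicit distributional design targets remains largely unexplored. We study a constrained protein generation problem in which sequences must match a desired amino-acid (AA) composition profile while preserving plausible sequence statistics and diversity. The motivating application is synthetic feed protein design, where the AA composition of dietary proteins directly determines their nutritional value. We propose a two-stage pipeline: domain-adaptive fine-tuning (FT) on an in-domain protein dataset, followed by iterative reward-weighted FT via reinforcement learning (RL), anchored against the FT model as a frozen reference. We evaluate the pipeline on two AA compositions and find that FT brings the average composition close to the target, while the subsequent RL stage enforces specific sequence constraints that FT alone cannot satisfy. We also evaluate the design choices of the proposed composition reward term against two baselines and an ablated variant, isolate the contribution of each training stage, and verify that AA composition alignment is achieved without degrading sequence quality.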
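A composition reward of the kind described above could, under simple assumptions, score a sequence by how close its amino-acid frequencies are to the target profile. The abstract does not define the reward term, so the total-variation distance used here, and the function names, are illustrative choices rather than the authors' formulation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition(sequence: str) -> dict[str, float]:
    """Fraction of each standard amino acid in the sequence."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    return {aa: counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS}


def composition_reward(sequence: str, target: dict[str, float]) -> float:
    """1 minus the total-variation distance to the target profile.

    Assumed reward shape: 1.0 means the composition matches the target exactly.
    """
    observed = composition(sequence)
    tv = 0.5 * sum(abs(observed[aa] - target.get(aa, 0.0)) for aa in AMINO_ACIDS)
    return 1.0 - tv


target = {"A": 0.5, "K": 0.5}  # hypothetical target: half alanine, half lysine
matched = composition_reward("AAKK", target)
skewed = composition_reward("AAAA", target)
```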

TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

Authors: Jing Wang, Xiangxin Zhou, Jiajun Liang, Kaiqi Liu, Wanyun Pang, Zhenyu Xie, Tianyu Pang, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.28016
Pdf link: https://arxiv.org/pdf/2606.28016
Abstract Autoregressive (AR) video diffusion models enable low-latency streaming generation by synthesizing videos chunk by chunk with cached visual context, but this chunk-wise formulation makes temporal instruction following ambiguous. A single global prompt does not specify which sub-event should be realized in each chunk, while naively switching to step-wise prompts often leads to delayed reactions, blended step semantics, and error propagation across prompt transitions. These failures are difficult to address with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, while rollout-based distillation still optimizes low-level denoising or teacher-distribution matching rather than directly enforcing action ordering and prompt-transition correctness. We address these challenges with TempAct, a planner-executor reinforcement learning framework that jointly optimizes temporal decomposition and step-conditioned execution for temporally plausible AR video generation. TempAct uses an LLM planner to explore span-aware step prompts that are executable by the video model, and trains an AR diffusion executor to follow these prompts under its own generated histories. Its key mechanism is hierarchical group exploration: candidate plans form planning groups, and each plan induces an execution group of multiple continuations from a shared visual context, enabling plan-level credit assignment for long-horizon temporal outcomes and executor-level credit assignment for prompt-switch behavior. We further design hierarchical rewards that combine plan-quality and full-video temporal feedback for the planner with local transition-level step-following rewards, aesthetic regularization, and KL constraints for the executor. Experiments on Self-Forcing and LongLive show that TempAct improves temporal consistency while preserving overall visual quality.
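Chinese abstract (translated) Autoregressive (AR) video diffusion models enable low-latency streaming generation by synthesizing video chunk by chunk with cached visual context, but this chunk-wise formulation makes temporal instruction following ambiguous. A single global prompt does not say which sub-event each chunk should realize, while naively switching to step-wise prompts often causes delayed reactions, blended step semantics, and error propagation across prompt transitions. These failures are hard to fix with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, and rollout-based distillation still optimizes low-level denoising or teacher-distribution matching instead of directly enforcing action ordering and prompt-transition correctness. We address these challenges with TempAct, a planner-executor reinforcement learning framework that jointly optimizes temporal decomposition and step-conditioned execution for temporally plausible AR video generation. TempAct uses an LLM planner to explore span-aware step prompts that the video model can execute, and trains an AR diffusion executor to follow those prompts under its own generated histories. Its key mechanism is hierarchical group exploration: candidate plans form planning groups, and each plan induces an execution group of multiple continuations from a shared visual context, enabling plan-level credit assignment for long-horizon temporal outcomes and executor-level credit assignment for prompt-switch behavior. We also design hierarchical rewards: plan-quality and full-video temporal feedback for the planner, and local transition-level step-following rewards, aesthetic regularization, and KL constraints for the executor. Experiments on Self-Forcing and LongLive show that TempAct improves temporal consistency while preserving overall visual quality.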

Regularized Reward-Punishment Reinforcement Learning

Authors: Jiexin Wang, Eiji Uchibe
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.28152
Pdf link: https://arxiv.org/pdf/2606.28152
Abstract We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches that optimize reward-seeking and punishment-related policies largely independently, KCPR enables direct interactions between companion policies by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, allowing reward and punishment information to jointly influence value propagation. To improve learning stability, we introduce a companion-prior softening mechanism and evaluate separate replay-buffer designs for balancing reward- and punishment-related experience. Experiments in grid-world and Gazebo robotic navigation tasks demonstrate that klDMP improves safety and learning stability while maintaining competitive task performance compared with DQN, SQL and softDMP. These results suggest that policy-level coordination provides an effective mechanism for integrating multiple behavioral objectives and may serve as a useful design principle for reinforcement learning systems with interacting motivational processes.
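Chinese abstract (translated) We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches, which optimize the reward-seeking and punishment-related policies largely independently, KCPR lets the companion policies interact directly by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, so reward and punishment information jointly influence value propagation. To improve learning stability, we introduce a companion-prior softening mechanism and evaluate separate replay-buffer designs for balancing reward-related and punishment-related experience. Experiments on grid-world and Gazebo robotic navigation tasks show that klDMP improves safety and learning stability while maintaining competitive task performance compared with DQN, SQL, and softDMP. These results suggest that policy-level coordination is an effective mechanism for integrating multiple behavioral objectives and may be a useful design principle for reinforcement learning systems with interacting motivational processes.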
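The idea of using one policy as a prior for another can be illustrated with the standard closed form for a KL-regularized objective: maximizing expected Q minus beta times KL(pi || prior) gives pi(a) proportional to prior(a) * exp(Q(a) / beta). The sketch below applies that generic result. How KCPR actually couples and softens the two policies is not specified in the abstract, so treating the punishment policy as the prior for the reward policy, and all numeric values, are assumptions.

```python
import math


def kl_regularized_policy(q_values: list[float], prior: list[float],
                          beta: float) -> list[float]:
    """Closed-form maximizer of E_pi[Q] - beta * KL(pi || prior).

    Result: pi(a) proportional to prior(a) * exp(Q(a) / beta).
    """
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    if any(p <= 0.0 for p in prior):
        raise ValueError("prior probabilities must be strictly positive")
    logits = [math.log(p) + q / beta for q, p in zip(q_values, prior)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-action example: the punishment-avoiding policy dislikes
# action 0, and serves as the prior for the reward-seeking policy.
punishment_policy = [0.1, 0.9]
reward_q = [1.0, 0.0]
reward_policy = kl_regularized_policy(reward_q, prior=punishment_policy, beta=1.0)
```

A larger `beta` pulls the result toward the prior, while a smaller one lets the Q-values dominate.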

Tandem Reinforcement Learning with Verifiable Rewards

Authors: Difan Jiao, Raghav Singhal, Robert West, Ashton Anderson
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28166
Pdf link: https://arxiv.org/pdf/2606.28166
Abstract Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, with RLVR documented to drift reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this compatibility problem: a trained, stronger senior co-generates each rollout with a frozen, weaker junior, and the two are rewarded as a team, so the senior is pushed to reason in ways the junior can follow. Yet this paradigm has so far been demonstrated only in proof-of-concept settings, leaving open whether it scales to the long chains of thought of the modern RLVR pipeline. In this work, we propose Tandem Reinforcement Learning (TRL), which carries the tandem training paradigm into RLVR. In TRL, the senior and a frozen junior alternate stochastically to co-generate the reasoning, the resulting generation is rewarded, and the standard GRPO loss is applied to the senior. Training Qwen3-4B-Instruct on competition math, we find that TRL matches vanilla GRPO on solo reasoning capability while three properties emerge together from the same rollout structure: stronger handoff robustness with the junior, reduced distributional drift from the junior, and a chain-of-thought more legible to the junior. Our results demonstrate a promising route for RLVR with practical payoffs in multi-model communication and human compatibility.
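Chinese abstract (translated) Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning ability of large language models, reaching expert or even superhuman performance in domains such as competition math. Whether weaker agents and humans can actually use this capability is far less certain, since RLVR has been documented to push reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this compatibility problem: a trained, stronger senior co-generates each rollout with a frozen, weaker junior, and the two are rewarded as a team, so the senior is pushed to reason in ways the junior can follow. So far the paradigm has been demonstrated only in proof-of-concept settings, leaving open whether it scales to the long chains of thought of the modern RLVR pipeline. We propose Tandem Reinforcement Learning (TRL), which carries the tandem training paradigm into RLVR. In TRL, the senior and a frozen junior alternate stochastically to co-generate the reasoning, the resulting generation is rewarded, and the standard GRPO loss is applied to the senior. Training Qwen3-4B-Instruct on competition math, we find that TRL matches vanilla GRPO on solo reasoning capability while three properties emerge together from the same rollout structure: stronger handoff robustness with the junior, reduced distributional drift from the junior, and a chain of thought that is more legible to the junior. Our results point to a promising route for RLVR with practical payoffs in multi-model communication and human compatibility.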
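The stochastic alternation between senior and junior can be pictured with a toy rollout loop. This sketch assumes segment-level handoffs with a fixed switching probability and represents both models as plain callables; the abstract does not state the handoff granularity or probability, and `co_generate` is a hypothetical helper, not the authors' code.

```python
import random
from typing import Callable

# A "model" here is any function mapping the text so far to its next segment.
Model = Callable[[str], str]


def co_generate(senior: Model, junior: Model, prompt: str, num_segments: int,
                p_senior: float = 0.5, seed: int = 0) -> tuple[str, list[str]]:
    """Build one rollout by randomly choosing who writes each segment.

    Returns the full text and the list of speakers, so a trainer can apply
    the loss only to senior-written segments (the junior stays frozen).
    """
    rng = random.Random(seed)
    text = prompt
    speakers: list[str] = []
    for _ in range(num_segments):
        use_senior = rng.random() < p_senior
        text += (senior if use_senior else junior)(text)
        speakers.append("senior" if use_senior else "junior")
    return text, speakers


# Stand-in models that emit a fixed marker per segment.
rollout, speakers = co_generate(lambda ctx: "S", lambda ctx: "j", "Q:", num_segments=6)
```

Because either model may have to continue from the other's partial reasoning, the team reward only improves when the senior's segments are ones the junior can pick up.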

Learning Stable In-Grasp Manipulation in a Non-Dropping Action Space

Authors: Ha Thang Long Doan, Hikaru Arita, Kazuto Nakashima, Kenji Tahara
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.28196
Pdf link: https://arxiv.org/pdf/2606.28196
Abstract Traditionally, dexterous manipulation controllers are designed using analytic models constrained by strong assumptions about the hand and the objects being manipulated. Reinforcement learning (RL) has become another common approach, in which skills are explored openly in an end-to-end manner, but it is inefficient because of hard-to-notice instability and conflicting learning objectives. This paper attempts to explore stable and accurate manipulation skills efficiently by decomposing dexterous skills into multiple simpler, analyzable components. Each skill component is then learned with constraints and guidance from classical physics and control theory. Our work shows that, for stable grasping and in-grasp repositioning and reorientation under different objects, sensor and motor noise, latency, and frictional conditions, skill learning becomes efficient and stable when it draws on prior knowledge from theory.
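Chinese abstract (translated) Traditionally, dexterous manipulation controllers are designed with analytic models that rest on strong assumptions about the hand and the manipulated objects. Reinforcement learning (RL) has become another common approach, in which skills are explored openly and end to end, but it is inefficient because of hard-to-notice instability and conflicts among learning objectives. This paper attempts to explore stable and accurate manipulation skills efficiently by decomposing dexterous skills into several simpler, analyzable components. Each component is then learned under constraints and guidance from classical physics and control theory. Our work shows that, for stable grasping and in-grasp repositioning and reorientation across different objects, sensor and motor noise, latency, and friction conditions, skill learning becomes efficient and stable when it uses prior knowledge from theory.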

Keyword: diffusion policy

No results.