Arxiv Papers of Today

Generated: 2026-03-18 16:55:28 (UTC+8); arXiv release time: 2026-03-18 20:00 EDT (2026-03-19 08:00 UTC+8)

52 relevant papers today

Keyword: reinforcement learning

SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

Authors: Chenyu Ge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15622
Pdf link: https://arxiv.org/pdf/2603.15622
Abstract Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

Alternating Reinforcement Learning with Contextual Rubric Rewards

Authors: Guangchen Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.15646
Pdf link: https://arxiv.org/pdf/2603.15646
Abstract Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
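The contrast the abstract draws between fixed scalarization and optimizing one rubric meta-class at a time can be illustrated with a minimal Python sketch. The meta-class names, the weights, and the "train on the weakest meta-class next" rule are all assumptions made for illustration; the abstract only states that the next meta-class is chosen by a search-based procedure driven by task performance.

```python
from typing import Dict


def scalarize(rubric_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Baseline: compress the rubric score vector into one scalar with fixed weights."""
    return sum(weights[name] * score for name, score in rubric_scores.items())


def alternating_reward(rubric_scores: Dict[str, float], active: str) -> float:
    """Alternating variant: only the currently active meta-class provides the reward."""
    return rubric_scores[active]


def pick_next_meta_class(eval_scores: Dict[str, float]) -> str:
    """Hypothetical selection rule (an assumption): train next on the weakest meta-class."""
    return min(eval_scores, key=eval_scores.get)


# Hypothetical meta-classes and scores, for illustration only.
scores = {"accuracy": 0.8, "completeness": 0.4, "communication": 0.6}
weights = {"accuracy": 0.5, "completeness": 0.3, "communication": 0.2}

active = pick_next_meta_class(scores)
print("scalarized reward:", scalarize(scores, weights))
print("active meta-class:", active)
print("alternating reward:", alternating_reward(scores, active))
```

In this toy setup the scalarized reward depends on the hand-chosen weights, whereas the alternating reward needs no weights at all, which is the design point the abstract emphasizes.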

Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

Authors: Liu Hung Ming
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.15655
Pdf link: https://arxiv.org/pdf/2603.15655
Abstract In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion -- where agents develop private protocols to evade monitoring -- presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner's Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms "Semantic Degradation," where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a "Transparency Paradox" where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart's Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.
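One of the monitored signals named in the abstract, Jensen-Shannon divergence drift feeding an EMA-based score, can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses only the JSD signal (the abstract also lists codebook displacement and observer accuracy), and the smoothing factor, threshold, and codebook-usage histograms are made-up values.

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing by convention.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ema_update(previous: float, signal: float, alpha: float = 0.1) -> float:
    """Exponential moving average of a monitoring signal."""
    return (1 - alpha) * previous + alpha * signal


def collusion_scores(reference: Sequence[float],
                     windows: Sequence[Sequence[float]],
                     alpha: float = 0.1) -> List[float]:
    """EMA of the JSD between a reference code-usage histogram and each new window."""
    score, history = 0.0, []
    for usage in windows:
        score = ema_update(score, js_divergence(reference, usage), alpha)
        history.append(score)
    return history


# Hypothetical codebook-usage histograms over four discrete symbols.
reference = [0.25, 0.25, 0.25, 0.25]
windows = [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1], [1.0, 0.0, 0.0, 0.0]]
THRESHOLD = 0.05  # hypothetical intervention threshold

for step, score in enumerate(collusion_scores(reference, windows, alpha=0.5)):
    print(step, round(score, 3), "breach" if score > THRESHOLD else "ok")
```

Using an EMA rather than the raw divergence makes a single noisy window less likely to trigger an intervention, at the cost of reacting more slowly to a real shift.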

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

Authors: Ruyi Zhang, Heng Gao, Songlei Jian, Yusong Tan, Haifang Zhou
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15692
Pdf link: https://arxiv.org/pdf/2603.15692
Abstract Backdoor attacks compromise model reliability by using triggers to manipulate outputs. Trigger inversion can accurately locate these triggers via a generator and is therefore critical for backdoor defense. However, the discrete nature of text prevents existing noise-based trigger generators from being applied to natural language processing (NLP). To overcome this limitation, we employ the rich knowledge embedded in large language models (LLMs) and propose a Backdoor defender powered by LLM Trigger Generator, termed BadLLM-TG. It is optimized through prompt-driven reinforcement learning, using the victim model's feedback loss as the reward signal. The generated triggers are then employed to mitigate the backdoor via adversarial training. Experiments show that our method reduces the attack success rate by 76.2% on average, outperforming the second-best defender by 13.7.

Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Authors: Lit Sin Tan, Junzhe Chen, Xiaolong Fu, Lichen Ma, Junshi Huang, Jianzhong Shi, Yan Li, Lijie Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15724
Pdf link: https://arxiv.org/pdf/2603.15724
Abstract Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.

Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

Authors: Jacob Levy, Tyler Westenbroek, Kevin Huang, Fernando Palafox, Patrick Yin, Shayegan Omidshafiei, Dong-Ki Kim, Abhishek Gupta, David Fridovich-Keil
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.15759
Pdf link: https://arxiv.org/pdf/2603.15759
Abstract Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce Simulation Distillation (SimDist), a sim-to-real framework that distills structural priors from a simulator into a latent world model and enables rapid real-world adaptation via online planning and supervised dynamics finetuning. By transferring reward and value models directly from simulation, SimDist provides dense planning signals from raw perception without requiring value learning during deployment. As a result, real-world adaptation reduces to short-horizon system identification, avoiding long-horizon credit assignment and enabling fast, stable improvement. Across precise manipulation and quadruped locomotion tasks, SimDist substantially outperforms prior methods in data efficiency, stability, and final performance.

CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving

Authors: Yihong Guo, Dongqiangzi Ye, Sijia Chen, Anqi Liu, Xianming Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15771
Pdf link: https://arxiv.org/pdf/2603.15771
Abstract Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
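The propose, evaluate, and correct loop described in the abstract can be sketched in a few lines of Python. The policy and collision critic below are toy stand-ins, and the `max_corrections` cap is an assumed stand-in for the abstract's unspecified "safety criterion"; none of these names come from the paper.

```python
from typing import Callable, List, Tuple


def plan_step(propose: Callable[[List[int]], int],
              predicts_collision: Callable[[int], bool],
              max_corrections: int = 5) -> Tuple[int, List[int]]:
    """Propose a motion token, and re-propose conditioned on the rejected
    tokens (the self-correction trace) while the critic predicts a collision."""
    trace: List[int] = []          # unsafe tokens rejected so far
    token = propose(trace)
    while predicts_collision(token) and len(trace) < max_corrections:
        trace.append(token)        # keep the unsafe token as context
        token = propose(trace)     # next proposal is conditioned on the trace
    return token, trace


# Toy stand-ins (assumptions): the "policy" walks through a fixed candidate
# list, and the "critic" flags a fixed set of tokens as leading to collision.
candidates = [3, 7, 2]
unsafe_tokens = {3, 7}


def toy_policy(trace: List[int]) -> int:
    return candidates[min(len(trace), len(candidates) - 1)]


def toy_critic(token: int) -> bool:
    return token in unsafe_tokens


token, trace = plan_step(toy_policy, toy_critic)
print("chosen token:", token, "correction trace:", trace)
```

Note that with a finite cap the loop can still return a token the critic flags as unsafe, so a real planner would need a fallback for that case.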

Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning

Authors: Patrick Yin, Tyler Westenbroek, Zhengyu Zhang, Joshua Tran, Ignacio Dagnino, Eeshani Shilamkar, Numfor Mbiziwo-Tiapo, Simran Bagaria, Xinlei Liu, Galen Mullins, Andrey Kolobov, Abhishek Gupta
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15789
Pdf link: https://arxiv.org/pdf/2603.15789
Abstract Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale with compute, as performance quickly saturates when training revisits the same narrow regions of state space. We introduce a simple and scalable framework that enables on-policy reinforcement learning to robustly solve a broad class of dexterous manipulation tasks using a single reward function, fixed algorithm hyperparameters, no curricula, and no human demonstrations. Our key insight is that long-horizon exploration can be dramatically simplified by using simulator resets to systematically expose the RL algorithm to the diverse set of robot-object interactions which underlie dexterous manipulation. The framework programmatically generates such resets with minimal human input, converting additional compute directly into broader behavioral coverage and continued performance gains. We show that it gracefully scales to long-horizon dexterous manipulation tasks beyond the capabilities of existing approaches and is able to learn robust policies over significantly wider ranges of initial conditions than baselines. Finally, we distill the resulting policies into visuomotor policies which display robust retrying behavior and substantially higher success rates than baselines when transferred to the real world zero-shot.

Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

Authors: Ezgi Korkmaz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15871
Pdf link: https://arxiv.org/pdf/2603.15871
Abstract Following the pivotal success of learning strategies to win at tasks, solely by interacting with an environment without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet, reinforcement learning policies face exponentially growing state spaces in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent's interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and further comes with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

Game-Theory-Assisted Reinforcement Learning for Border Defense: Early Termination based on Analytical Solutions

Authors: Goutam Das, Michael Dorothy, Kyle Volle, Daigo Shishika
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.15907
Pdf link: https://arxiv.org/pdf/2603.15907
Abstract Game theory provides the gold standard for analyzing adversarial engagements, offering strong optimality guarantees. However, these guarantees often become brittle when assumptions such as perfect information are violated. Reinforcement learning (RL), by contrast, is adaptive but can be sample-inefficient in large, complex domains. This paper introduces a hybrid approach that leverages game-theoretic insights to improve RL training efficiency. We study a border defense game with limited perceptual range, where defender performance depends on both search and pursuit strategies, making classical differential game solutions inapplicable. Our method employs the Apollonius Circle (AC) to compute equilibrium in the post-detection phase, enabling early termination of RL episodes without learning pursuit dynamics. This allows RL to concentrate on learning search strategies while guaranteeing optimal continuation after detection. Across single- and multi-defender settings, this early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories. Extensive experiments validate these findings and demonstrate the overall effectiveness of our approach.
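The Apollonius Circle mentioned in the abstract has a closed form that is easy to compute. The sketch below assumes a slower evader and a faster pursuer moving at constant speeds in the plane, with `speed_ratio` = evader speed / pursuer speed < 1; the function names are illustrative, and how the paper actually uses the circle to terminate episodes is not specified in the abstract.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def apollonius_circle(evader: Point, pursuer: Point,
                      speed_ratio: float) -> Tuple[Point, float]:
    """Circle of points X with |X - evader| / |X - pursuer| == speed_ratio.

    With straight-line motion at constant speeds, both agents reach any
    point on this circle at the same time.
    """
    if not 0.0 < speed_ratio < 1.0:
        raise ValueError("this sketch assumes a strictly slower evader")
    k2 = speed_ratio ** 2
    center = ((evader[0] - k2 * pursuer[0]) / (1 - k2),
              (evader[1] - k2 * pursuer[1]) / (1 - k2))
    radius = speed_ratio * math.dist(evader, pursuer) / (1 - k2)
    return center, radius


def evader_reaches_first(target: Point, evader: Point, pursuer: Point,
                         speed_ratio: float) -> bool:
    """True if the target lies strictly inside the circle, i.e. the evader
    gets there before the pursuer can."""
    center, radius = apollonius_circle(evader, pursuer, speed_ratio)
    return math.dist(target, center) < radius


center, radius = apollonius_circle((0.0, 0.0), (3.0, 0.0), 0.5)
print("center:", center, "radius:", radius)
```

A check of this kind is what allows an outcome to be decided analytically once the evader has been detected, rather than by simulating the whole pursuit.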

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Authors: Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu, Eric Rosen, Kausik Sivakumar, Riedana Yan, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15956
Pdf link: https://arxiv.org/pdf/2603.15956
Abstract Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Authors: Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15981
Pdf link: https://arxiv.org/pdf/2603.15981
Abstract Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Authors: Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang, Yutao Zhang, Kevin I-Kai Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16043
Pdf link: https://arxiv.org/pdf/2603.16043
Abstract Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.
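The "intra-group normalization" that replaces a learned critic can be shown with a short Python sketch of the commonly used group-relative advantage: each sampled sequence's reward is standardized against the other samples drawn for the same input. The use of the population standard deviation and the `eps` value are assumptions here, and the rewards are made-up numbers rather than the paper's tri-objective reward.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples from the same input.

    No value network is involved: the group mean acts as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four feature sequences generated from one sensor window.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the baseline is recomputed per group, the advantages are invariant to shifting or rescaling all rewards for a given input, which is one way to read the abstract's claim of "self-calibrating" signals across users.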

ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

Authors: Yu Li, Rui Miao, Zhengling Qi, Tian Lan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16060
Pdf link: https://arxiv.org/pdf/2603.16060
Abstract The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework in which a shared policy operates both to manage skills at a high level and to generate responses at a low level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training.

Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models

Authors: Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, Yue Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16065
Pdf link: https://arxiv.org/pdf/2603.16065
Abstract Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement by adapting foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Initializing with a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
中文摘要 强化学习（RL）在优化机器人操作策略方面展现出巨大潜力，但其效能仍受限于设计可推广奖励函数的难度。本文提出一个在线政策优化框架，将基础VLM改造为在线奖励生成器。我们基于最先进的VLM开发了一个稳健、可扩展的奖励模型，训练基于涵盖真实机器人轨迹、人与物交互以及多样化模拟环境的大规模多源数据集。与以往对整个轨迹进行事后评估的方法不同，我们的方法利用VLM构建包含过程、完成和时间对比奖励的多面向奖励信号，基于当前视觉观察。我们初始化时使用通过模仿学习（IL）训练的基础策略，利用这些VLM奖励引导模型以闭环方式纠正次优行为。我们评估了本框架在需要顺序执行和精确控制的挑战性长期操作基准测试上。关键是，我们的奖励模型在这些测试环境中完全采用零射击方式。实验结果表明，我们的方法在仅30次强化学习迭代内显著提升初始IL策略的成功率，展现了显著的样本效率。这一实证证据强调，VLM生成的信号能够提供可靠的反馈以解决执行错误，有效消除了人工奖励工程的需求，促进了机器人学习的高效在线优化。

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

SWE-QA-Pro：代表性的基准和可扩展的代码库级代码理解培训配方

Authors: Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16124
Pdf link: https://arxiv.org/pdf/2603.16124
Abstract Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
中文摘要 代理代码库级的理解对于自动化复杂的软件工程任务至关重要，但该领域缺乏可靠的基准。现有评估往往忽视长尾主题，依赖于大型语言模型（LLM）通过记忆知识作弊的流行数据库。为此，我们引入了SWE-QA-Pro，这是一个由多样化长尾仓库和可执行环境构建的基准测试。我们通过议题驱动的聚类执行主题平衡，覆盖代表性不足的任务类型，并采用严格的难度校准流程：通过直接答案基线解决的问题被过滤掉。这导致数据集中代理式工作流显著优于直接回答（例如，Claude Sonnet 4.5 的差距约 13 分），证实了代理式代码库探索的必要性。此外，为了解决此类复杂行为训练数据的稀缺性问题，我们提出了一个可扩展的合成数据流水线，支持两阶段训练方案：监督微调（SFT）和基于AI反馈的强化学习（RLAIF）。这种方法允许小型开放模型学习高效的工具使用和推理。从实证角度看，使用我们配方训练的Qwen3-8B模型在SWE-QA-Pro上比GPT-4o高出2.3个百分点，并大幅缩小了与最先进专有模型的差距，证明了我们评估的有效性和代理训练流程的有效性。

Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

噪声数据对可验证奖励的强化学习具有破坏性

Authors: Yuxuan Zhu, Daniel Kang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16140
Pdf link: https://arxiv.org/pdf/2603.16140
Abstract Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors causes 5-12% lower accuracy than training on clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.
中文摘要 带有可验证奖励的强化学习（RLVR）推动了大型语言模型在多个领域的能力进步。最新研究表明，改进的RLVR算法使模型能够有效地从错误标注中学习，实现与从干净数据学习相当的性能。在本研究中，我们证明这些发现并不成立，因为所谓的100%噪声训练数据被干净的数据“污染”了。通过严格的重新验证流程修正数据集后，我们证明噪声对RLVR具有破坏性。我们表明现有的RLVR算法改进未能减轻噪声的影响，性能与基础GRPO相当。此外，我们发现，在数学推理基准测试中，在真正错误的标注上训练的模型比在干净数据上训练的模型差8-10%。最后，我们证明这些发现同样适用于Text2SQL任务中的真实噪声：在真实的人工标注错误上训练会使准确率比干净数据低5-12%。我们的结果表明，目前的RLVR方法尚无法弥补数据质量的不足。高质量的数据依然至关重要。

Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

去中心化协作无人机部署的通信感知多智能体强化学习

Authors: Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, Jae Kim
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.16141
Pdf link: https://arxiv.org/pdf/2603.16141
Abstract Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent-entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based offline upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.
中文摘要 自主无人机（UAV）群体越来越多地被用作快速部署的空中中继和感测平台，但实际部署必须在部分可观测性和间歇点对点链路下运行。我们提出了一个基于图的多智能体强化学习框架，在集中式训练与去中心化执行（CTDE）下训练：集中式批评者和全局状态仅在训练期间可用，而每架无人机则利用来自邻居的本地观察和消息执行共享策略。我们的架构通过代理-实体注意力模块编码本地代理状态和邻近实体，并通过距离限制的通信图聚合无人机间的自注意信息。我们主要在合作中继部署任务（DroneConnect）上进行评估，其次是对抗性交战任务（DroneCombat）。在DroneConnect中，该方法在受限通信和部分观察下实现高覆盖率（例如M=5无人机和N=10节点时覆盖率74%），同时在基于混合整数线性规划（MILP）优化的离线上界中保持竞争力，且无需微调即可推广到未见团队规模。在对抗环境中，同一框架无需架构变动即可转移，并且比非沟通基线的胜率更高。

HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

HIPO：通过受限强化学习实现指令层级

Authors: Keru Chen, Jun Luo, Sen Lin, Yingbin Liang, Alvaro Velasquez, Nathaniel Bastian, Shaofeng Zou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16152
Pdf link: https://arxiv.org/pdf/2603.16152
Abstract Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce HIPO, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. HIPO elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that HIPO significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.
中文摘要 层级指令跟随（HIF）指的是用一个按优先级排序的指令栈来提示大型语言模型的问题。RLHF和DPO等标准方法通常在此问题上失败，因为它们主要针对单一目标进行优化，未能显式地强制遵守系统提示。而监督微调依赖于模仿经过过滤的合规数据，无法在算法层面确立优先级不对称性。本文介绍了HIPO，一种新颖的对齐框架，将HIF表述为约束马尔可夫决策过程。HIPO将系统提示从单纯的输入上下文提升为严格的算法边界。通过原始-对偶安全强化学习方法，该算法动态地将系统提示的合规性作为显式约束来执行，并严格在该可行区域内最大化用户效用。跨多种模型架构（如Qwen、Phi、Llama）的广泛评估表明，HIPO显著提升了系统合规性和用户效用。此外，机制分析表明，这种约束优化会自主地驱动模型将注意力转向长程的系统令牌，为在复杂工作流中可靠部署大型语言模型奠定了有原则的基础。
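The primal-dual idea in the HIPO abstract can be illustrated on a toy constrained problem. The sketch below is not the paper's algorithm: the scalar "policy" `theta`, the utility `u(theta) = theta`, the violation cost `c(theta) = theta**2`, and the budget of 0.25 are all hypothetical stand-ins chosen so the optimum (`theta = 0.5`, multiplier `1.0`) can be checked by hand.

```python
def primal_dual(steps=20000, lr_theta=0.01, lr_lambda=0.01, budget=0.25):
    """Toy primal-dual ascent on L(theta, lam) = u(theta) - lam * (c(theta) - budget).

    Hypothetical setup (not from the paper):
      u(theta) = theta      -> user utility to maximize
      c(theta) = theta**2   -> system-prompt violation cost, must stay <= budget
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on the Lagrangian with respect to theta.
        grad_theta = 1.0 - lam * 2.0 * theta
        theta = min(1.0, max(0.0, theta + lr_theta * grad_theta))
        # Dual step: raise the multiplier while the constraint is violated,
        # lower it (down to zero) while there is slack.
        lam = max(0.0, lam + lr_lambda * (theta ** 2 - budget))
    return theta, lam


theta, lam = primal_dual()
print(f"theta={theta:.3f}, lambda={lam:.3f}")
```

The multiplier acts as an automatically tuned penalty weight: utility is only pursued as far as the constraint allows, which is the asymmetry between system and user instructions that the abstract describes.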

DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

DyJR：通过动态Jensen-Shannon回放实现可验证奖励的强化学习多样性

Authors: Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16157
Pdf link: https://arxiv.org/pdf/2603.16157
Abstract While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
中文摘要 虽然强化学习（RL）增强了大型语言模型的推理能力，但像GRPO这样的同策略（on-policy）算法由于丢弃了过去的rollout，样本效率较低。现有的经验回放方法通过重复使用正确样本进行直接策略更新来解决这个问题，但这通常会带来高计算成本，并因过拟合而导致模式崩溃。我们主张历史数据应优先用于维持多样性，而非仅仅强化准确性。为此，我们提出了动态Jensen-Shannon回放（DyJR），这是一种简单而有效的正则化框架，使用由近期轨迹构成的动态参考分布。DyJR引入了两项创新：（1）时间敏感的动态缓冲区，利用先进先出和自适应大小，仅保留时间上邻近的样本，与模型演化保持同步；（2）Jensen-Shannon散度正则化，用分布约束替代直接梯度更新，以防止多样性崩溃。在数学推理和Text-to-SQL基准测试上的实验表明，DyJR显著优于GRPO以及RLEP和Ex-GRPO等基线，同时保持与原始GRPO相当的训练效率。此外，从Rank-$k$令牌概率演化的角度，我们表明DyJR增强了多样性，减轻了对Rank-1令牌的过度依赖，阐明了DyJR的特定子模块如何影响训练动态。
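As a rough illustration of the two DyJR ingredients, the sketch below pairs a fixed-size FIFO buffer with a Jensen-Shannon divergence between a current distribution and a reference built from the buffer. The class name `RecentReference`, the use of a plain average as the reference, and the fixed buffer size are assumptions made here; the paper's adaptive sizing and its exact regularizer are not reproduced.

```python
import math
from collections import deque


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


class RecentReference:
    """FIFO buffer of recent distributions (hypothetical helper, fixed size)."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)  # oldest entries are evicted first

    def add(self, dist):
        self.buffer.append(list(dist))

    def reference(self):
        # Assumption: the reference is the element-wise mean of buffered distributions.
        n = len(self.buffer)
        return [sum(d[i] for d in self.buffer) / n for i in range(len(self.buffer[0]))]


ref = RecentReference(max_size=2)
for dist in ([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]):
    ref.add(dist)
current = [0.9, 0.05, 0.05]
print(js_divergence(current, ref.reference()))  # larger when the policy sharpens
```

Because the divergence is bounded, using it as a penalty limits how far the current distribution can drift from recent behavior without the unbounded gradients that a one-sided KL term can produce.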

Execution-Grounded Credit Assignment for GRPO in Code Generation

代码生成中面向GRPO的基于执行的信用分配

Authors: Abhijit Kumar, Natalya Kumar, Shikhar Gupta
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16158
Pdf link: https://arxiv.org/pdf/2603.16158
Abstract Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.
中文摘要 无批评者（critic-free）的可验证奖励强化学习（RLVR）通过优化单元测试通过率提升代码生成，但GRPO式更新存在粗粒度的信用分配问题：即使失败源于局部语义错误，单一结果信号仍被均匀分摊到整个长程序上。我们提出了基于执行的信用分配（EGCA），利用执行轨迹对GRPO更新进行局部化。对于满足算法约束但未通过测试的程序，EGCA会在相同的插桩条件下执行候选程序和一个规范参考解（离线一次性整理；仅用于分析，而非监督），识别最早的语义分歧点，并仅将优势赋予对应的令牌跨度，同时屏蔽下游令牌。EGCA是一种即插即用的修改，无需批评者、辅助损失或学习得到的验证器，在HumanEval上取得82.1%的pass@1（比GRPO高3.1），在MBPP上取得68.9%（高1.5），墙钟时间开销为18%。
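The localization step described in the EGCA abstract can be sketched in two small functions: find the first point where the candidate's execution trace departs from the reference trace, then place the advantage only on the token span responsible. Both function names and the trace representation (a list of comparable state snapshots) are hypothetical, and the mapping from a trace step to a token span is assumed to be supplied by the instrumentation.

```python
def first_divergence(candidate_trace, reference_trace):
    """Index of the earliest step where the two traces differ, or None if identical."""
    for i, (cand, ref) in enumerate(zip(candidate_trace, reference_trace)):
        if cand != ref:
            return i
    if len(candidate_trace) != len(reference_trace):
        # One trace is a strict prefix of the other: they diverge where the shorter ends.
        return min(len(candidate_trace), len(reference_trace))
    return None


def localized_advantages(num_tokens, span, advantage):
    """Per-token advantages: `advantage` inside [start, end), zero everywhere else."""
    start, end = span
    return [advantage if start <= t < end else 0.0 for t in range(num_tokens)]


# Toy traces: the value of an accumulator after each loop iteration.
candidate = [0, 1, 3, 7]
reference = [0, 1, 2, 4]
step = first_divergence(candidate, reference)          # step == 2
# Assumption: instrumentation says trace step 2 was produced by tokens 5..8.
print(step, localized_advantages(12, (5, 9), -1.0))
```

Zeroing the tokens after the span reflects the abstract's point that downstream code is usually a consequence of the error, not a cause, so penalizing it would only add noise to the update.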

SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation

SQL-ASTRA：通过列集匹配和轨迹聚合缓解代理SQL中的稀疏反馈

Authors: Long Li, Zhijian Zhou, Jiangxuan Long, Peiyang Liu, Weidi Xu, Zhe Wang, Shirui Pan, Chao Qu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16161
Pdf link: https://arxiv.org/pdf/2603.16161
Abstract Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a framework featuring a universal two-tiered reward mechanism designed to provide effective trajectory-level evaluation and dense step-level signals. First, we introduce Aggregated Trajectory Reward (ATR) to resolve multi-turn credit assignment. Using an asymmetric transition matrix, ATR aggregates process-oriented scores to incentivize continuous improvement. Leveraging Lyapunov stability theory, we prove ATR acts as an energy dissipation operator, guaranteeing a cycle-free policy and monotonic convergence. Second, Column-Set Matching Reward (CSMR) provides immediate step-level rewards to mitigate sparsity. By executing queries at each turn, CSMR converts binary (0/1) feedback into dense [0, 1] signals based on partial correctness. Evaluations on BIRD show a 5% gain over binary-reward GRPO. Notably, our approach outperforms SOTA Arctic-Text2SQL-R1-7B on BIRD and Spider 2.0 using identical models, propelling Text-to-SQL toward a robust multi-turn agent paradigm.
中文摘要 代理强化学习（RL）在复杂任务中展现出潜力，但文本转SQL仍主要局限于单回合范式。一个主要瓶颈是信用分配问题。在传统范式中，奖励仅由最终回合反馈决定，忽视了中间过程，导致信用评估模糊不清。为此，我们提出了代理SQL框架，该框架具有通用的两层奖励机制，旨在提供有效的轨迹级评估和密集的步级信号。首先，我们引入了汇总轨迹奖励（ATR）来解决多回合的信用分配。ATR采用非对称过渡矩阵，汇总流程导向的评分以激励持续改进。利用李雅普诺夫稳定性理论，我们证明ATR作为能量耗散算符，保证无循环策略和单调收敛。其次，列集匹配奖励（CSMR）提供即时的步骤级奖励，以减少稀疏性。通过在每回合执行查询，CSMR将二进制（0/1）反馈转换为基于部分正确性的密集[0， 1]信号。BIRD的评估显示，比二元奖励GRPO提高了5%。值得注意的是，我们的方法在使用相同模型时优于SOTA Arctic-Text2SQL-R1-7B在BIRD和Spider 2.0上，推动文本转SQL迈向稳健的多回合代理范式。
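The abstract does not give the CSMR formula, so the following is only one plausible way to turn a binary execution match into a dense score in [0, 1]. It assumes, purely for illustration, that each result column is compared as an order-insensitive bag of values and that the reward is the Jaccard overlap between the predicted and gold column sets; `columns` and `column_set_reward` are hypothetical names.

```python
def columns(rows):
    """Represent a query result as a set of columns, each a sorted tuple of value reprs."""
    if not rows:
        return set()
    return {tuple(sorted(map(repr, col))) for col in zip(*rows)}


def column_set_reward(pred_rows, gold_rows):
    """Jaccard overlap of column sets: 1.0 for a full match, 0.0 for no shared column."""
    pred, gold = columns(pred_rows), columns(gold_rows)
    if not pred and not gold:
        return 1.0  # both queries returned nothing
    return len(pred & gold) / len(pred | gold)


gold = [(1, "alice"), (2, "bob")]
partial = [(1,), (2,)]            # right ids, missing the name column
print(column_set_reward(partial, gold))
```

A query that retrieves the right ids but omits a column earns partial credit here, whereas a binary execution-match reward would score it the same as a completely wrong query.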

Enforcing Task-Specified Compliance Bounds for Humanoids via Anisotropic Lipschitz-Constrained Policies

通过各向异性利普希茨约束策略为人形机器人强制执行任务指定的柔顺性界限

Authors: Zewen He, Yoshihiko Nakamura
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.16180
Pdf link: https://arxiv.org/pdf/2603.16180
Abstract Reinforcement learning (RL) has demonstrated substantial potential for humanoid bipedal locomotion and the control of complex motions. To cope with oscillations and impacts induced by environmental interactions, compliant control is widely regarded as an effective remedy. However, the model-free nature of RL makes it difficult to impose task-specified and quantitatively verifiable compliance objectives, and classical model-based stiffness designs are not directly applicable. Lipschitz-Constrained Policies (LCP), which regularize the local sensitivity of a policy via gradient penalties, have recently been used to smooth humanoid motions. Nevertheless, existing LCP-based methods typically employ a single scalar Lipschitz budget and lack an explicit connection to physically meaningful compliance specifications in real-world systems. In this study, we propose an anisotropic Lipschitz-constrained policy (ALCP) that maps a task-space stiffness upper bound to a state-dependent Lipschitz-style constraint on the policy Jacobian. The resulting constraint is enforced during RL training via a hinge-squared spectral-norm penalty, preserving physical interpretability while enabling direction-dependent compliance. Experiments on humanoid robots show that ALCP improves locomotion stability and impact robustness, while reducing oscillations and energy usage.
中文摘要 强化学习（RL）已在人形机器人双足行走和复杂运动控制方面展示出巨大潜力。为了应对环境交互引起的振荡和冲击，柔顺控制被广泛认为是一种有效的解决办法。然而，强化学习的无模型特性使得施加任务指定且可定量验证的柔顺性目标变得困难，经典的基于模型的刚度设计也无法直接适用。利普希茨约束策略（LCP）通过梯度惩罚对策略的局部敏感性进行正则化，最近被用于平滑人形机器人的运动。然而，现有基于LCP的方法通常采用单一标量的Lipschitz预算，且缺乏与真实系统中具有物理意义的柔顺性指标的明确关联。本研究提出一种各向异性利普希茨约束策略（ALCP），将任务空间刚度上界映射为作用于策略雅可比矩阵的、依赖状态的利普希茨式约束。在强化学习训练中，通过铰链平方的谱范数惩罚来强制执行该约束，既保持物理可解释性，又实现方向相关的柔顺性。在人形机器人上的实验显示，ALCP提升了运动稳定性和抗冲击鲁棒性，同时减少了振荡和能耗。
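A hinge-squared spectral-norm penalty has the generic form max(0, ||M||_2 - 1)^2. The sketch below shows that form with a simple per-output scaling standing in for the anisotropy. How the paper derives its state-dependent bound from task-space stiffness is not stated in the abstract, so the `bounds` vector and the row scaling are assumptions, and the spectral norm is estimated with plain power iteration to keep the example dependency-free.

```python
import math


def spectral_norm(matrix, iterations=200):
    """Largest singular value via power iteration on A^T A.

    Caveat: the all-ones start vector fails if it is orthogonal to the top
    right-singular vector; a random start avoids that in practice.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    sigma = 0.0
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # A v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # A^T A v
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
        sigma = math.sqrt(norm_w)  # ||A^T A v|| approaches sigma_max^2 for unit v
    return sigma


def anisotropic_penalty(jacobian, bounds):
    """Hinge-squared penalty on the Jacobian after scaling row i by 1 / bounds[i].

    Assumption: bounds[i] is the allowed sensitivity of action dimension i, so the
    constraint reads ||diag(1/bounds) @ J||_2 <= 1.
    """
    scaled = [[entry / bound for entry in row] for row, bound in zip(jacobian, bounds)]
    return max(0.0, spectral_norm(scaled) - 1.0) ** 2


J = [[3.0, 0.0], [0.0, 1.0]]
print(anisotropic_penalty(J, bounds=[2.0, 1.0]))  # first direction exceeds its bound
print(anisotropic_penalty(J, bounds=[4.0, 2.0]))  # within bounds, no penalty
```

The hinge keeps the penalty at zero while the constraint holds, so the regularizer only affects training in directions where the policy is more sensitive than the specification allows.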

Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

SVG-LLMs中的多任务多奖励强化学习的可靠推理

Authors: Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16189
Pdf link: https://arxiv.org/pdf/2603.16189
Abstract With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
中文摘要 随着视觉语言模型的快速发展，越来越多的研究探索它们在SVG生成任务中的潜力。尽管现有方法通过构建大规模SVG数据集和引入SVG专用标记来提升性能，但它们仍存在推广有限、代码输出中冗余路径以及缺乏显式推理的问题。在本研究中，我们提出了CTRL-S（SVG思维链强化学习），这是一个统一框架，引入了思维链机制，以显式地揭示模型在SVG生成过程中的推理过程。为了支持这种结构化推理，我们构建了SVG-Sophia，这是一个高质量数据集，包含14.5万个样本，涵盖SVG代码细化、文本转SVG和图像转SVG任务。通过训练模型生成组级结构化SVG代码，CTRL-S显著提升了结构一致性和视觉真实性。此外，我们采用GRPO算法，设计多奖励优化框架，整合了DINO、图像-文本相似度、格式和代码效率奖励。通过联合多奖励优化和多任务训练，我们的方法系统地提升了整体生成能力。大量实验表明，CTRL-S 优于现有方法，实现更高的任务成功率、更优越的 SVG 代码质量和卓越的视觉真实度。

Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

离线探索感知的长链数学推理微调

Authors: Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16206
Pdf link: https://arxiv.org/pdf/2603.16206
Abstract Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
中文摘要 通过鼓励自我探索，可验证奖励强化学习（RLVR）显著提升了大型语言模型的数学推理能力。作为RLVR的起点，监督式微调（SFT）记忆新的思维链轨迹的能力，是塑造后续探索格局的关键初始化。然而，现有研究主要集中在促进RLVR培训中的探索，使得探索意识型SFT尚未被充分探索。为弥合这一差距，我们提出了离线探索感知（OXA）微调方案。具体来说，OXA优化了两个目标：促进低置信度的教师提炼数据，以内化此前未捕捉到的推理模式;以及抑制高置信度错误的自我提炼数据，将错误模式的概率量重新分配给潜在正确的候选人。跨越6个基准测试的实验结果显示，OXA持续提升数学推理表现，尤其是在Qwen2.5-1.5B-Math上，平均获得$+6$Pass@1和$+5$ Pass@$k$的提升。关键是，OXA提高了初始政策熵，且性能提升贯穿广泛的RLVR培训，证明了OXA的长期价值。

Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

双重共识：通过两阶段投票机制摆脱无监督RLVR中的虚假多数

Authors: Kaixuan Du, Meng Cao, Hang Zhang, Yukun Wang, Xiangzhou Huang, Ni Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16223
Pdf link: https://arxiv.org/pdf/2603.16223
Abstract Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby becoming trapped in a dominant mode and limiting further improvements. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method that generates more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; then it serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.
中文摘要 当前用于大型语言模型（LLM）的无标签RLVR方法，如TTRL和自我奖励，已证明在提升LLMs在复杂推理任务中的表现方面具有有效性。然而，这些方法高度依赖准确的伪标签估计，并趋于虚假但流行的答案，从而陷入主导模式，限制了后续改进。基于此，我们提出了双重共识强化学习（Dual Consensus Reinforcement Learning，DCRL），这是一种新型的自监督训练方法，能够通过两阶段共识机制生成更可靠的学习信号。模型最初充当锚点，产生主导反应;然后它作为探索者，通过临时的去学习过程生成多样化的辅助信号。最终训练目标是从这两个信号集的谐波平均值中得出的。值得注意的是，该过程完全没有外部模型或监督。在八个基准和多样化领域，DCRL持续提升Pass@1多数票，同时带来更稳定的训练动态。这些结果表明，DCRL建立了一条可扩展的路径，实现无标签的更强推理。
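To see why a harmonic mean can resist a spurious majority, consider per-answer vote shares from the two stages. The sketch below is an interpretation, not the paper's procedure: it assumes each stage yields a list of sampled answers, scores each answer by the harmonic mean of its two vote shares, and takes the top-scoring answer as the pseudo-label. The function name `dual_consensus` is hypothetical.

```python
from collections import Counter


def dual_consensus(anchor_answers, explorer_answers):
    """Score each answer by the harmonic mean of its anchor and explorer vote shares."""
    anchor = Counter(anchor_answers)
    explorer = Counter(explorer_answers)
    scores = {}
    for answer in set(anchor) | set(explorer):
        a = anchor[answer] / len(anchor_answers)
        e = explorer[answer] / len(explorer_answers)
        # The harmonic mean is zero unless BOTH stages support the answer.
        scores[answer] = 2 * a * e / (a + e) if a > 0 and e > 0 else 0.0
    return max(scores, key=scores.get), scores


anchor_stage = ["A"] * 6 + ["B"] * 4                 # plain majority vote would pick "A"
explorer_stage = ["A"] * 1 + ["B"] * 5 + ["C"] * 4   # "A" loses support after exploration
label, scores = dual_consensus(anchor_stage, explorer_stage)
print(label, scores)
```

An answer that dominates one stage but nearly vanishes in the other is pulled toward the smaller share, so only answers supported in both stages score highly.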

VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

VIGOR：VIdeo 以几何为导向的时间生成对齐奖励

Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16271
Pdf link: https://arxiv.org/pdf/2603.16271
Abstract Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
中文摘要 视频扩散模型在训练过程中缺乏明确的几何监督，导致生成视频中出现物体变形、空间漂移和深度违规等不一致伪影。为解决这一限制，我们提出了一种基于几何的奖励模型，利用预训练的几何基础模型通过跨帧重投影误差评估多视图一致性。与以往测量像素空间不一致的几何度量不同，像素强度可能带来额外噪声，我们的方法以点状方式进行误差计算，从而获得更物理化且稳健的误差度量。此外，我们引入了一种几何感知采样策略，过滤掉低纹理和非语义区域，重点评估具有几何意义且对应关系可靠，以提升鲁棒性。我们将该奖励模型应用于通过两条互补路径对齐视频扩散模型：通过SFT或强化学习对双向模型进行后训练，以及通过测试时间尺度对因果视频模型（如流式视频生成器）进行推理时间优化，奖励作为路径验证器。实验结果验证了我们设计的有效性，表明基于几何的奖励相比其他变体具有更优越的鲁棒性。通过实现高效的推理时间尺度，我们的方法为增强开源视频模型提供了实用的解决方案，而无需大量计算资源进行重新训练。

Agile Interception of a Flying Target using Competitive Reinforcement Learning

利用竞争强化学习对飞行目标进行敏捷拦截

Authors: Timothée Gavin (ENAC-LAB, LAAS-RIS), Simon Lacroix (LAAS-RIS), Murat Bronz (ENAC)
Subjects: Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.16279
Pdf link: https://arxiv.org/pdf/2603.16279
Abstract This article presents a solution to intercept an agile drone by another agile drone carrying a catching net. We formulate the interception as a Competitive Reinforcement Learning problem, where the interceptor and the target drone are controlled by separate policies trained with Proximal Policy Optimization (PPO). We introduce a high-fidelity simulation environment that integrates a realistic quadrotor dynamics model and a low-level control architecture implemented in JAX, which allows for fast parallelized execution on GPUs. We train the agents using low-level control, collective thrust and body rates, to achieve agile flights both for the interceptor and the target. We compare the performance of the trained policies in terms of catch rate, time to catch, and crash rate, against common heuristic baselines and show that our solution outperforms these baselines for interception of agile targets. Finally, we demonstrate the performance of the trained policies in a scaled real-world scenario using agile drones inside an indoor flight arena.
中文摘要 本文提出了一种解决方案，用一架携带捕捉网的敏捷无人机拦截另一架敏捷无人机。我们将拦截表述为竞争强化学习问题，拦截机和目标无人机分别由通过近端策略优化（PPO）训练的独立策略控制。我们引入了一个高保真模拟环境，集成了逼真的四旋翼动力学模型和用JAX实现的低级控制架构，可在GPU上快速并行执行。我们使用低级控制量（总推力和机体角速率）训练智能体，使拦截机和目标都能实现敏捷飞行。我们在捕获率、捕获时间和坠毁率方面将训练得到的策略与常见启发式基线进行比较，结果显示我们的解决方案在拦截敏捷目标方面优于这些基线。最后，我们在室内飞行场内使用敏捷无人机，在缩比的真实世界场景中展示了所训练策略的表现。

Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement

通过强化学习虚拟鱼类运动控制鱼群

Authors: Yusuke Nishii, Hiroaki Kawashima
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Arxiv link: https://arxiv.org/abs/2603.16384
Pdf link: https://arxiv.org/pdf/2603.16384
Abstract This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges such as durability and movement constraints inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic "stay-at-edge" strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.
中文摘要 本研究探讨了一种利用强化学习训练的虚拟鱼类来引导和控制鱼群的方法。我们利用屏幕上显示的二维虚拟鱼类，克服了物理机器人智能体固有的耐用性和运动限制等技术挑战。为了解决真实鱼类缺乏详细行为模型的问题，我们采用了无模型强化学习方法。首先，模拟结果表明，即使模拟的真实鱼类经常忽视虚拟刺激，强化学习也能获得有效的运动策略。其次，真实的活鱼实验证实，所学策略能成功引导鱼群朝向指定目标方向。统计分析显示，所提方法显著优于基础条件，包括缺乏刺激和启发式“保持边缘”策略。本研究早期展示了强化学习如何通过人工代理影响集体动物行为。

Deep Reinforcement Learning-Assisted Automated Operator Portfolio for Constrained Multi-objective Optimization

深度强化学习辅助自动化操作员组合，用于受限多目标优化

Authors: Shuai Shao, Ye Tian, Shangshang Yang, Xingyi Zhang
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2603.16401
Pdf link: https://arxiv.org/pdf/2603.16401
Abstract Constrained multi-objective optimization problems (CMOPs) are of great significance in the context of practical applications, ranging from scientific to engineering domains. Most existing constrained multi-objective evolutionary algorithms (CMOEAs) usually employ fixed operators all the time, which exhibit poor versatility in handling various CMOPs. Therefore, some recent studies have focused on adaptively selecting the best operators for the current population states during the search process. The evolutionary algorithms proposed in these studies learn the value of each operator and recommend the operator with the highest value for the current population, resulting in only a single operator being recommended at each generation, which can potentially lead to local optima and inefficient utilization of function evaluations. To address the dilemma in operator adaptation, this paper proposes a reinforcement learning-based automated operator portfolio approach to learn an allocation scheme of operators at each generation. This approach considers the optimization-related and constraint-related features of the current population as states, the overall improvement in population convergence and diversity as rewards, and different operator portfolios as actions. By utilizing deep neural networks to establish a mapping model between the population states and the expected cumulative rewards, the proposed approach determines the optimal operator portfolio during the evolutionary process. By embedding the proposed approach into existing CMOEAs, a deep reinforcement learning-assisted automated operator portfolio based evolutionary algorithm for solving CMOPs, abbreviated as CMOEA-AOP, is developed. Empirical studies on 33 benchmark problems demonstrate that the proposed algorithm significantly enhances the performance of CMOEAs and exhibits more stable performance across different CMOPs.
中文摘要 受限多目标优化问题（CMOPs）在实际应用中具有重要意义，涵盖从科学到工程领域的广泛应用。大多数现有的受限多目标进化算法（CMOEA）通常始终使用固定算符，这在处理各种CMOP时表现出较差的灵活性。因此，一些近期研究重点在搜索过程中自适应地选择当前种群状态的最佳算符。这些研究中提出的进化算法会学习每个算子的值，并推荐当前总体中值最高的算子，导致每代只推荐一个算子，这可能导致局部最优和函数评估效率低下。为解决算符适应中的困境，本文提出了基于强化学习的自动算子组合方法，用于学习每代算子的配置方案。该方法将当前种群的优化相关和约束相关特征视为状态，将种群趋同和多样性的整体提升视为奖励，将不同的操作员组合视为行动。通过利用深度神经网络建立种群状态与预期累计回报之间的映射模型，该方法确定了进化过程中最优操作员组合。通过将该方法嵌入现有CMOEA中，开发出一种深度强化学习辅助的自动化操作员组合进化算法，简称CMOEA-AOP。对33个基准问题的实证研究表明，所提算法显著提升了CMOEA的性能，并在不同CMOP中表现出更稳定的性能。

Onboard MuJoCo-based Model Predictive Control for Shipboard Crane with Double-Pendulum Sway Suppression

面向船载起重机双摆摆动抑制的机载MuJoCo模型预测控制

Authors: Oscar Pang, Lisa Coiffard, Paul Templier, Luke Beddow, Kamil Dreczkowski, Antoine Cully
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.16407
Pdf link: https://arxiv.org/pdf/2603.16407
Abstract Transferring heavy payloads in maritime settings relies on efficient crane operation, limited by hazardous double-pendulum payload sway. This sway motion is further exacerbated in offshore environments by external perturbations from wind and ocean waves. Manual suppression of these oscillations on an underactuated crane system by human operators is challenging. Existing control methods struggle in such settings, often relying on simplified analytical models, while deep reinforcement learning (RL) approaches tend to generalise poorly to unseen conditions. Deploying a predictive controller onto compute-constrained, highly non-linear physical systems without relying on extensive offline training or complex analytical models remains a significant challenge. Here we show a complete real-time control pipeline centered on the MuJoCo MPC framework that leverages a cross-entropy method planner to evaluate candidate action sequences directly within a physics simulator. By using simulated rollouts, this sampling-based approach successfully reconciles the conflicting objectives of dynamic target tracking and sway damping without relying on complex analytical models. We demonstrate that the controller can run effectively on resource-constrained embedded hardware, while outperforming traditional PID and RL baselines in counteracting external base perturbations. Furthermore, our system demonstrates robustness even when subjected to unmodeled physical discrepancies like the introduction of a second payload.
中文摘要 在海事环境中转移重型有效载荷依赖于高效的起重机操作，但受限于危险的双摆有效载荷摆动。这种摆动运动在海上环境下因风和海浪的外部扰动而进一步加剧。人工操作员在欠驱动起重机系统上手动抑制这些振荡具有挑战性。现有控制方法在此类环境中表现不佳，常依赖简化的分析模型，而深度强化学习（RL）方法则难以推广到看不见的条件。在计算受限、高度非线性的物理系统中部署预测控制器，而不依赖大量离线训练或复杂分析模型，仍是一项重大挑战。这里我们展示了一个基于MuJoCo MPC框架的完整实时控制流水线，利用交叉熵方法规划器直接在物理模拟器中评估候选动作序列。通过使用模拟展开，这种基于采样的方法成功调和了动态目标跟踪与摆动阻尼的冲突目标，而无需依赖复杂的分析模型。我们证明了控制器可以在资源受限的嵌入式硬件上高效运行，同时在抵消外部基扰方面优于传统的PID和强化学习基线。此外，我们的系统即使在未模拟的物理差异（如引入第二个有效载荷）时也表现出鲁棒性。
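To make the idea of a cross-entropy method planner that scores candidate action sequences by simulated rollouts more concrete, here is a minimal Python sketch. It is not the authors' pipeline and does not use the MuJoCo API: a toy 1-D point mass stands in for the physics simulator, and the cost terms, horizon, sample counts, and function names are assumptions chosen for illustration.

```python
import random
import statistics


def rollout_cost(pos, vel, actions, target=1.0, dt=0.1):
    """Simulate a toy 1-D point mass and return the accumulated tracking cost."""
    cost = 0.0
    for a in actions:
        vel += a * dt
        pos += vel * dt
        # Penalise distance to the target, residual velocity, and control effort.
        cost += (pos - target) ** 2 + 0.1 * vel ** 2 + 0.001 * a ** 2
    return cost


def cem_plan(pos, vel, horizon=15, samples=64, elites=8, iters=5, seed=0):
    """Cross-entropy method: sample sequences, keep the elites, refit a Gaussian, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(samples)
        ]
        candidates.sort(key=lambda seq: rollout_cost(pos, vel, seq))
        best = candidates[:elites]
        # Refit the per-timestep sampling distribution to the elite sequences.
        mean = [statistics.fmean(col) for col in zip(*best)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*best)]
    return mean


plan = cem_plan(pos=0.0, vel=0.0)
planned_cost = rollout_cost(0.0, 0.0, plan)
idle_cost = rollout_cost(0.0, 0.0, [0.0] * len(plan))
```

In a receding-horizon controller, only the first action of `plan` would be applied before replanning from the newly observed state.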

Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

AI对齐的否定之路：为何负向约束在结构上优于正向偏好

Authors: Quan Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16417
Pdf link: https://arxiv.org/pdf/2603.16417
Abstract Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.
中文摘要 近期实证结果表明，仅用负反馈训练大型语言模型（LLMs）可以匹敌甚至超过标准的人类反馈强化学习（RLHF）。负样本强化在数学推理上与PPO相当;分布偏好优化仅使用不偏好样本进行训练;宪法人工智能在无害基准测试中优于纯RLHF。然而，没有统一的理论解释为何负面信号如此有效。本文提出了这样的解释：正偏好和负约束在结构上是不对称的。积极偏好（“哪个更好”）编码了连续耦合的、依赖情境的人类价值，无法被详尽描述——这促使模型学习表面相关性，如与用户的一致性（谄媚）。负约束（“哪里错了”）编码了离散的、有限的、可独立验证的禁止，这些限制可以收敛到一个稳定的边界。这种不对称——根植于波普尔的证伪逻辑和负知识的认识论——解释了基于偏好的RLHF的谄媚失败以及负信号方法令人惊讶的有效性。我们主张比对研究应将重心从“学习人类偏好”转向“学习人类拒绝什么”，并为该框架提供可检验的预测。

Agentic AI for SAGIN Resource Management: Semantic Awareness, Orchestration, and Optimization

面向SAGIN资源管理的代理式人工智能：语义感知、编排与优化

Authors: Linghao Zhang, Haitao Zhao, Bo Xu, Hongbo Zhu, Xianbin Wang
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.16458
Pdf link: https://arxiv.org/pdf/2603.16458
Abstract Space-air-ground integrated networks (SAGIN) promise ubiquitous 6G connectivity but face significant resource management challenges due to heterogeneous infrastructure, dynamic topologies, and stringent quality-of-service (QoS) requirements. Conventional model-driven approaches struggle with scalability and adaptability in such complex environments. This paper presents an agentic artificial intelligence (AI) framework for autonomous SAGIN resource management by embedding large language model (LLM)-based agents into a Monitor-Analyze-Plan-Execute-Knowledge (MAPE-K) control plane. The framework incorporates three specialized agents, namely semantic resource perceivers, intent-driven orchestrators, and adaptive learners, that collaborate through natural language reasoning to bridge the gap between operator intents and network execution. A key innovation is the hierarchical agent-reinforcement learning (RL) collaboration mechanism, wherein LLM-based orchestrators dynamically shape reward functions for RL agents based on semantic network conditions. Validation through UAV-assisted AIGC service orchestration in energy-constrained scenarios demonstrates that LLM-driven reward shaping achieves 14% energy reduction and the lowest average service latency among all compared methods. This agentic paradigm offers a scalable pathway toward adaptive, AI-native 6G networks, capable of autonomously interpreting intents and adapting to dynamic environments.
中文摘要 空地集成网络（SAGIN）承诺实现无处不在的6G连接，但由于基础设施异构、动态拓扑和严格的服务质量（QoS）要求，面临重大资源管理挑战。传统的模型驱动方法在如此复杂的环境中难以实现可扩展性和适应性。本文提出了一种通过将大型语言模型（LLM）代理嵌入监控-分析-计划-执行-知识（MAPE-K）控制平面，实现自主SAGIN资源管理的代理人工智能（AI）框架。该框架包含三种专业代理，即语义资源感知者、意图驱动编排器和自适应学习者，它们通过自然语言推理协作，弥合操作者意图与网络执行之间的鸿沟。一项关键创新是层级智能体-强化学习（RL）协作机制，基于LLM的编排器基于语义网络条件动态塑造RL代理的奖励函数。通过无人机辅助AIGC服务编排在能量受限场景下的验证表明，LLM驱动的奖励塑造实现了14%的能量削减和所有比较方法中最低的平均服务延迟。这一代理范式为自适应、AI原生的6G网络提供了可扩展的路径，能够自主解读意图并适应动态环境。

Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

《追踪线索，框架真相：开放词汇多模态情绪识别中的混合证据演绎推理》

Authors: Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2603.16463
Pdf link: https://arxiv.org/pdf/2603.16463
Abstract Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines--especially in ambiguous or conflicting scenarios--while providing interpretable, diagnostic evidence traces.
中文摘要 开放词汇多模态情绪识别（OV-MER）本质上具有挑战性，因为模棱两可的多模态线索往往源自不同的未被观察到的情境动态。虽然多模态大型语言模型（MLLM）提供了广泛的语义覆盖，但其性能常因过早投入主导数据先验而受限，导致启发式方法次优，忽视了各模态间关键且互补的情感线索。我们认为，有效的情感推理需要的不仅仅是表面上的联想;它需要通过综合多种基于证据的理由，重新构建细腻的情绪状态，以调和来自不同潜在视角的观察。我们介绍HyDRA，一种混合证据演绎推理架构，将推理形式化为提议-验证-决定协议。为了内化这一溯因过程，我们采用了带有层级奖励塑造的强化学习，使推理轨迹与最终任务表现保持一致，以确保它们能最佳地调和观察到的多模态线索。系统评估验证了我们的设计选择，HyDRA持续优于强基线——尤其是在模糊或冲突情境下——同时提供了可解释的诊断证据痕迹。

Multi-Agent Reinforcement Learning Counteracts Delayed CSI in Multi-Satellite Systems

多智能体强化学习抵消多卫星系统中延迟的CSI

Authors: Marios Aristodemou, Yasaman Omid, Sangarapillai Lambotharan, Mahsa Derakhshan, Lajos Hanzo
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2603.16470
Pdf link: https://arxiv.org/pdf/2603.16470
Abstract The integration of satellite communication networks with next-generation (NG) technologies is a promising approach towards global connectivity. However, the quality of service is highly dependent on the availability of accurate channel state information (CSI). Channel estimation in satellite communications is challenging due to the high propagation delay between terrestrial users and satellites, which results in outdated CSI observations on the satellite side. In this paper, we study the downlink transmission of multiple satellites acting as distributed base stations (BS) to mobile terrestrial users. We propose a multi-agent reinforcement learning (MARL) algorithm which aims to maximise the sum-rate of the users while coping with the outdated CSI. We design a novel bi-level optimisation procedure, termed dual-stage proximal policy optimisation (DS-PPO), for tackling the problem of large continuous action spaces as well as of independent and non-identically distributed (non-IID) environments in MARL. Specifically, the first stage of DS-PPO maximises the sum-rate for an individual satellite and the second stage maximises the sum-rate when all the satellites cooperate to form a distributed multi-antenna BS. Our numerical results demonstrate the robustness of DS-PPO to CSI imperfections as well as the sum-rate improvement attained by the use of DS-PPO. In addition, we provide the convergence analysis for DS-PPO along with its computational complexity.
中文摘要 将卫星通信网络与下一代（NG）技术整合，是实现全球互联的有前景的途径。然而，服务质量高度依赖于准确信道状态信息（CSI）的可用性。卫星通信中的信道估计具有挑战性，因为地面用户与卫星之间的传播延迟较高，导致卫星端的CSI观测数据过时。本文研究了多颗作为分布式基站（BS）的卫星向移动地面用户传输的下行链路。我们提出了一种多智能体强化学习（MARL）算法，旨在最大化用户的总和速率，同时应对过时的CSI。我们设计了一种新型的双层优化程序，称为双阶段近端策略优化（DS-PPO），用于解决MARL中大型连续动作空间以及独立非同分布（非IID）环境的问题。具体来说，DS-PPO的第一阶段最大化单个卫星的总和速率，第二阶段在所有卫星协同形成分布式多天线BS时最大化总和速率。我们的数值结果证明了DS-PPO对CSI缺陷的鲁棒性，以及使用DS-PPO所获得的总和速率提升。此外，我们还提供了DS-PPO的收敛分析以及计算复杂度。

From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

由内而外：面向置信度校准的渐进式分布细化

Authors: Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16500
Pdf link: https://arxiv.org/pdf/2603.16500
Abstract Leveraging the model's internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model's confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.
中文摘要 利用模型内部信息作为强化学习（RL）中的自我奖励信号，因其无标签特性而受到广泛关注。尽管以往工作在将测试时间缩放（TTS）策略应用于强化学习方面取得了显著进展，但测试与训练之间内部信息的差异仍未得到充分解决。此外，基于投票TTS策略的测试时训练常常存在奖励黑客问题。为解决这些问题，我们提出了DistriTTRL，利用模型在强化学习期间置信度的分布先验，逐步优化奖励信号，而非仅依赖单查询部署。此外，我们通过针对多样性的惩罚措施，缓解了基于投票的TTS策略导致的持续奖励黑客现象。借助模型能力与自我奖励信号互补的训练机制，以及奖励黑客的缓解，DistriTTRL 在多个模型和基准测试中取得了显著的性能提升。
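For readers unfamiliar with the voting-based self-reward that this abstract takes as its starting point, the Python sketch below shows the generic majority-vote scheme: the most frequent answer among several rollouts is treated as a pseudo-label, and each rollout is rewarded for agreeing with it. This is only the baseline idea as commonly described, not DistriTTRL itself; the function name and the returned agreement ratio are assumptions for illustration.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Label-free rewards: 1.0 for rollouts that match the most common answer, else 0.0."""
    counts = Counter(answers)
    pseudo_label, votes = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    # The vote share is a crude confidence estimate for the pseudo-label.
    return pseudo_label, votes / len(answers), rewards


label, agreement, rewards = majority_vote_rewards(["42", "42", "41", "42"])
```

Because the reward only checks agreement, a policy can raise it by producing identical answers regardless of correctness, which is one way to read the reward hacking problem the abstract mentions.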

Kamino: GPU-based Massively Parallel Simulation of Multi-Body Systems with Challenging Topologies

Kamino：基于GPU的多体系统大规模并行仿真，具有挑战性的拓扑结构

Authors: Vassilios Tsounis, Guirec Maloisel, Christian Schumacher, Ruben Grandia, Agon Serifi, David Müller, Chris Amevor, Tobias Widmer, Moritz Bächer
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.16536
Pdf link: https://arxiv.org/pdf/2603.16536
Abstract We present Kamino, a GPU-based physics solver for massively parallel simulations of heterogeneous highly-coupled mechanical systems. Implemented in Python using NVIDIA Warp and integrated into the Newton framework, it enables the application of data-driven methods, such as large-scale reinforcement learning, to complex robotic systems that exhibit strongly coupled kinematic and dynamic constraints such as kinematic loops. The latter are often circumvented by practitioners, who approximate the system topology as a kinematic tree and incorporate explicit loop-closure constraints or so-called mimic joints. Kamino aims at alleviating this burden by natively supporting these types of coupling. This capability facilitates high-throughput parallelized simulations that capture the true nature of mechanical systems that exploit closed kinematic chains for mechanical advantage. Moreover, Kamino supports heterogeneous worlds, allowing for batched simulation of structurally diverse robots on a single GPU. At its core lies a state-of-the-art constrained optimization algorithm that computes constraint forces by solving the constrained rigid multi-body forward dynamics transcribed as a nonlinear complementarity problem. This leads to high-fidelity simulations that can resolve contact dynamics without resorting to approximate models that simplify and/or convexify the problem. We demonstrate RL policy training on DR Legs, a biped with six nested kinematic loops, generating a feasible walking policy while simulating 4096 parallel environments on a single GPU.
中文摘要 我们介绍Kamino，一款基于GPU的物理求解器，用于大规模并行模拟异构高耦合机械系统。该系统通过 Python 利用 NVIDIA Warp 实现，并集成到 Newton 框架中，使数据驱动方法（如大规模强化学习）能够应用于表现出强烈耦合运动学和动态约束（如运动学环）的复杂机器人系统。后者常被从业者绕过;将系统拓扑近似为运动学树，并包含显式的环闭约束或所谓的拟态关节。Kamino旨在通过原生支持此类耦合来减轻这一负担。这一能力促进了高通量并行模拟，捕捉利用闭合运动链以获得机械优势的机械系统真实本质。此外，Kamino支持异构世界，允许在单一GPU上批量模拟结构多样的机器人。其核心是一种最先进的约束优化算法，通过求解受约束刚性多体前进动力学（以非线性互补性问题转录）来计算约束力。这促成了高精度仿真，能够在不依赖简化和/或凸化问题的近似模型的情况下解决接触动力学。我们在DR Legs上演示强化学习策略训练，这是一种拥有六个嵌套运动学环路的双足动物，在单一GPU上模拟4096个并行环境的同时生成可行的步行策略。

EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

EmoLLM：基于评估的认知-情感共推理在大型语言模型中

Authors: Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16553
Pdf link: https://arxiv.org/pdf/2603.16553
Abstract Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user's needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.
中文摘要 大型语言模型（LLMs）展现出强大的认知智能（IQ），但许多现实世界的互动也需要情商（EQ）才能产生既符合事实可靠性又符合情感的回应。在情感支持、技术支持和咨询等环境中，有效的对话取决于如何根据用户的需求、目标和应对能力来评估情况。受评估理论启发，我们提出了EmoLLM，这是一个基于评估的对话智商/情商共推理框架。EmoLLM使用显式评估推理图（ARG），在生成回复前，构建基于上下文事实、推断用户需求、评估维度、情绪状态和反应策略的中间推理。我们在多轮角色扮演环境中训练EmoLLM，采用强化学习，反向视角推理基于预测的用户端反应提供奖励信号。在多样化的对话环境中，EmoLLM在强有力基线下提升情绪状态结果和反应质量，同时保持了强烈的事实可靠性。

When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

无监督强化学习何时以及为何在数学推理中取得成功？流形包络视角

Authors: Zelin Zhang, Fei Cheng, Chenhui Chu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16578
Pdf link: https://arxiv.org/pdf/2603.16578
Abstract Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.
中文摘要 尽管基于结果的强化学习（RL）显著提升了大型语言模型（LLMs）的数学推理能力，但其对计算量大且基础真实注释的依赖带来了严重的可扩展性瓶颈。由内在奖励引导的无监督强化学习提供了可扩展的替代方案，但它存在不透明的训练动态和灾难性的不稳定性，如政策崩溃和奖励黑客攻击。本文首先设计并评估一套明确执行简洁且特定生成的内在奖励。其次，为了发现这种方法的边界，我们测试了跨越多种内在推理能力的基础模型，揭示模型的基础逻辑先验如何决定其成败。最后，为了揭开为何某些配置稳定而另一些崩溃，我们引入了新的几何诊断视角，表明成功案例被流形包覆。归根结底，我们的工作不仅仅是证明执行简洁且特定的回答能有效提升数学推理能力;我们揭示了这种无监督方法何时失效，并几何式地诊断原因。

Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

理据很重要：通过代理引导批评学习VLM奖励模型的可迁移评分标准

Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16600
Pdf link: https://arxiv.org/pdf/2603.16600
Abstract Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at this https URL.
中文摘要 视觉语言模型（VLMs）的生成奖励模型（GRM）通常通过三阶段流程评估输出：评分标准生成、基于标准的评分和最终结论。然而，中间评分标准很少直接优化。以往的工作通常要么将评分标准视为附带，要么依赖昂贵的LLM作为裁判的检查，这些检查没有可区分信号且培训时间有限。我们提出了代理GRM，将代理引导的评分标准验证引入强化学习（RL），以明确提升评分标准质量。具体来说，我们训练轻量级代理代理（代理-SFT和代理-RL），它们将候选评分标准与原始查询和偏好对结合，然后仅凭评分标准作为证据预测偏好排序。代理的预测准确性作为评分标准的奖励，激励模型生成内部一致且可转移的评分标准。凭借约5万个数据样本，代理GRM在VL奖励台、多模奖励台和MM-RLHF奖励台上达到了最先进的结果，超过了在四倍数据上训练的方法。消融结果显示，Proxy-SFT比Proxy-RL更强，隐式奖励聚合表现最佳。关键是，所学评分标准会转移到看不见的评估者手中，在测试时提升奖励准确性，无需额外培训。我们的代码可在此 https URL 访问。

What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline

如果匹诺曹是一个强化学习代理：一条规范性的端到端管道，会怎样

Authors: Benoît Alcaraz
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16651
Pdf link: https://arxiv.org/pdf/2603.16651
Abstract In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in "Le avventure di Pinocchio - Storia di un burattino", this thesis proposes a pipeline that addresses the problem of developing norm-compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors' decisions. Finally, this thesis investigates the phenomenon of "norm avoidance", providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.
中文摘要 在过去十年里，人工智能（AI）发展迅速。随着这种快速发展，带来了能够遵守社会规则和规范的系统的需求，以便这些系统能够成功且安全地融入我们的日常生活。本论文灵感来源于《木偶奇遇记》（Le avventure di Pinocchio - Storia di un burattino）中匹诺曹的故事，提出了一个解决规范合规且具情境感知智能体问题的流程。在AJAR、Jiminy和NGRL架构的基础上，该研究引入了\pino，一种由基于论证的规范顾问监督强化学习代理的混合模型。为了使该流程可运行，本论文还提出了一种新颖算法，用于自动提取顾问决策背后的论据和关系。最后，本论文探讨了“规范回避”现象，在强化学习代理的背景下提供了定义和缓解策略。流程的每个组成部分都经过实证评估。论文以相关工作、当前局限性及未来研究方向的讨论作为结尾。

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

机器人应该什么时候思考？通过强化学习实现具身机器人决策的资源感知推理

Authors: Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16673
Pdf link: https://arxiv.org/pdf/2603.16673
Abstract Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.
中文摘要 具身机器人系统越来越依赖基于大型语言模型（LLM）的智能体，以支持与环境交互中的高级推理、规划和决策。然而，调用LLM推理会带来显著的计算延迟和资源开销，可能中断动作执行并降低系统可靠性。过度推理可能会延迟行动，推理不足则常导致错误决策和任务失败。这对具身主体提出了一个根本性问题：主体何时应推理，何时应行动？在本研究中，我们提出了RARRL（通过强化学习实现资源感知推理），这是一种用于具身代理资源感知编排的分层框架。RARRL不学习低层控制策略，而是学习运行在代理决策层的高级编排策略。该策略使智能体能够根据当前观察、执行历史和剩余资源，自适应地决定是否调用推理、采用哪种推理角色以及分配多少计算预算。广泛的实验，包括基于ALFRED基准测试的经验延迟剖面评估，表明RARRL相较于固定或启发式推理策略，持续提升任务成功率，同时降低执行延迟并增强鲁棒性。这些结果表明，自适应推理控制对于构建可靠高效的具身机器人智能体至关重要。

Learning Whole-Body Control for a Salamander Robot

学习蝾螈机器人的全身控制

Authors: Mengze Tian, Qiyuan Fu, Chuanfang Ning, Javier Jia Jie Pey, Auke Ijspeert
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.16683
Pdf link: https://arxiv.org/pdf/2603.16683
Abstract Amphibious legged robots inspired by salamanders are promising for applications in complex amphibious environments. However, despite the significant success of training controllers that achieve diverse locomotion behaviors in conventional quadrupedal robots, most salamander robots have relied on central-pattern-generator (CPG)-based and model-based coordination strategies for locomotion control. Learning unified joint-level whole-body control that reliably transfers from simulation to highly articulated physical salamander robots remains relatively underexplored. In addition, few legged robots have tried learning-based controllers in amphibious environments. In this work, we employ Reinforcement Learning to map proprioceptive observations and commanded velocities to joint-level actions, allowing coordinated locomotor behaviors to emerge. To deploy these policies on hardware, we adopt a system-level real-to-sim matching and sim-to-real transfer strategy. The learned controller achieves stable and coordinated walking on both flat and uneven terrains in the real world. Beyond terrestrial locomotion, the framework enables transitions between walking and swimming in simulation, highlighting a phenomenon of interest for understanding locomotion across distinct physical modes.
中文摘要 受蝾螈启发的两栖腿机器人在复杂两栖环境中具有前景。然而，尽管在传统四足机器人中训练控制器实现多样化运动行为取得了显著成功，大多数蝾螈机器人仍依赖基于中央模式生成器（CPG）和基于模型的协调策略进行运动控制。学习能够可靠地从仿真转移到高度关节化的物理蝾螈机器人的统一关节级全身控制，仍然相对缺乏探索。此外，很少有腿部机器人尝试在两栖环境中使用基于学习的控制器。在本研究中，我们运用强化学习将本体感觉观察和指令速度映射到关节层面的动作，从而实现协调的运动行为。为了在硬件上部署这些策略，我们采用系统级的实物到模拟匹配和模拟到真实传输策略。学习过的控制器能够在现实世界中的平坦和不平坦地形上实现稳定且协调的行走。除了陆地运动，该框架还实现了行走与游泳之间的模拟转换，凸显了理解不同物理模式运动的一种重要现象。

GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

GDPO-SR：单步生成图像超分辨率的组直接偏好优化

Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Lei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16769
Pdf link: https://arxiv.org/pdf/2603.16769
Abstract Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: this https URL.
中文摘要 近年来，强化学习（RL）被用于提升生成图像超分辨率（ISR）性能。然而，目前的努力主要集中在多步生成式ISR，而单步生成式ISR由于其随机性有限，尚未被充分探索。此外，强化学习方法如直接偏好优化（DPO）要求离线生成正负样本对，导致样本数量有限，而群相对策略优化（GRPO）仅计算整个图像的似然，忽略对ISR至关重要的局部细节。本文提出了群体直接偏好优化（Group Direct Preference Optimization，简称GDPO），这是一种将强化学习整合进一步生成式ISR模型训练的新方法。首先，我们引入了一种噪声感知的一步扩散模型，能够生成多样化的ISR输出。为防止噪声注入导致的性能下降，我们引入了不等时步策略，将噪声添加时间步与扩散时间步解耦。随后，我们介绍了将GRPO原理整合进DPO的GDPO策略，以计算每个在线生成样本的群体相对优势以优化模型。同时，设计了一个属性感知奖励函数，基于平滑和纹理区域的统计数据动态评估每个样本的得分。实验证明了GDPO在提升单步生成式ISR模型性能方面的有效性。代码：这个 https URL。

Anticipatory Planning for Multimodal AI Agents

多模态人工智能代理的前瞻性规划

Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16777
Pdf link: https://arxiv.org/pdf/2603.16777
Abstract Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
中文摘要 多模态智能体的最新进展改善了计算机使用的交互和工具使用，但大多数现有系统仍处于被动状态，孤立地优化动作，不考虑未来状态或长期目标。这限制了计划的连贯性，阻碍了代理可靠地解决高层次、多步骤的任务。我们引入了TraceR1，一个两阶段的强化学习框架，通过在执行前预测短视距轨迹，明确训练预期推理。第一阶段执行轨迹级强化学习，奖励强化预测动作序列的全局一致性。第二阶段则采用基础加固微调，利用冻结工具代理的执行反馈来优化步骤级精度和可执行性。TraceR1在七个基准测试中进行了评估，涵盖在线计算机使用、离线计算机使用基准以及多模态工具使用推理任务，在规划稳定性、执行鲁棒性和泛化性方面取得了显著提升，优于反应式和单阶段基线。这些结果表明，预见轨迹推理是构建多模态代理的关键原则，能够在复杂的现实环境中有效推理、规划和行动。

Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

深度强化学习驱动的边缘卸载，适用于延迟受限的XR管道

Authors: Sourya Saha (City University of New York), Saptarshi Debroy (City University of New York)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.16823
Pdf link: https://arxiv.org/pdf/2603.16823
Abstract Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.
中文摘要 沉浸式扩展现实（XR）应用引入了延迟极高的工作负载，必须满足严格的实时响应性，同时运行在能量和电池受限的设备上，这使得终端设备与附近边缘服务器之间的执行布局成为系统根本性挑战。现有的自适应执行和计算卸载方法通常优化平均性能指标，未能完全反映闭环XR工作负载中实时延迟需求与设备电池寿命之间的持续相互作用。本文提出了一个针对边缘辅助XR系统的电池感知执行管理框架，综合考虑执行位置、工作负载质量、延迟需求和电池动态。我们设计了基于轻量级深度强化学习策略的在线决策机制，在动态网络条件下持续调整执行决策，同时保持高运动到光子延迟的合规性。实验结果显示，所提方法可将预计的设备电池寿命延长多达163%，相较于延迟最优的局部执行，同时在稳定网络条件下保持超过90%的运动到光子延迟合规性。即使在网络带宽极度有限的情况下，这种合规率也不会低于80%，从而证明了在沉浸式XR系统中明确管理延迟-能量权衡的有效性。

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

学习呈现：agentic slide 生成的逆规范奖励

Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16839
Pdf link: https://arxiv.org/pdf/2603.16839
Abstract Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: this https URL Code: this https URL
中文摘要 自动化演示生成依然是一项充满挑战的任务，需要连贯的内容创作、视觉设计和受众感知的沟通。本研究提出了一个兼容OpenEnv的强化学习环境，LLM代理通过工具学习研究主题、规划内容并生成专业HTML幻灯片演示。我们引入了一套多元奖励系统，结合了结构验证、渲染质量评估、基于LLM的美学评分、内容质量指标以及衡量真实生成幻灯片传达预期目的的反向规范奖励。反向规范奖励是一种“反向任务”，即大型语言模型尝试从生成的幻灯片中恢复原始规范，提供了整体质量信号。我们的方法通过GRPO微调Qwen2.5-Coder-7B，仅对基于Claude Opus 4.6收集的专家演示生成的提示进行0.5%的参数训练。在六个模型中对48份多样化商业简报进行实验显示，我们经过精细调优的7B模型实现了Claude Opus 4.6的91.2%质量，同时比基础模型提升了33.1%。六模型比较显示，指令遵循和工具使用合规性而非原始参数数量决定了代理任务的表现。我们贡献了SlideRL，这是一个开源数据集，涵盖了六个模型的288条多回合推广轨迹：这个 https URL 代码：这个 https URL
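
As a rough illustration of the training signal described in the abstract above, the sketch below combines per-component scores into one reward and then standardizes rewards within a group of rollouts, which is the group-relative step that GRPO is named for. The equal weights, the component names, the [0, 1] score range, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not state them.

```python
from statistics import mean, pstdev

# Hypothetical component weights: the abstract names the reward components
# but not how they are weighted, so equal weights are assumed here.
WEIGHTS = {
    "structure": 0.2,
    "render": 0.2,
    "aesthetics": 0.2,
    "content": 0.2,
    "inverse_spec": 0.2,
}


def combined_reward(scores):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical rollouts for one brief, already reduced to scalar rewards.
    print(group_relative_advantages([0.2, 0.5, 0.8]))
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.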

Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

随机重置加速强化学习中的策略趋同

Authors: Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Systems and Control (eess.SY); Biological Physics (physics.bio-ph)
Arxiv link: https://arxiv.org/abs/2603.16842
Pdf link: https://arxiv.org/pdf/2603.16842
Abstract Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.
中文摘要 随机重置，即将动力学过程间歇性地返回固定参考态，已成为优化首次通道特性的强大机制。现有理论主要处理静态的非学习过程。这里我们探讨随机重置如何与强化学习相互作用，底层动态如何通过经验进行适应。在表格网格环境中，我们发现即使重置并未减少纯扩散代理的搜索时间，也能加速策略收敛，表明存在超越经典首通道优化的新机制。在基于神经网络的值近似连续控制任务中，我们展示了当探索困难且奖励稀疏时，随机重置能改善深度强化学习。与时间贴现不同，重置保持最优策略，同时通过截断长且无信息的轨迹以增强价值传播，加速收敛。我们的结果确立了随机重置作为一种简单且可调的加速学习机制，将统计力学中的典型现象转化为强化学习的优化原则。
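
To make the idea in the abstract above concrete, here is a minimal sketch of tabular Q-learning with stochastic resetting on a one-dimensional chain. It is not the authors' environment or code. The chain layout, the hyperparameters, and the choice to skip the Q update on a reset step are all assumptions made for illustration.

```python
import random


def q_learning_with_resetting(n_states=6, reset_prob=0.05, episodes=200,
                              max_steps=100, alpha=0.5, gamma=0.95,
                              epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D chain with stochastic resetting to state 0.

    Actions: 0 = left, 1 = right. Reaching the last state gives reward 1
    and ends the episode; every other transition gives reward 0.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Stochastic resetting (assumed form): with probability reset_prob
            # the agent is returned to the start state and no Q update is made.
            if rng.random() < reset_prob:
                state = 0
                continue

            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Standard Q-learning target; no bootstrapping from the terminal state.
            target = reward if done else gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])

            if done:
                break
            state = next_state
    return q


if __name__ == "__main__":
    for row in q_learning_with_resetting():
        print(row)
```

Setting `reset_prob=0.0` recovers plain Q-learning, so the two settings can be compared on the same seed.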

DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

DreamPlan：通过视频世界模型高效强化和精细调优视觉语言规划师

Authors: Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.16860
Pdf link: https://arxiv.org/pdf/2603.16860
Abstract Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is this https URL.
中文摘要 机器人操作需要复杂的常识推理能力，而这正是大规模视觉语言模型（VLMs）自然具备的能力。虽然VLM在零发射器规划方面展现出潜力，但其缺乏扎实的物理理解，常导致在复杂现实环境中出现累积错误和低成功率，尤其是在可变形物体操作等具有挑战性的任务中。尽管强化学习（RL）可以将这些规划器调整到特定任务动态，但通过现实世界交互直接微调VLM的成本过高、不安全且样本效率低下。为克服这一瓶颈，我们引入了DreamPlan，这是一个通过视频世界模型对VLM规划器进行强化微调的新型框架。DreamPlan没有依赖昂贵的实体部署，而是首先利用零次VLM收集探索性交互数据。我们证明，这些次优数据足以训练一个动作条件视频生成模型，该模型隐含地捕捉了复杂的现实物理。随后，VLM规划器完全在该视频世界模型的“想象”中，利用比值比策略优化（ORPO）进行微调。通过利用这些虚拟部署，物理和任务特定的知识高效地注入VLM。我们的结果表明，DreamPlan弥合了语义推理与物理基础之间的鸿沟，显著提升了操作成功率，无需大规模的真实世界数据收集。我们的项目页面是这个 https URL。

Efficient Reasoning on the Edge

边缘的高效推理

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16867
Pdf link: https://arxiv.org/pdf/2603.16867
Abstract Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller ones, but these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at a minor increase in latency. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
中文摘要 具备思维链推理的大型语言模型（LLM）在复杂问题解决任务中实现了最先进的性能，但其冗长的推理痕迹和庞大的上下文需求使其在边缘部署中不切实际。这些挑战包括高额的令牌生成成本、巨大的KV缓存占用，以及在将推理能力提炼到移动设备小模型时的低效。现有方法常依赖于将较大模型的推理痕迹提炼成较小模型，这些模型冗长且风格上多余，不适合设备内推断。本研究提出一种轻量级方法，利用LoRA适配器结合监督微调，实现小型大型语言模型的推理。我们还通过强化学习在这些适配器上引入了预算强制，显著缩短响应长度且精度损失最小。为了解决内存受限解码问题，我们利用并行测试时间缩放技术，在小幅延迟增加下提高准确性。最后，我们提出了一种动态适配器切换机制，仅在需要时激活推理，以及在提示编码时的KV缓存共享策略，缩短设备推理的首次令牌时间。Qwen2.5-7B的实验表明，我们的方法在严格资源限制下实现高效且准确的推理，使LLM推理在移动场景中非常实用。演示我们解决方案在移动设备上运行的视频可在我们的项目页面观看。
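
The abstract above mentions parallel test-time scaling without saying how the parallel samples are aggregated. One common form is majority voting over the final answers of several independently sampled responses, and the sketch below shows only that aggregation step. The function name and the string-valued answers are hypothetical, and majority voting itself is an assumption rather than the paper's stated method.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    # max() returns the first maximal key in insertion order, which makes
    # the tie-breaking rule explicit.
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Final answers extracted from three hypothetical parallel samples.
    print(majority_vote(["42", "41", "42"]))
```

Because the samples are decoded in parallel, the added cost is mostly compute rather than wall-clock time, which is consistent with the small latency increase the abstract reports.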

Keyword: diffusion policy

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

ExpertGen：可扩展的模拟到现实专家政策，从不完美行为的先验中学习

Authors: Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu, Eric Rosen, Kausik Sivakumar, Riedana Yan, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15956
Pdf link: https://arxiv.org/pdf/2603.15956
Abstract Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.
中文摘要 学习可通用且稳健的行为克隆策略需要大量高质量的机器人数据。虽然人工演示（例如远程操作）是专家行为的标准来源，但在现实世界中大规模获取此类数据成本高昂。本文介绍了ExpertGen框架，该框架自动化专家在模拟中进行策略学习，实现可扩展的模拟到真实传输。ExpertGen 首先使用基于不完美演示训练的扩散策略来初始化行为，这些演示可以由大型语言模型综合，也可以由人类提供。随后，强化学习通过优化扩散模型的初始噪声，同时保持原始策略冻结，从而引导该先验朝高任务成功率方向发展。通过保持预训练扩散策略冻结，ExpertGen使探索规则化，保持在安全、类人行为的多形内，同时实现了仅有稀少奖励的有效学习。对挑战性操作基准的实证评估表明，ExpertGen 可靠地生成高质量的专家策略，无需奖励工程。在工业组装任务中，ExpertGen 实现了 90.5% 的整体成功率，而在长期操作任务中，其整体成功率达到 85%，超过所有基线方法。最终的政策展现了灵活的控制力，并在多种初始配置和失效状态下依然稳健。为了验证模拟到现实的传输，基于状态的专家策略通过DAgger进一步提炼为视觉运动策略，并成功部署到真实机器人硬件上。

Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy

样式条件扩散策略的可预测性和可读性编码

Authors: Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, Mohamed Chetouani
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16368
Pdf link: https://arxiv.org/pdf/2603.16368
Abstract Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot's actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot's goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment's configuration. Our method utilizes a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.
中文摘要 在效率与透明运动之间取得平衡是人机协作的核心挑战，因为高度表现力的动作往往会消耗不必要的时间和精力。在协作环境中，可读性使人类观察者更好地理解机器人的行为，提高了安全性和信任度。然而，这些行为会导致轨迹不够优且夸张，在机器人目标已经明显的低歧义场景中显得多余。为解决这一权衡，我们提出了风格条件扩散策略（SCDP），这是一种模块化框架，基于环境配置，限制预训练扩散模型的轨迹生成，使其趋向可读性或效率。我们的方法利用训练后流水线冻结基础策略，并训练一个轻量级场景编码器和条件预测器来调制扩散过程。在推理时，歧义检测模块激活相应的条件，仅优先处理模糊目标的表达运动，否则恢复到高效路径。我们评估了SCDP在操作和导航任务中的应用，结果显示它在模糊环境中提升可读性，同时在无需可读性时保持最佳效率，且无需重新训练基础策略。