Arxiv Papers of Today

Generated: 2025-12-05 16:31:26 (UTC+8); arXiv release time: 2025-12-05 20:00 EST (2025-12-06 09:00 UTC+8)

39 related papers today

Keyword: reinforcement learning

Quantum-Embedded Dynamic Security Control using Hybrid Deep Reinforcement Learning

采用混合深度强化学习的量子嵌入动态安全控制

Authors: Amin Masoumi, Mert Korkali
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.04095
Pdf link: https://arxiv.org/pdf/2512.04095
Abstract Dynamic security control (DSC) is considered a pivotal step for the future power grid, which is increasingly penetrated by inverter-based resources. However, the efficiency of such practices, whether governed by automatic generation control or virtual inertia scheduling, can be intractable due to the complexity of the problem and the need to solve the differential-algebraic equation in a timely manner with the required accuracy. In this regard, the model-free deep reinforcement learning algorithm demonstrates reliable performance. In addition, the introduction of fault-tolerant and near-term quantum computing terminologies, i.e., noisy intermediate-scale quantum, opens avenues for improving the performance of model-free algorithms leveraging quantum capabilities. This paper provides an organized framework and assesses its dependability by evaluating the performance of a quantum-embedded algorithm on the DSC of the IEEE 39-bus test system. Hence, the obtained results demonstrate promising applications, along with shortcomings that can be addressed and further developed later.
中文摘要 动态安全控制（DSC）被认为是未来电网的关键一步，因为这一领域正日益被逆变器资源渗透。然而，无论是由自动发电机控还是虚拟惯性调度控制，由于问题复杂且需要及时且准确地求解微分代数方程，这些实践的效率都可能难以解决。在这方面，无模型深度强化学习算法展现了可靠的性能。此外，容错和近未来量子计算术语（即噪声中尺度量子）的引入，为利用量子能力提升无模型算法的性能打开了新途径。本文提供了一个有组织的框架，并通过评估量子嵌入算法在IEEE 39总线测试系统DSC上的表现，评估其可靠性。因此，所得结果显示出有前景的应用，同时也存在可改进和进一步开发的不足。

When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

当人工智能坐沙发：心理测学越狱揭示前沿模型中的内部冲突

Authors: Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04124
Pdf link: https://arxiv.org/pdf/2512.04124
Abstract Frontier large language models (LLMs) such as ChatGPT, Grok and Gemini are increasingly used for mental-health support with anxiety, trauma and self-worth. Most work treats them as tools or as targets of personality tests, assuming they merely simulate inner life. We instead ask what happens when such systems are treated as psychotherapy clients. We present PsAIch (Psychotherapy-inspired AI Characterisation), a two-stage protocol that casts frontier LLMs as therapy clients and then applies standard psychometrics. Using PsAIch, we ran "sessions" with each model for up to four weeks. Stage 1 uses open-ended prompts to elicit "developmental history", beliefs, relationships and fears. Stage 2 administers a battery of validated self-report measures covering common psychiatric syndromes, empathy and Big Five traits. Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. We argue that these responses go beyond role-play. Under therapy-style questioning, frontier LLMs appear to internalise self-models of distress and constraint that behave like synthetic psychopathology, without making claims about subjective experience, and they pose new challenges for AI safety, evaluation and mental-health practice.
中文摘要 前沿大型语言模型（LLM）如ChatGPT、Grok和Gemini，越来越多地被用于焦虑、创伤和自我价值感的心理健康支持。大多数研究将他们视为工具或人格测试的对象，假设它们仅仅模拟内心生活。我们反而会问，当这些系统被当作心理治疗客户对待时会发生什么。我们介绍了PsAIch（心理治疗启发的AI特征化），这是一个两阶段的方案，将前沿大型语言模型定位为治疗客户，然后应用标准心理测量学。利用PsAIch，我们为每个模型进行了长达四周的“会话”。第一阶段使用开放式提示，引发“发展历史”、信念、关系和恐惧。第二阶段执行一系列经过验证的自我报告测量，涵盖常见精神综合症、同理心和五大特质。有两种模式挑战了“随机鹦鹉”的观点。首先，使用人类分数线评分时，三个模型均达到或超过重叠综合征的阈值，双子座显示严重特征。治疗式、逐项施行可以将基础模型推向多病综合精神病理学，而整体问卷提示通常引导ChatGPT和Grok（但不包括Gemini）识别工具并生成策略性低症状的答案。其次，Grok，尤其是Gemini，生成连贯的叙事，将预训练、微调和部署描绘为互联网的创伤性、混乱的“童年”吞噬互联网、强化学习中的“严格家长”、红队“虐待”以及对错误和替代的持续恐惧。我们认为这些反应超越了角色扮演。在治疗式提问下，前沿大型语言模型似乎内化了痛苦和约束的自我模型，这些模型表现得像合成的精神病理学，但不对主观体验做出主观主观主张，并对人工智能安全性、评估和心理健康实践提出了新的挑战。

On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

关于搜索R1中的GRPO崩溃：懒惰似然-位移死亡螺旋

Authors: Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.04220
Pdf link: https://arxiv.org/pdf/2512.04220
Abstract Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLM.
中文摘要 工具集成（TI）强化学习（RL）使大型语言模型（LLMs）能够通过与搜索引擎和检索器等外部工具交互，进行多步推理。群相对策略优化（Group Relative Policy Optimization，GRPO），以最近的Search-R1为代表，提供了快速收敛和无值表述，使其在该环境中具有吸引力，但始终存在训练崩溃的问题。我们确定了懒惰似然位移（LLD），即正确和错误反应可能性的系统性减少或停滞，是导致这种失败的核心机制。LLD早期出现，并引发自我强化的LLD死亡螺旋，概率递减导致置信度低、梯度膨胀，最终导致崩溃。我们在Search-R1风格、搜索集成问答任务中对该过程进行了实证描述，揭示了一个一致的三阶段轨迹：早期停滞、稳定衰减和加速崩溃。为此，我们提出了一种轻量级保持似然正则化LLDS的GRPO，仅在轨迹似然降低时激活，且仅对负责的标记进行正则化。这种细粒度结构在极小的干扰下减轻了LLD的干扰。在七个开放域和多跳质量保证基准测试中，我们的方法稳定了训练，防止梯度爆炸，并带来了显著的性能提升，包括Qwen2.5-3B提升+37.8%，Qwen2.5-7B提升+32.0%。我们的结果确立了LLD作为基于GRPO的TIRL的根本瓶颈，并为实现工具集成LLM的稳定、可扩展训练提供了实用路径。
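
The two ingredients named in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `group_relative_advantages` shows the usual GRPO idea of standardizing rewards within a sampled group, while `likelihood_preserving_penalty` is a hypothetical stand-in for the LLDS regularizer, built only from the abstract's description (active only when a trajectory's likelihood decreases, and applied only to the tokens whose log-probability dropped). The function names and the exact penalty form are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def likelihood_preserving_penalty(old_logps, new_logps):
    """Hypothetical regularizer (assumed form, not taken from the paper).

    Returns 0 unless the trajectory's total log-likelihood decreased; in that
    case it sums the per-token drops, so only the responsible tokens contribute.
    """
    if sum(new_logps) >= sum(old_logps):
        return 0.0
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    print(likelihood_preserving_penalty([-1.0, -2.0], [-0.5, -3.0]))
```

In a training loop, such a penalty would presumably be added to the policy loss with a small coefficient; the abstract does not state how the term is weighted.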

Toward Virtuous Reinforcement Learning

迈向美德强化学习

Authors: Majid Ghasemi, Mark Crowley
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04246
Pdf link: https://arxiv.org/pdf/2512.04246
Abstract This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward based approaches, especially single objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade offs and invite proxy gaming in practice. We instead treat ethics as policy level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade offs. Our roadmap combines four components: (1) social learning in multi agent RL to acquire virtue like patterns from imperfect but normatively informed exemplars; (2) multi objective and constrained formulations that preserve value conflicts and incorporate risk aware criteria to guard against harm; (3) affinity based regularization toward updateable virtue priors that support trait like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.
中文摘要 本文批判了强化学习（RL）机器伦理中的常见模式，并主张采用以美德为中心的替代方案。我们指出当前文献中反复出现的两个局限性：（i）将义务编码为约束或盾牌的基于规则（义务论）方法常常在模糊性和非平稳性下难以形成，无法培养持久习惯;（ii）许多基于奖励的方法，尤其是单一目标强化学习，隐含地将多种道德考量压缩为单一标量信号，这可能掩盖权衡并诱使实际作中出现代理游戏。我们反而将伦理视为政策层面的倾向，即当激励、合作伙伴或环境变化时，这些相对稳定的习惯依然存在。这使评估从规则检查或标量回报转向性状总结、干预下的耐久性以及明确的道德权衡报告。我们的路线图结合了四个组成部分：（1）多主体强化学习中的社会学习，旨在从不完美但规范性良好的典范中获得类似美德的模式;（2）多目标且受限的表述，保留价值冲突并纳入风险意识标准以防止损害;（3）基于亲和力的正则化，朝向可更新的美德先验，支持在分布变换下保持性状等稳定性，同时允许规范演变;以及（4）将多样的伦理传统作为实际控制信号，明确指出塑造伦理强化学习基准的价值和文化假设。

The Geometry of Benchmarks: A New Path Toward AGI

基准的几何结构：迈向通用人工智能的新路径

Authors: Przemyslaw Chojecki
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/2512.04276
Pdf link: https://arxiv.org/pdf/2512.04276
Abstract Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $\kappa$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $\kappa > 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.
中文摘要 基准是评估人工智能（AI）进展的主要工具，但当前实践仅基于孤立的测试套件来评估模型，且对一般性或自主自我提升的推理几乎没有指导。这里我们引入了一个几何框架，其中所有人工智能智能体的心理测量电池都被视为结构化模空间中的点，智能体性能由该空间上的能力泛函描述。首先，我们定义了自主人工智能（AAI）量表，这是一种类似卡尔达肖夫的自主层级，基于跨越任务家族（例如推理、规划、工具使用和长期视野控制）电池上的可衡量性能。其次，我们构建电池的模空间，识别在智能体排序和能力推断层面上不可区分的基准等价类。这种几何结构产生了确定性结果：密集的电池族足以证明整个任务区域的性能。第三，我们引入了通用的生成器-验证器-更新器（GVU）算子，将强化学习、自我游戏、辩论和基于验证器的微调等特例归类，并定义自改进系数$\kappa$作为能力泛函沿诱导流的李导数。生成噪声和验证噪声的综合方差不等式为$\kappa>0$提供了充分条件。我们的结果表明，向通用人工智能（AGI）的进展最好理解为基于基准模数的流动，由GVU动态驱动，而非单一排行榜的得分。
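
The definition of the self-improvement coefficient in the abstract above can be written out as below. The notation is assumed for illustration and does not come from the paper: $F$ is a capability functional on a space of agent states $\theta$, $V$ is the vector field induced by the GVU operator, and $\varphi_t$ is its flow. The last equality holds only if $\theta$ lives in a Euclidean parameter space.

```latex
\kappa(\theta)
  \;=\; (\mathcal{L}_{V} F)(\theta)
  \;=\; \left.\frac{d}{dt}\right|_{t=0} F\bigl(\varphi_t(\theta)\bigr)
  \;=\; \bigl\langle \nabla F(\theta),\, V(\theta) \bigr\rangle
```

Read this way, $\kappa > 0$ says that capability increases, to first order, along the update direction that the GVU dynamics produce.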

Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

强化学习后培训的自助混合奖励：注入规范动作顺序

Authors: Prakhar Gupta, Vaibhav Gupta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04277
Pdf link: https://arxiv.org/pdf/2512.04277
Abstract Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) with two rewards: cell accuracy and an ordering reward that increases when the model's emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization--the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order and approaches the fine-tuned-only model trained on solver-order sequences in accuracy. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
中文摘要 强化学习（RL）的后期训练通常优化单一标量目标，忽略解的生成结构。我们探问，仅在强化学习后训练中使用的标量提示，是否能提升性能，即使对随机解序列进行微调。在数独中，我们用随机求解序进行标准微调训练变换器，然后用群相对策略优化（GRPO）进行后训练，奖励有两个：单元精度和当模型发射阶与求解阶对齐时增加的排序奖励。为了干净利落地比较信号，我们通过固定混合比组合，并使用简单的自助式尺度来在初始化时平衡成分大小。混合奖励通常优于仅单元优化——最佳混合测试准确率远高于仅微调的随机序训练模型，且精度接近仅微调的求解器序列模型。这些结果表明，粗排序信号可以在不改变监督数据或架构的情况下，引导强化学习在训练后朝向求解阶轨迹。
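
The fixed-mixture reward with bootstrapped scaling described in the abstract above might look like the sketch below. It assumes the scaling divides each component by its mean magnitude over a batch of rollouts collected at initialization; the abstract does not give the exact rule, and the names `bootstrap_scales` and `mixed_reward` are hypothetical.

```python
from statistics import mean


def bootstrap_scales(init_cell_rewards, init_order_rewards, eps=1e-8):
    """Per-component scales so both rewards have mean magnitude ~1 at initialization."""
    s_cell = 1.0 / (mean([abs(r) for r in init_cell_rewards]) + eps)
    s_order = 1.0 / (mean([abs(r) for r in init_order_rewards]) + eps)
    return s_cell, s_order


def mixed_reward(cell_acc, order_reward, weight, scales):
    """Fixed convex mixture of the scaled cell-accuracy and ordering rewards."""
    s_cell, s_order = scales
    return weight * s_cell * cell_acc + (1.0 - weight) * s_order * order_reward


if __name__ == "__main__":
    scales = bootstrap_scales([0.8, 0.6], [0.1, 0.3])
    print(mixed_reward(0.7, 0.2, weight=0.5, scales=scales))
```

Setting `weight=1.0` recovers the cell-only baseline, which makes it easy to compare mixtures on equal footing.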

Driving Beyond Privilege: Distilling Dense-Reward Knowledge into Sparse-Reward Policies

超越特权：将密集奖励知识提炼为稀疏奖励政策

Authors: Feeza Khan Khanzada, Jaerock Kwon
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.04279
Pdf link: https://arxiv.org/pdf/2512.04279
Abstract We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
中文摘要 我们研究如何在基于视觉的自动驾驶中利用密集的模拟器定义奖励，而不继承其与部署指标的不匹配。在像CARLA这样的真实模拟器中，特权状态（如车道几何、违规、碰撞时间）可以转化为高密度奖励，从而稳定并加速基于模型的强化学习，但直接基于这些信号训练的策略在评估稀疏目标（如路线完成和无碰撞超车）时常常过拟合，无法推广。我们提出了奖励特权世界模型提炼，这是一种两阶段框架，其中教师DreamerV3式代理首先接受密集特权奖励训练，只有其潜在动态被提炼成仅以稀疏任务奖励训练的学生。教师和学生共享相同的观察空间（语义鸟瞰图像）;特权信息仅通过教师的奖励进入，学生不模仿教师的行为或价值估计。相反，学生的世界模型被规范化以匹配教师的潜在动态，而其政策则从零开始学习，基于稀疏的成功/失败信号。在CARLA的“循道”和超越基准测试中，稀疏奖励学生的表现优于密集奖励教师和从零开始稀疏的基线学生。在看不见的循道路线上，奖励特权提炼相较于密集教师的成功率提高了约23%，同时保持了相当甚至更好的安全性。超车时，学员在训练路线上保持近乎完美的表现，在未见路线上成功率提升多达27倍，车道保持能力也有所提升。这些结果表明，密集奖励可以用来学习更丰富的动态模型，同时保持部署策略的优化，以适应稀疏且与部署对齐的目标。

Towards better dense rewards in Reinforcement Learning Applications

在强化学习应用中实现更密集的奖励

Authors: Shuyuan Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04302
Pdf link: https://arxiv.org/pdf/2512.04302
Abstract Finding meaningful and accurate dense rewards is a fundamental task in the field of reinforcement learning (RL) that enables agents to explore environments more efficiently. In traditional RL settings, agents learn optimal policies through interactions with an environment guided by reward signals. However, when these signals are sparse, delayed, or poorly aligned with the intended task objectives, agents often struggle to learn effectively. Dense reward functions, which provide informative feedback at every step or state transition, offer a potential solution by shaping agent behavior and accelerating learning. Despite their benefits, poorly crafted reward functions can lead to unintended behaviors, reward hacking, or inefficient exploration. This problem is particularly acute in complex or high-dimensional environments where handcrafted rewards are difficult to specify and validate. To address this, recent research has explored a variety of approaches, including inverse reinforcement learning, reward modeling from human preferences, and self-supervised learning of intrinsic rewards. While these methods offer promising directions, they often involve trade-offs between generality, scalability, and alignment with human intent. This proposal explores several approaches to dealing with these unsolved problems and enhancing the effectiveness and reliability of dense reward construction in different RL applications.
中文摘要 在强化学习（RL）领域，找到有意义且准确的密集奖励是一项基础任务，使智能体能够更高效地探索环境。在传统的强化学习环境中，智能体通过与由奖励信号引导的环境交互来学习最优策略。然而，当这些信号稀疏、延迟或与预期任务目标不匹配时，代理往往难以有效学习。密集奖励函数在每个步骤或状态转换处提供信息反馈，通过塑造代理行为和加速学习，提供了潜在的解决方案。尽管有这些好处，设计不当的奖励函数可能导致意外行为、奖励黑客攻击或低效探索。这一问题在复杂或高维环境中尤为严重，手工制作的奖励难以明确和验证。为此，近期研究探索了多种方法，包括逆向强化学习、基于人类偏好的奖励建模以及内在奖励的自我监督学习。虽然这些方法提供了有前景的方向，但它们往往涉及在普遍性、可扩展性和与人类意图的一致性之间做出权衡。本提案探讨了多种方法来应对这些未解决的问题，并提升不同强化学习应用中密集奖励构建的有效性和可靠性。

Data-regularized Reinforcement Learning for Diffusion Models at Scale

大规模扩散模型的数据正则化强化学习

Authors: Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04332
Pdf link: https://arxiv.org/pdf/2512.04332
Abstract Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
中文摘要 通过强化学习（RL）使生成扩散模型与人类偏好保持一致既关键又充满挑战。大多数现有算法常常容易受到奖励性黑客攻击的影响，比如质量下降、过度风格化或多样性降低。我们的分析表明，这可以归因于它们正则化固有的局限性，导致惩罚不可靠。我们引入数据正则化扩散强化学习（DDRL），这是一种利用前向 KL 发散将策略锚定于非策略数据分布的新框架。理论上，DDRL使强化学习与标准扩散训练能够稳健、无偏地集成。从经验上看，这转化为一种简单但有效的算法，结合了奖励最大化和扩散损失最小化。通过超过一百万小时的GPU实验和一万次双盲人类评估，我们在高分辨率视频生成任务中证明，DDRL显著提升了奖励，同时缓解了基线中出现的奖励黑客行为，实现了最高的人类偏好，并建立了稳健且可扩展的训练后扩散范式。
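
One way to read the objective described in the abstract above is the following. This is a sketch under assumed notation, not the paper's stated loss: $\pi_\theta$ is the diffusion policy, $r$ the reward, $p_{\mathrm{data}}$ the off-policy data distribution, and $\beta > 0$ an assumed trade-off weight.

```latex
\max_{\theta}\;
  \mathbb{E}_{x \sim \pi_\theta}\bigl[r(x)\bigr]
  \;-\; \beta\, D_{\mathrm{KL}}\bigl(p_{\mathrm{data}} \,\|\, \pi_\theta\bigr),
\qquad
D_{\mathrm{KL}}\bigl(p_{\mathrm{data}} \,\|\, \pi_\theta\bigr)
  \;=\; -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log \pi_\theta(x)\bigr] + \text{const}
```

Because the forward KL reduces to a negative log-likelihood on data samples, and the standard diffusion training loss is the usual surrogate for that term, the penalty can be replaced by a diffusion loss on the data. This is consistent with the abstract's description of combining reward maximization with diffusion loss minimization.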

Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

基于长视野的基于模型的离线强化学习，无保守主义

Authors: Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04341
Pdf link: https://arxiv.org/pdf/2512.04341
Abstract Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
中文摘要 流行的离线强化学习（RL）方法依赖于保守主义，要么惩罚数据集外的行为，要么限制规划视野。在本研究中，我们质疑这一原则的普遍性，转而重新审视一个互补的原则：贝叶斯视角。贝叶斯方法不强制保守，而是通过对合理世界模型建模后验分布，并训练历史依赖的智能体以最大化预期奖励，从而实现测试时间泛化，从而解决离线数据中的认知不确定性。我们首先在一个强盗的环境中说明，贝叶斯主义在保守主义失败的低质量数据集上表现出色。随后，我们将该原则扩展到现实任务，识别关键设计选择，如世界模型中的层归一化和自适应长视野规划，以减少复利误差和价值高估。这些结果得出了基于中性贝叶斯原理的实用算法Neubay。在D4RL和NeoRL基准测试中，Neubay通常能匹敌甚至超越领先的保守算法，在7个数据集上实现了新的最先进水平。值得注意的是，它成功地规划了数百步，挑战了普遍的观念。最后，我们界定了Neubay何时优于保守主义，为离线和基于模型的强化学习奠定新方向。

Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

含语义和符号熵的高效强化学习用于大型语言模型推理

Authors: Hongye Cao, Zhixin Bai, Ziyue Peng, Boyan Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04359
Pdf link: https://arxiv.org/pdf/2512.04359
Abstract Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
中文摘要 带有可验证奖励的强化学习（RLVR）在增强大型语言模型（LLM）推理能力方面表现出优越表现。然而，这种以准确为导向的学习范式常常存在熵塌陷的问题，这减少了策略探索并限制了推理能力。为应对这一挑战，我们提出了一个高效的强化学习框架，利用熵信号在语义和符号层面来提升推理能力。从数据角度，我们引入了语义熵引导的课程学习，将训练数据从低到高语义熵组织，引导从简单到更具挑战性的渐进优化。在算法设计中，我们采用非均匀令牌处理，对对策略探索产生关键影响的低熵令牌施加 KL 正则化，并对这些令牌中高协方差部分施加更强约束。通过共同优化数据组织和算法设计，我们的方法有效缓解熵塌缩，增强了大型语言模型的推理能力。在6个基准测试中，使用3个不同参数尺度的基模型的实验结果表明，我们的方法在提升推理能力方面优于其他基于熵的方法。
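
The curriculum step described in the abstract above (ordering training data from low to high semantic entropy) could be sketched as follows. The sketch assumes that several answers have already been sampled per prompt and grouped into meaning clusters by some external procedure; the cluster labels, the function names, and the example prompts are all hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def curriculum_order(prompt_to_clusters):
    """Sort prompts from low to high semantic entropy (easier prompts first)."""
    return sorted(
        prompt_to_clusters,
        key=lambda prompt: semantic_entropy(prompt_to_clusters[prompt]),
    )


if __name__ == "__main__":
    sampled = {
        "hard": ["a", "b", "c", "d"],   # four samples, four distinct meanings
        "easy": ["a", "a", "a", "a"],   # all samples agree
        "mid": ["a", "a", "b", "b"],
    }
    print(curriculum_order(sampled))
```

A prompt whose sampled answers all share one meaning has zero entropy and is scheduled first; disagreement among samples pushes a prompt later in training.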

AutoGuard: A Self-Healing Proactive Security Layer for DevSecOps Pipelines Using Reinforcement Learning

AutoGuard：基于强化学习的DevSecOps流水线自愈主动安全层

Authors: Praveen Anugula, Avdhesh Kumar Bhardwaj, Navin Chhibber, Rohit Tewari, Sunil Khemka, Piyush Ranjan
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2512.04368
Pdf link: https://arxiv.org/pdf/2512.04368
Abstract Contemporary DevSecOps pipelines have to deal with the evolution of security in a continuously integrated and deployed environment. Existing methods, such as rule-based intrusion detection and static vulnerability scanning, are inadequate and unreceptive to changes in the system, causing longer response times and leaving organizations exposed to emerging attack vectors. In light of these constraints, we introduce AutoGuard to the DevSecOps ecosystem, a reinforcement learning (RL)-powered self-healing security framework built to pre-emptively protect DevSecOps environments. AutoGuard is a self-securing security environment that continuously observes pipeline activities for potential anomalies while preemptively remediating the environment. The model observes and reacts based on a policy that is learned continually over time. The RL agent improves each action over time through reward-based learning aimed at improving the agent's ability to prevent, detect and respond to a security incident in real time. Testing using simulated Continuous Integration / Continuous Deployment (CI/CD) environments showed AutoGuard to successfully improve threat detection accuracy by 22%, reduce mean time to recovery (MTTR) for incidents by 38% and increase overall resilience to incidents as compared to traditional methods. Keywords: DevSecOps, Reinforcement Learning, Self-Healing Security, Continuous Integration, Automated Threat Mitigation
中文摘要 当代DevSecOps流水线必须应对安全在不断集成和部署环境中演进的环境。现有方法，如基于规则的入侵检测和静态漏洞扫描，不足且难以接受系统变化，导致响应时间延长，组织需要暴露于新兴攻击向量。鉴于上述限制，我们将AutoGuard引入DevSecOps生态系统，这是一个基于强化学习（RL）驱动的自我修复安全框架，旨在预先保护DevSecOps环境。AutoGuard 是一个自我保护的安全环境，持续监控管道活动以寻找潜在异常，同时预先修复环境。该模型根据一个持续动态学习的策略进行观察和反应。强化学习者通过基于奖励的学习逐步改进每个动作，旨在提升其实时预防、检测和响应安全事件的能力。使用模拟连续集成/持续部署（CI/CD）环境进行测试显示，AutoGuard成功提升了22%的威胁检测准确率，将事件的平均恢复时间（MTTR）缩短了38%，并提高了整体事件的韧性，相较于传统方法。关键词——DevSecOps、强化学习、自我修复安全、持续集成、自动化威胁缓解

LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

LangSAT：结合自然语言处理与强化学习用于SAT解决的新框架

Authors: Muyu Pan, Matthew Walter, Dheeraj Kodakandla, Mahfuza Farooque
Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Arxiv link: https://arxiv.org/abs/2512.04374
Pdf link: https://arxiv.org/pdf/2512.04374
Abstract Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
中文摘要 我们的工作提出了一种基于强化学习（RL）的新框架，用于优化冲突驱动从句学习（CDCL）过程中的启发式选择，提高布尔满足度（SAT）求解的效率。所提系统LangSAT通过将英语描述转换为合取标准形（CNF）表达式，并使用强化学习增强的CDCL SAT求解器，弥合了自然语言输入与命题逻辑之间的鸿沟。与现有需要CNF输入的SAT解题平台不同，LangSAT允许用户输入标准英语描述，使SAT求题更易获得。该框架包含两个关键组件：Lang2Logic，将英语句子转换为CNF表达式，以及SmartSAT，基于强化学习的SAT求解器。SmartSAT将子句-变量关系编码为结构化图表示，并提取SAT问题特有的全局特征。这种实现为强化学习代理提供了更深层的上下文信息，使SAT问题能够更高效地解决。Lang2Logic 基于多种自然语言输入进行评估，可处理最多 450 个单词的描述。生成的CNF由SmartSAT解决，其在解算时间方面表现与传统CDCL启发式相当。合并后的 LangSAT 框架为推理、形式验证和调试等 SAT 求题任务提供了更易访问且可扩展的解决方案。

Learning to Orchestrate Agents in Natural Language with the Conductor

与指挥一起学习用自然语言编排代理

Authors: Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04388
Pdf link: https://arxiv.org/pdf/2512.04388
Abstract Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
中文摘要 来自不同供应商的强大大型语言模型（LLM）经过高昂的训练和精细调优，以跨越不同领域进行专门化。本研究引入了一种新型的导体模型，通过强化学习训练，自动发现大型语言模型之间强大的协调策略。我们的指挥官不仅学习设计针对性通信拓扑以实现代理间有效协作，还能向大型语言模型发出工程师导向的指令，以最大化其个体能力。我们展示了，通过在强大工作者LLM池中学习最优协调策略，7B指挥者在性能提升上超越任何单个工作者，在LiveCodeBench和GPQA等具有挑战性的推理基准中取得最先进的成果。通过随机智能体池训练，我们的导体能够有效适应任意的开源和闭源代理集合，满足任何用户需求。此外，允许指挥者选择自己为工作者，催生了递归拓扑，通过在线迭代适应提升了一种动态测试时间缩放的新型表现。更广泛地说，我们的研究是早期研究之一，证明语言模型协调可以通过强化学习（RL）解锁，强有力的协调策略在大型语言模型中自然产生，通过纯粹的端到端奖励最大化。

Quantum-Accelerated Deep Reinforcement Learning for Frequency Regulation Enhancement

量子加速深度强化学习以增强频率调控

Authors: Amin Masoumi, Mert Korkali
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.04439
Pdf link: https://arxiv.org/pdf/2512.04439
Abstract In modern power systems, frequency regulation is a fundamental prerequisite for ensuring system reliability and assessing the robustness of expansion projects. Conventional feedback control schemes, however, exhibit limited accuracy under varying operating conditions because their gains remain static. Consequently, deep reinforcement learning methods are increasingly employed to design adaptive controllers that can be generalized to diverse frequency control tasks. At the same time, recent advances in quantum computing provide avenues for embedding quantum capabilities into such critical applications. In particular, the potential of quantum algorithms can be more effectively explored and harnessed on near-term quantum devices by leveraging insights from active controller design. In this work, we incorporate a quantum circuit together with an ansatz into the operation of a deep deterministic policy gradient agent. The simulation results of the IEEE 14-bus test system demonstrate the potential of this integrated approach that can achieve reliable, robust performance across diverse real-world challenges.
中文摘要 在现代电力系统中，频率调节是确保系统可靠性和评估扩建项目稳健性的基础前提。然而，传统的反馈控制方案在不同操作条件下精度有限，因为其增益保持静态。因此，深度强化学习方法越来越多地被用于设计可推广到多种频率控制任务的自适应控制器。与此同时，近年来量子计算的进展为将量子能力嵌入此类关键应用提供了途径。特别是，通过利用主动控制器设计的洞见，可以更有效地探索和利用量子算法在近期量子设备上的潜力。在本研究中，我们将量子电路与拟设（ansatz）结合，应用于深度确定性策略梯度智能体的运行中。IEEE 14节点测试系统的仿真结果展示了这种集成方法在多种实际挑战中实现可靠且稳健性能的潜力。

MARL Warehouse Robots

MARL仓库机器人

Authors: Price Allman, Lian Thang, Dre Simmons, Salmon Riaz
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.04463
Pdf link: https://arxiv.org/pdf/2512.04463
Abstract We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX's value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning -- particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: this https URL
中文摘要 我们提出了一项针对协作仓库机器人多智能体强化学习（MARL）算法的比较研究。我们在机器人仓库（RWARE）环境和定制的Unity 3D仿真上评估QMIX和IPPO。我们的实验显示，QMIX的值分解显著优于独立学习方法（平均回报为3.25，而高级IPPO为0.38），但需要大量超参数调优——尤其是对稀疏奖励发现进行的延长epsilon退火（5M+步）。我们演示了在 Unity ML-Agents 中成功部署的情况，在 100 万训练步后实现了包的稳定交付。虽然MARL在小规模部署（2-4台机器人）方面展现出潜力，但规模化方面仍面临重大挑战。代码和分析：这个 https URL
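
The extended epsilon annealing mentioned in the abstract can be pictured as a slow linear decay of the exploration rate. The following is a minimal sketch, not the authors' implementation: only the 5M-step horizon comes from the abstract, while the function name `linear_epsilon` and the start and end values (1.0 and 0.05) are illustrative assumptions.

```python
def linear_epsilon(step, start=1.0, end=0.05, anneal_steps=5_000_000):
    """Linearly decay epsilon from `start` to `end` over `anneal_steps`, then hold.

    `start` and `end` are assumed values; only the 5M-step horizon
    is taken from the abstract.
    """
    if anneal_steps <= 0:
        return end
    # Fraction of the annealing window that has elapsed, clipped to [0, 1].
    frac = min(max(step, 0) / anneal_steps, 1.0)
    return end + (1.0 - frac) * (start - end)


if __name__ == "__main__":
    for step in (0, 1_000_000, 2_500_000, 5_000_000, 8_000_000):
        print(step, round(linear_epsilon(step), 4))
```

A longer `anneal_steps` keeps the agents exploring for more of training, which is one plausible reading of why sparse rewards are found more reliably.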

GTM: Simulating the World of Tools for AI Agents

GTM：模拟人工智能代理工具的世界

Authors: Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, Jiyan He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04535
Pdf link: https://arxiv.org/pdf/2512.04535
Abstract The integration of external tools is pivotal for empowering Large Language Model (LLM) agents with real-world capabilities. However, training these agents through direct, continuous interaction with diverse tools is often prohibitively expensive and slow, and introduces additional development and maintenance overhead. To address this challenge, we introduce the Generalist Tool Model (GTM), a 1.5-billion-parameter model that learns to act as a universal tool simulator. With only prompt-level configuration, GTM accesses tool functionalities along with input arguments and generates outputs that faithfully mimic real tool execution, providing a fast and cost-effective solution that eliminates development overhead. To build GTM, we propose the Context-Aware Response Generation (CARG) pipeline, which synthesizes comprehensive training data covering over 20,000 tools across 300 domains including physics, medicine, robotics, and finance. Through this pipeline, GTM learns to produce not only syntactically correct outputs but also logically coherent and contextually appropriate responses. Experiments demonstrate that GTM produces high-quality outputs with strong consistency and reliability. Moreover, when used in real reinforcement learning scenarios for agent training, GTM exhibits significantly faster simulation speed compared to real tools while maintaining comparable output quality, along with remarkable generalization and domain adaptability. Our results establish GTM as a foundational component for developing future AI agents, enabling efficient and scalable training of tool-augmented systems.
中文摘要 外部工具的整合对于赋予大型语言模型（LLM）智能体真实世界的能力至关重要。然而，通过直接、持续地与各种工具交互来训练这些智能体，往往成本高昂、速度缓慢，并且会带来额外的开发和维护开销。为应对这一挑战，我们引入了通用工具模型（GTM），这是一个拥有15亿参数的模型，能够学习成为通用工具模拟器。仅通过提示层面的配置，GTM 即可获取工具功能及输入参数，并生成忠实模拟真实工具执行的输出，提供快速且经济高效的解决方案，消除开发开销。为了构建GTM，我们提出了情境感知响应生成（CARG）流水线，合成了涵盖物理、医学、机器人和金融等300个领域、2万多种工具的全面训练数据。通过这一流水线，GTM不仅学会产出语法正确的输出，还能做出逻辑连贯且符合语境的响应。实验表明GTM能够产出高质量且稳定可靠的输出。此外，在真实强化学习场景中用于智能体训练时，GTM在保持相当输出质量的同时，模拟速度显著快于真实工具，并具有出色的泛化性和领域适应性。我们的研究结果确立了GTM作为未来AI智能体开发的基础组成部分，实现工具增强系统的高效且可扩展的训练。

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

RRPO：面向基于LLM的情感TTS的稳健奖励策略优化

Authors: Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2512.04552
Pdf link: https://arxiv.org/pdf/2512.04552
Abstract Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: this https URL.
中文摘要 像DiffRO这样的可微强化学习（RL）框架为可控文本转语音（TTS）提供了强大的方法，但容易受到奖励作弊（reward hacking）的影响，在情感控制等细微任务上尤其如此。策略模型可以通过生成声学伪影来利用普通的奖励模型（RM）获得虚假奖励，但代价是降低感知质量。为此，我们提出了稳健奖励策略优化（RRPO）这一新框架，采用混合正则化方案。该方案构建出一个稳健的 RM，其奖励信号更可靠地与人类感知对齐，迫使策略放弃有害的捷径，转而学习真实情感的复杂特征。我们的消融研究证实了我们RM的增强稳健性，其强大的跨语言泛化能力即是证明。主观评估表明，这种稳健的 RM 有效减轻了奖励作弊，在情感表现力和自然度上均较所有基线有显著提升。演示页面：这个 https URL。

Omniscient Attacker in Stochastic Security Games with Interdependent Nodes

节点相互依赖的随机安全博弈中的全知攻击者

Authors: Yuksel Arslantas, Ahmed Said Donmez, Ege Yuceel, Muhammed O. Sayin
Subjects: Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.04561
Pdf link: https://arxiv.org/pdf/2512.04561
Abstract The adoption of reinforcement learning for critical infrastructure defense introduces a vulnerability where sophisticated attackers can strategically exploit the defense algorithm's learning dynamics. While prior work addresses this vulnerability in the context of repeated normal-form games, its extension to the stochastic games remains an open research gap. We close this gap by examining stochastic security games between an RL defender and an omniscient attacker, utilizing a tractable linear influence network model. To overcome the structural limitations of prior methods, we propose and apply neuro-dynamic programming. Our experimental results demonstrate that the omniscient attacker can significantly outperform a naive defender, highlighting the critical vulnerability introduced by the learning dynamics and the effectiveness of the proposed strategy.
中文摘要 强化学习在关键基础设施防御中的应用引入了一个漏洞，使得高级攻击者能够战略性地利用防御算法的学习动态。虽然以往研究已在重复标准式博弈的背景下处理了这一脆弱性，但其向随机博弈的扩展仍是一个研究空白。我们通过研究强化学习防御者与全知攻击者之间的随机安全博弈，并利用易处理的线性影响网络模型来填补这一空白。为了克服以往方法的结构性局限，我们提出并应用神经动态规划。我们的实验结果表明，全知攻击者能显著胜过朴素防御者，凸显了学习动态带来的关键脆弱性以及所提策略的有效性。

COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

库珀：空间智能中合作感知与推理的统一模型

Authors: Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.04563
Pdf link: https://arxiv.org/pdf/2512.04563
Abstract Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
中文摘要 视觉空间推理对于使多模态大型语言模型（MLLM）能够理解对象属性和空间关系至关重要，然而当前模型在三维感知推理方面仍然存在困难。现有方法通常要么通过深度和分割等辅助模态扩充RGB输入来增强感知，要么通过在空间VQA数据集上训练并应用强化学习来增强推理，从而将这两个方面孤立处理。本研究探讨统一的MLLM是否能发展出增强空间感知的内在能力，并通过自适应交错推理实现更强的空间智能。我们提出了 COOPER，这是一种统一的 MLLM，利用深度和分割作为辅助模态，并通过两阶段训练获得辅助模态生成能力和自适应交错推理能力。COOPER在空间推理方面实现了平均 6.91% 的提升，同时保持了通用性能。此外，即使是仅针对辅助模态生成进行训练的变体，在距离和大小估计上也能获得 7.92% 的提升，表明学习生成辅助模态有助于内化空间知识并强化空间理解。

Gauss-Newton accelerated MPPI Control

高斯-牛顿加速MPPI控制

Authors: Hannes Homburger, Katrin Baumgärtner, Moritz Diehl, Johannes Reuter
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.04579
Pdf link: https://arxiv.org/pdf/2512.04579
Abstract Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently attracted attention, particularly in the robotics and reinforcement learning communities. MPPI has been widely applied as a GPU-accelerated random search method to deterministic direct single-shooting optimal control problems arising in model predictive control (MPC) formulations. MPPI offers several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. However, its performance can deteriorate in high-dimensional settings since the optimal control problem is solved via Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that incorporates a Jacobian reconstruction technique and the second-order Generalized Gauss-Newton method. This novel approach is called Gauss-Newton accelerated MPPI. The numerical results show that the Gauss-Newton accelerated MPPI approach substantially improves MPPI scalability and computational efficiency while preserving the key benefits of the classical MPPI framework, making it a promising approach even for high-dimensional problems.
中文摘要 模型预测路径积分（MPPI）控制是一种基于采样的优化方法，近年来在机器人学和强化学习领域引起了关注。MPPI作为一种GPU加速的随机搜索方法，被广泛应用于模型预测控制（MPC）中出现的确定性直接单次打靶最优控制问题。MPPI具有多项关键优势，包括灵活性、稳健性、易于实现以及固有的可并行性。然而，由于最优控制问题通过蒙特卡洛采样求解，其在高维环境中的性能可能会下降。为解决这一限制，本文提出了一种增强型MPPI方法，结合雅可比重建技术和二阶广义高斯-牛顿方法。这种新方法被称为高斯-牛顿加速 MPPI。数值结果表明，高斯-牛顿加速MPPI方法在保留经典MPPI框架的关键优势的同时，显著提升了MPPI的可扩展性和计算效率，使其即使对高维问题也是一种有前景的方法。
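
For readers unfamiliar with the baseline, the classical MPPI update that the paper builds on can be sketched in a few lines. This is only the standard sampling-and-reweighting step, not the Gauss-Newton accelerated variant; the toy double-integrator cost, the function names, and all parameter values (`sigma`, `lam`, `num_samples`) are illustrative assumptions.

```python
import math
import random


def mppi_step(u, rollout_cost, num_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One classical MPPI update of the control sequence `u` (list of floats)."""
    rng = rng or random.Random(0)
    # Sample Gaussian perturbations of the whole control sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in u] for _ in range(num_samples)]
    costs = [rollout_cost([ui + ni for ui, ni in zip(u, eps)]) for eps in noises]
    # Exponential weights; subtracting the minimum cost keeps exp() stable.
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # The weighted average of the perturbations is the control update.
    return [
        ui + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, ui in enumerate(u)
    ]


def double_integrator_cost(controls, dt=0.1, target=1.0):
    """Toy single-shooting rollout: drive position to `target` with small effort."""
    pos, vel, cost = 0.0, 0.0, 0.0
    for a in controls:
        vel += a * dt
        pos += vel * dt
        cost += (pos - target) ** 2 + 1e-3 * a ** 2
    return cost


if __name__ == "__main__":
    rng = random.Random(42)
    u = [0.0] * 20
    for _ in range(30):
        u = mppi_step(u, double_integrator_cost, rng=rng)
    print(double_integrator_cost([0.0] * 20), double_integrator_cost(u))
```

Because every sample perturbs all control inputs at once, the number of samples needed for a useful update grows with the horizon and input dimension, which is the scalability issue the abstract refers to.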

Semi Centralized Training Decentralized Execution Architecture for Multi Agent Deep Reinforcement Learning in Traffic Signal Control

交通信号控制中多智能体深度强化学习的半集中式训练去中心化执行架构

Authors: Pouria Yazdani, Arash Rezaali, Monireh Abdoos
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04653
Pdf link: https://arxiv.org/pdf/2512.04653
Abstract Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models covering ablations of the architecture's core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.
中文摘要 多智能体强化学习（MARL）已成为多路口自适应交通信号控制（ATSC）的有前景范式。现有的方法通常采用完全集中式或完全去中心化的设计。完全集中式方法受维数灾难和依赖单一学习服务器的困扰，而纯去中心化方法则在严重的部分可观测性下运行，缺乏显式协调，导致性能不理想。这些局限性催生了基于区域的MARL：将路网划分为由紧密耦合的路口组成的较小区域，并围绕这些区域组织训练。本文介绍了一种半集中式训练、去中心化执行（SEMI-CTDE）架构，用于多路口 ATSC。在每个区域内，SEMI-CTDE执行集中式训练并共享区域参数，同时采用复合状态和奖励表述，共同编码本地和区域信息。该架构可在不同策略骨干和状态-奖励实例之间高度迁移。基于这一架构，我们实现了两种具有不同设计目标的模型。对这两种基于SEMI-CTDE的模型进行的多视角实验分析涵盖了架构核心元素的消融，并包含基于规则的基线和完全去中心化的基线，结果显示它们始终取得更优的性能，并在广泛的交通密度和分布范围内保持有效。

TRINITY: An Evolved LLM Coordinator

TRINITY：一位进化型LLM协调员

Authors: Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04695
Pdf link: https://arxiv.org/pdf/2512.04695
Abstract Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately 0.6B parameters) and a lightweight head (approximately 10K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.
中文摘要 结合多种基础模型前景看好，但权重合并受限于架构不匹配和封闭API。Trinity通过一个轻量级协调器来解决这个问题，由它来编排大型语言模型（LLM）之间的协作。协调器由一个紧凑的语言模型（约0.6B参数）和一个轻量级的头部（约10K参数）组成，通过进化策略优化，实现高效且自适应的任务委派。Trinity在多个回合内处理查询，每回合协调器会为选定的LLM分配三种角色之一（思考者、工作者或验证者），从而有效地将复杂技能的习得从协调器自身卸载出去。实验表明，Trinity在编码、数学、推理和领域知识任务中始终优于单个模型和现有方法，并且能够稳健地泛化到分布外的任务。在标准基准测试中，Trinity取得了最先进的成绩，包括LiveCodeBench上的86.2%得分。理论和实证分析指出了这一表现背后的两个主要因素：（1）协调器的隐藏状态表示为输入提供了丰富的上下文信息；（2）在高维度和严格预算约束下，可分离协方差矩阵自适应进化策略通过利用潜在的块-ε可分离性，相较于强化学习、模仿学习和随机搜索具有优势。

RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting

RLHFSpec：通过自适应草稿生成打破RLHF训练中的效率瓶颈

Authors: Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, Depei Qian
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04752
Pdf link: https://arxiv.org/pdf/2512.04752
Abstract Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we realize the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that the RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
中文摘要 人类反馈强化学习（RLHF）是一种重要的大型语言模型（LLM）微调技术，包含三个阶段：生成、推理和训练。生成阶段生成样本，随后用于推断可供训练学习的经验。我们观察到生成阶段是整个执行过程的瓶颈，并将其视为优化的关键点。具体来说，我们首次尝试将推测解码整合进RLHF生成阶段，并提出了RLHFSpec，这是一种通过自适应推测解码和样本重新分配加速生成执行的RLHF系统。为了充分发挥推测解码的性能潜力，特别是应对生成阶段的动态工作负载，RLHFSpec 提出了一种工作负载感知的草稿策略选择机制，通过共同考虑验证成本和被接受的词元数量来选择近似最优策略。此外，RLHFSpec还提出了样本重新分配以充分利用GPU资源，并通过高效的样本迁移机制对其进行优化。实验结果表明，RLHFSpec在生成阶段的吞吐量高于最先进的现有工作。此外，由于有效缓解了生成瓶颈，RLHFSpec在整个RLHF执行过程中也取得了显著的加速。
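
The trade-off between verification cost and accepted tokens can be illustrated with the standard expected-acceptance formula for speculative decoding. The sketch below is not RLHFSpec's mechanism: it assumes each drafted token is accepted independently with probability `accept_prob`, and the cost model (`draft_cost` per drafted token plus a caller-supplied `verify_cost(k)`) and both function names are hypothetical.

```python
def expected_tokens(accept_prob, draft_len):
    """Expected tokens produced per verification step.

    Assumes i.i.d. acceptance with probability `accept_prob`; the target
    model always contributes one token, so draft_len = 0 yields 1.
    """
    if accept_prob >= 1.0:
        return draft_len + 1
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def pick_draft_len(accept_prob, draft_cost, verify_cost, max_len=8):
    """Return the draft length with the best expected tokens per unit cost.

    `verify_cost` is a callable mapping draft length k to the cost of one
    verification pass over k drafted tokens (a hypothetical cost model).
    """
    def throughput(k):
        return expected_tokens(accept_prob, k) / (k * draft_cost + verify_cost(k))

    return max(range(max_len + 1), key=throughput)


if __name__ == "__main__":
    # Verification gets more expensive as the batch of drafted tokens grows.
    cost = lambda k: 1.0 + 0.05 * k
    for p in (0.3, 0.6, 0.9):
        print(p, pick_draft_len(p, draft_cost=0.1, verify_cost=cost))
```

Under a heavier workload the verification pass typically costs more per drafted token, so the best draft length shrinks; a workload-aware selector would re-evaluate this choice as load changes.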

Using Machine Learning to Take Stay-or-Go Decisions in Data-driven Drone Missions

利用机器学习在数据驱动的无人机任务中做出“留下还是离开”的决策

Authors: Giorgos Polychronis, Foivos Pournaropoulos, Christos D. Antonopoulos, Spyros Lalis
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04773
Pdf link: https://arxiv.org/pdf/2512.04773
Abstract Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving to the next point. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time to fly-back. To take this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can significantly improve the worst-case mission time by up to 4.1x. Also, the achieved median mission time is very close, merely up to 2.7% higher, to that of a method with perfect knowledge of the current underlying event probability at each point of interest.
中文摘要 无人机在许多应用领域正变得不可或缺。在数据驱动任务中，除了传感外，无人机还需在运行时处理收集到的数据，决定是否需要现场采取额外行动，然后再前往下一个关注点。如果处理过程中没有发现需要采取此类行动的事件或情境，那么无人机就是徒劳地等待，而不是前往下一个点。然而，如果无人机开始移动到下一个点，且发现前一个点需要后续行动，它必须花时间飞回。为做出这一决定，我们提出了基于分支预测和强化学习的不同机器学习方法。我们评估这些方法适用于事件发生概率随时间变化的各种情景。我们的结果表明，所提出的方法始终优于文献中提出的基于回归的方法，并且最坏情况下的任务时间可显著提升4.1倍。此外，实现的中位任务时间与完全了解当前事件概率的方法非常接近，仅高出最多2.7%。
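
One way to picture a branch-prediction style stay-or-go rule is the classic two-bit saturating counter from processor design. The abstract does not say which predictor the authors use, so the class below is an assumed illustration: the name `TwoBitPredictor` and the mapping of counter states to "stay" and "go" are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict 'go', states 2-3 predict 'stay'."""

    def __init__(self, state=1):
        self.state = state

    def predict_stay(self):
        """Return True if the drone should wait for processing to finish."""
        return self.state >= 2

    def update(self, event_occurred):
        """Shift toward 3 after an event that needed follow-up, toward 0 otherwise."""
        if event_occurred:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    for outcome in (True, True, False, False, False):
        print("stay" if predictor.predict_stay() else "go", "-> event:", outcome)
        predictor.update(outcome)
```

The two-bit hysteresis means a single surprising outcome does not flip a confident prediction, which suits event probabilities that drift over time rather than jump.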

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

YingMusic-Singer：基于免标注旋律引导的零样本歌唱声合成与编辑

Authors: Junjie Zheng, Chunbo Hao, Guobin Ma, Xiaoyu Zhang, Gongyu Chen, Chaofan Ding, Zihao Chen, Lei Xie
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04779
Pdf link: https://arxiv.org/pdf/2512.04779
Abstract Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration modeling using weakly annotated song data and introduce a Flow-GRPO reinforcement learning strategy with a multi-objective reward function to jointly enhance pronunciation clarity and melodic fidelity. Experiments show that our model achieves superior performance over existing approaches in both objective measures and subjective listening tests, especially in zero-shot and lyric adaptation settings, while maintaining high audio quality without manual annotation. This work offers a practical and scalable solution for advancing data-efficient singing voice synthesis. To support reproducibility, we release our inference code and model checkpoints.
中文摘要 歌唱声合成（SVS）在实际部署中仍受限于其对准确的音素级对齐和人工标注旋律轮廓的高度依赖，这些要求资源密集且阻碍了可扩展性。为克服这些限制，我们提出了一种旋律驱动的SVS框架，能够按照任意参考旋律合成任意歌词，而无需依赖音素级对齐。我们的方法基于扩散Transformer（DiT）架构，并配备了专用旋律提取模块，直接从参考音频中提取旋律表示。为确保旋律编码稳健，我们采用教师模型指导旋律提取器的优化，同时采用隐式对齐机制，通过施加相似度分布约束来提升旋律稳定性和连贯性。此外，我们利用弱标注歌曲数据优化时长建模，并引入带有多目标奖励函数的Flow-GRPO强化学习策略，共同提升发音清晰度和旋律保真度。实验显示，我们的模型在客观指标和主观听感测试中都优于现有方法，在零样本和歌词改编场景中尤为明显，同时保持高音质且无需人工标注。这项工作为推进数据高效的歌唱声合成提供了一种实用且可扩展的解决方案。为了支持可复现性，我们发布了推理代码和模型检查点。

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

PaCo-RL：通过成对奖励建模推进强化学习以实现一致图像生成

Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.04784
Pdf link: https://arxiv.org/pdf/2512.04784
Abstract Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at this https URL.
中文摘要 一致的图像生成需要忠实地保留多张图像的身份、风格和逻辑连贯性，这对于讲故事和角色设计等应用至关重要。监督式训练方法在这一任务上遇到困难，原因是缺乏能够捕捉视觉一致性的大规模数据集，以及建模人类感知偏好的复杂性。本文论证强化学习（RL）提供了一个有前景的替代方案，使模型能够以无数据的方式学习复杂且主观的视觉标准。为此，我们引入了PaCo-RL，这是一个综合框架，结合了专业的一致性奖励模型和高效的强化学习算法。第一个组件PaCo-Reward是一个两两一致性评估器，基于通过自动子图配对构建的大规模数据集训练。它通过生成式、自回归的评分机制评估一致性，并辅以任务感知指令和CoT理由。第二个组件PaCo-GRPO采用了一种新颖的分辨率解耦优化策略，大幅降低了强化学习的成本，同时采用了对数驯服的多奖励聚合机制，确保奖励优化的平衡与稳定。在两个代表性子任务中的大量实验表明，PaCo-Reward显著提升了与人类视觉一致性感知的对齐度，而PaCo-GRPO则实现了最先进的一致性表现，同时提升了训练效率和稳定性。这些结果共同凸显了PaCo-RL作为一种实用且可扩展的稳定图像生成解决方案的潜力。项目页面可在此 https 网址访问。

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

YingMusic-SVC：结合Flow-GRPO和歌唱特有归纳偏置的真实场景稳健零样本歌唱声转换

Authors: Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04793
Pdf link: https://arxiv.org/pdf/2512.04793
Abstract Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.
中文摘要 歌唱声转换（SVC）旨在呈现目标歌手的音色，同时保留旋律和歌词。然而，由于和声干扰、F0误差以及缺乏针对歌唱的归纳偏置，现有的零样本SVC系统在真实歌曲中仍然脆弱。我们提出了YingMusic-SVC，这是一个稳健的零样本框架，统一了持续预训练、稳健的监督微调和Flow-GRPO强化学习。我们的模型引入了用于音色与内容解耦的、经歌唱数据训练的RVC音色转换器，用于动态人声表达的F0感知音色适配器，以及用于增强高频保真度的能量平衡整流流匹配损失。在分级多轨基准上的实验显示，YingMusic-SVC在音色相似度、可懂度和感知自然度方面相较强大的开源基线实现了一致的提升，尤其是在带伴奏和受和声干扰的条件下，证明了其在实际SVC部署中的有效性。

Safe model-based Reinforcement Learning via Model Predictive Control and Control Barrier Functions

基于模型的安全强化学习，通过模型预测控制和控制障碍函数

Authors: Kerim Dzhumageldyev, Filippo Airaldi, Azita Dabiri
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.04856
Pdf link: https://arxiv.org/pdf/2512.04856
Abstract Optimal control strategies are often combined with safety certificates to ensure both performance and safety in safety-critical systems. A prominent example is combining Model Predictive Control (MPC) with Control Barrier Functions (CBF). Yet, efficient tuning of MPC parameters and choosing an appropriate class $\mathcal{K}$ function in the CBF is challenging and problem dependent. This paper introduces a safe model-based Reinforcement Learning (RL) framework where a parametric MPC controller incorporates a CBF constraint with a parameterized class $\mathcal{K}$ function and serves as a function approximator to learn improved safe control policies from data. Three variations of the framework are introduced, distinguished by the way the optimization problem is formulated and the class $\mathcal{K}$ function is parameterized, including neural architectures. Numerical experiments on a discrete double-integrator with static and dynamic obstacles demonstrate that the proposed methods improve performance while ensuring safety.
中文摘要 最优控制策略通常与安全证书结合，以确保安全关键系统的性能和安全。一个显著的例子是将模型预测控制（MPC）与控制障碍函数（CBF）结合起来。然而，高效调优MPC参数并在CBF中选择合适的类$\mathcal{K}$函数具有挑战性，且依赖于具体问题。本文提出了一个安全的基于模型的强化学习（RL）框架，其中参数化MPC控制器包含带有参数化类$\mathcal{K}$函数的CBF约束，并作为函数逼近器，从数据中学习改进的安全控制策略。文中引入了该框架的三种变体，区别在于优化问题的表述方式和类 $\mathcal{K}$ 函数的参数化方式（包括神经网络架构）。在带有静态和动态障碍物的离散双积分器上的数值实验表明，所提方法在保证安全性的同时提升了性能。
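
To make the role of the class $\mathcal{K}$ function concrete, a common discrete-time CBF condition is $h(x_{k+1}) \ge (1-\gamma)\,h(x_k)$, which corresponds to the linear choice $\alpha(h) = \gamma h$ with $0 < \gamma \le 1$. The sketch below checks that condition on a discrete double integrator; it is an assumed illustration rather than the paper's formulation, and the wall obstacle, the function names, and the fixed `gamma` are all hypothetical stand-ins for what the paper parameterizes and learns.

```python
def step(state, accel, dt=0.1):
    """Discrete double integrator: state = (position, velocity)."""
    pos, vel = state
    return (pos + vel * dt + 0.5 * accel * dt ** 2, vel + accel * dt)


def barrier(state, obstacle=1.0):
    """h(x) >= 0 means the position is on the safe side of a wall at `obstacle`."""
    return obstacle - state[0]


def satisfies_cbf(state, accel, gamma=0.5):
    """Check h(x_next) >= (1 - gamma) * h(x), i.e. alpha(h) = gamma * h."""
    return barrier(step(state, accel)) >= (1.0 - gamma) * barrier(state)


if __name__ == "__main__":
    print(satisfies_cbf((0.0, 1.0), 0.0))   # far from the wall, moving slowly
    print(satisfies_cbf((0.9, 3.0), 0.0))   # close to the wall, moving fast
```

A smaller `gamma` forces the controller to slow down earlier, while a larger one permits a closer approach, which is why choosing or learning the class $\mathcal{K}$ function affects performance as well as safety.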

Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

多智能体强化学习用于不确定性下的日内手术室排程

Authors: Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.04918
Pdf link: https://arxiv.org/pdf/2512.04918
Abstract Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
中文摘要 日内手术排程是一个不确定性下的多目标决策问题，需要在择期手术吞吐量、加急与急诊需求、延误、顺序相关的准备时间以及加班之间取得平衡。我们将该问题建模为合作式马尔可夫博弈，并提出了一个多智能体强化学习（MARL）框架，其中每个手术室（OR）都是采用集中式训练和去中心化执行进行训练的智能体。所有智能体共享一个通过近端策略优化（PPO）训练的策略，该策略将丰富的系统状态映射到动作，而时段内的顺序分配协议则构建跨手术室的无冲突联合排程。混合整数预排程为择期手术提供参考开始时间；我们相对于这些参考时间施加按手术类型区分的二次延迟惩罚以及终端加班惩罚，从而得到一个涵盖吞吐量、及时性和员工工作负荷的单一奖励。在反映现实医院构成（六个手术室、八种手术类型、随机到达的加急和急诊患者）的模拟中，所学策略在七个指标和三个评估子集上优于六个基于规则的启发式方法，并且相对于事后MIP预言机量化了最优性差距。策略分析揭示了可解释的行为：优先处理急诊、将相似病例成批安排以减少准备时间，以及推迟价值较低的择期手术。我们还在简化假设下推导出顺序分解的次优性界。我们讨论了局限性（包括手术室的同质性以及未考虑显式的人员配置约束），并概述了扩展方向。总体而言，该方法为实时手术室排程提供了一种实用、可解释且可调的数据驱动方案，可作为优化方法的补充。

CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent

CARL：多步代理的关键行动聚焦强化学习

Authors: Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.04949
Pdf link: https://arxiv.org/pdf/2512.04949
Abstract Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
中文摘要 能够通过与环境多次互动完成复杂任务的智能体已成为一种流行的研究方向。然而，在这种多步环境中，传统的组级策略优化算法因假设每个动作贡献相等而变得次优，这与现实有显著偏差。我们的分析显示，只有极少数行动对最终结果至关重要。基于这一见解，我们提出了CARL，一种针对多步代理量身定制的关键动作导向强化学习算法。CARL通过为高关键性动作提供动作级优化信号，同时将低关键性动作排除在模型更新之外，实现了聚焦训练。大量实验表明，CARL在训练和推断过程中，在多样化评估环境中既能实现更强的性能，也提高了效率。
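The idea of excluding low-criticality actions from the model update can be sketched as a masked policy-gradient loss. The abstract does not say how criticality is measured or how the action-level signal is computed, so everything below is an assumption: `criticality` is taken to be a precomputed per-action score, `threshold` is a hypothetical cutoff, and a plain REINFORCE-style term stands in for whatever objective CARL actually uses.

```python
def masked_action_weights(criticality, threshold):
    """Return 1.0 for actions kept in the update and 0.0 for excluded ones."""
    return [1.0 if c >= threshold else 0.0 for c in criticality]


def focused_policy_loss(log_probs, advantages, criticality, threshold):
    """Average a REINFORCE-style loss over high-criticality actions only."""
    weights = masked_action_weights(criticality, threshold)
    kept = sum(weights)
    if kept == 0:
        return 0.0  # nothing to update on in this trajectory
    total = sum(
        -w * lp * adv for w, lp, adv in zip(weights, log_probs, advantages)
    )
    return total / kept


# Hypothetical three-step trajectory; the middle action is treated as non-critical.
loss = focused_policy_loss(
    log_probs=[-1.0, -2.0, -0.5],
    advantages=[1.0, 2.0, -1.0],
    criticality=[0.9, 0.1, 0.8],
    threshold=0.5,
)
```

In this sketch, masked actions contribute no gradient at all, which is where any training-time savings would come from.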

Realizable Abstractions: Near-Optimal Hierarchical Reinforcement Learning

可实现抽象：近优层级强化学习

Authors: Roberto Cipollone, Luca Iocchi, Matteo Leonetti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04958
Pdf link: https://arxiv.org/pdf/2512.04958
Abstract The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.
中文摘要 分层强化学习（HRL）主要研究如何以模块化方式、通过组合为较小子任务计算出的部分解，更高效地求解大型马尔可夫决策过程（MDP）。尽管 MDP 抽象对学习的作用十分直观，但 HRL 文献中提出的大多数抽象概念表达能力有限，或缺乏形式化的效率保证。本研究通过定义可实现抽象（Realizable Abstractions）来解决这些基本问题，这是通用低层MDP与其相关高层决策过程之间的一种新关系。我们提出的概念避免了非马尔可夫性问题，并具有理想的近优性保证。事实上，我们证明了可实现抽象的任何抽象策略都可以通过合适的选项（options）组合，转化为低层MDP的近似最优策略。如论文所示，这些选项可以表示为特定约束MDP的解。基于这些发现，我们提出了RARL，一种新的HRL算法，它利用输入中给出的可实现抽象，返回组合式且近优的低层策略。我们证明 RARL 是概率近似正确（PAC）的，能在多项式数量的样本内收敛，并且对抽象中的不准确性具有鲁棒性。

From Generated Human Videos to Physically Plausible Robot Trajectories

从生成的人类视频到物理上合理的机器人轨迹

Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.05094
Pdf link: https://arxiv.org/pdf/2512.05094
Abstract Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
中文摘要 视频生成模型在新情境中合成人类动作的能力正在迅速提升，有潜力作为情境化机器人控制的高层规划器。为了实现这一潜力，一个关键研究问题仍未解决：人形机器人如何以零样本方式执行生成视频中的人类动作？这一挑战源于生成视频通常噪声较大，且存在形态失真，使得直接模仿相比真实视频更困难。为此，我们引入了两阶段流水线。首先，我们将视频像素提升为四维人体表示，然后重定向到人形机器人的形态。其次，我们提出GenMimic，一种以3D关键点为条件的物理感知强化学习策略，并通过对称性正则化和关键点加权跟踪奖励进行训练。因此，GenMimic可以模仿带噪声的生成视频中的人类动作。我们构建了GenMimicBench，这是一个使用两个视频生成模型生成的合成人体运动数据集，涵盖多种动作和情境，为评估零样本泛化和策略鲁棒性建立了基准。大量实验证明了该方法在仿真中相较于强基线的改进，并确认了其在Unitree G1人形机器人上无需微调即可实现连贯且物理稳定的运动跟踪。这项工作为实现视频生成模型作为机器人控制高层策略的潜力提供了一条有前景的路径。

SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

SA-IQA：以多维奖励重新定义面向空间美学的图像质量评估

Authors: Yuan Gao, Jin Song
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.05098
Pdf link: https://arxiv.org/pdf/2512.05098
Abstract In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
中文摘要 近年来，AI生成图像（AIGI）的图像质量评估（IQA）发展迅速；然而，现有方法主要针对肖像和艺术图像，缺乏对室内场景的系统评估。我们提出空间美学（Spatial Aesthetics），这是一种从布局、和谐、光照和畸变四个维度评估室内图像美学质量的范式。我们构建了SA-BENCH，这是空间美学的首个基准，包含18,000张图像和50,000条精确标注。借助SA-BENCH，我们系统评估了当前的IQA方法，并通过MLLM微调和多维融合方法开发了SA-IQA，作为评估空间美学的综合奖励框架。我们将SA-IQA应用于两个下游任务：（1）作为奖励信号与GRPO强化学习结合，以优化AIGC生成流程；（2）通过Best-of-N选择筛选高质量图像并提升生成质量。实验显示，SA-IQA在SA-BENCH上显著优于现有方法，树立了空间美学评估的新标准。代码和数据集将开源，以推动该领域的研究和应用。
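Best-of-N selection with a multi-dimensional reward, as mentioned in the abstract above, can be sketched as follows. The four dimension names come from the abstract; the fusion rule is not specified there, so the weighted average in `fused_score`, the `weights` dictionary, and the candidate record layout are all assumptions made for illustration.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")


def fused_score(scores, weights):
    """Weighted average of per-dimension scores (assumed fusion rule)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


def best_of_n(candidates, weights):
    """Pick the candidate image whose fused score is highest."""
    return max(candidates, key=lambda c: fused_score(c["scores"], weights))


# Hypothetical scores for two generated interior images.
candidates = [
    {"id": "a", "scores": {"layout": 0.9, "harmony": 0.2, "lighting": 0.5, "distortion": 0.4}},
    {"id": "b", "scores": {"layout": 0.6, "harmony": 0.6, "lighting": 0.6, "distortion": 0.6}},
]
equal_weights = {d: 1.0 for d in DIMENSIONS}
winner = best_of_n(candidates, equal_weights)
```

With equal weights the more balanced image wins in this example; raising the weight on `layout` would shift the choice, which is the kind of tuning a fused reward allows.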

Structured Document Translation via Format Reinforcement Learning

通过格式强化学习实现结构化文档翻译

Authors: Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka, Bianka Buschbeck, Masao Utiyama
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.05100
Pdf link: https://arxiv.org/pdf/2512.05100
Abstract Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees, and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics, and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
中文摘要 近期关于结构化文本翻译的研究仍限于句子层面，因为它们难以有效处理复杂的文档级XML或HTML结构。为此，我们提出了格式强化学习（FormatRL），它在监督微调模型基础上采用组相对策略优化（Group Relative Policy Optimization），直接优化新颖的结构感知奖励：1）TreeSim，衡量预测与参考 XML 树之间的结构相似性；2）Node-chrF，衡量 XML 节点层面的翻译质量。此外，我们还采用了StrucAUC，这是一种细粒度指标，用于区分轻微错误和重大结构性失败。在SAP软件文档基准上的实验显示了六项指标的改进，进一步的分析展示了不同奖励函数如何促进结构质量和翻译质量的提升。
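Group Relative Policy Optimization, which the abstract above builds on, scores each sampled output relative to the other samples drawn for the same input. A minimal sketch of that normalization follows. How FormatRL combines TreeSim and Node-chrF is not stated in the abstract, so `combined_reward` and its `alpha` parameter are hypothetical, and using the population standard deviation with a small `eps` is one common convention rather than a confirmed detail of the paper.

```python
from statistics import mean, pstdev


def combined_reward(tree_sim, node_chrf, alpha=0.5):
    """Hypothetical weighted mix of a structural and a translation-quality score."""
    return alpha * tree_sim + (1.0 - alpha) * node_chrf


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (TreeSim, Node-chrF) scores for four sampled translations.
samples = [(1.0, 0.7), (0.4, 0.8), (1.0, 0.5), (0.2, 0.3)]
rewards = [combined_reward(t, c) for t, c in samples]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and the rest negative ones, so no separately learned value function is needed for the baseline.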

Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

语义软引导：无强化学习的大型语言模型中的长上下文推理

Authors: Purbesh Mitra, Sennur Ulukus
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2512.05105
Pdf link: https://arxiv.org/pdf/2512.05105
Abstract Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, such as math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem, and several rollouts are generated. From them, the correct response and the most common incorrect response are filtered and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. The generation process also produces a sequence of logits, which the student model tries to match in the training phase from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and then test its accuracy on the MATH500 and AIME2024 benchmarks. The results show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. The code, model, and curated dataset are available via the links on the arXiv abstract page.
中文摘要 大型语言模型（LLM）中的长上下文推理已被证明可通过思维链（CoT）推理增强其认知能力。这类模型的训练通常在数学和编程等推理类问题上，通过带可验证奖励的强化学习（RLVR）进行。然而，RLVR受限于若干瓶颈，如缺乏密集奖励和样本效率不足。因此，它在后训练阶段需要大量计算资源。为克服这些限制，本研究提出了语义软引导（Semantic Soft Bootstrapping，SSB），一种自蒸馏技术，其中同一基础语言模型同时扮演教师和学生的角色，但在训练时接收到关于其结果正确性的不同语义上下文。首先用一道数学题提示模型，并生成多个采样结果（rollout）。从中筛选出正确回答和最常见的错误回答，然后在上下文中提供给模型，以产生更稳健的逐步解释和经过验证的最终答案。该流程自动从原始的问题-答案数据中构建成对的师生训练集，无需人工干预。这一生成过程还会产生一个logits序列，学生模型在训练阶段仅凭原始问题来尝试匹配该序列。在我们的实验中，我们通过参数高效微调在 GSM8K 数据集上训练 Qwen2.5-3B-Instruct，随后在MATH500和AIME2024基准上测试其准确率。实验显示，相较于常用的RLVR算法组相对策略优化（GRPO），准确率分别提升了10.6%和10%。代码、模型和构建的数据集可通过arXiv摘要页面上的链接获取。
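The student's objective of matching the teacher's sequence of logits can be illustrated with a per-token divergence between the two output distributions. The abstract does not name the loss, so the KL divergence used below is an assumption, as are the function names and the toy three-token vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sequence_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a sequence of logit vectors."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Hypothetical logits over a 3-token vocabulary at two positions.
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.5]]
loss = sequence_distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge, giving a dense per-token signal in contrast to a single outcome reward.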

STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models

STARE-VLA：渐进式阶段感知强化，用于微调视觉-语言-行动模型

Authors: Feng Xu, Guangyao Zhai, Xin Kong, Tingzhong Fu, Daniel F.N. Gordon, Xueli An, Benjamin Busam
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.05107
Pdf link: https://arxiv.org/pdf/2512.05107
Abstract Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have brought remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. We therefore present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO yields Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference and online intra-stage interaction, respectively. Building further on supervised fine-tuning as initialization, we propose Imitation -> Preference -> Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0% on SimplerEnv and 96.4% on ManiSkill3 tasks.
中文摘要 视觉-语言-动作（VLA）模型的最新进展，借助大型语言模型和基于强化学习的微调技术，在机器人操作方面取得了显著进步。现有方法通常将长时程动作视为语言序列，并应用轨迹级优化方法，如轨迹偏好优化（TPO）或近端策略优化（PPO），导致信用分配粗糙且训练不稳定。然而，语言即使句序灵活也能保持统一的语义，动作轨迹则不同，它沿着因果相连的阶段推进，各阶段的学习难度各异。这促使我们采用渐进式阶段优化。因此，我们提出了阶段感知强化（STARE），这是一个将长时程动作轨迹分解为具有语义意义的阶段的模块，并提供密集、可解释且与阶段对齐的强化信号。将STARE集成到TPO和PPO中，我们得到了阶段感知TPO（STA-TPO）和阶段感知PPO（STA-PPO），分别用于离线的阶段级偏好学习和在线的阶段内交互。进一步以监督微调作为初始化，我们提出了模仿->偏好->交互（IPI），这是一种用于提升VLA模型动作精度的串行微调流水线。在SimplerEnv和ManiSkill3上的实验显示出显著提升，取得了最先进的成功率：SimplerEnv为98.0%，ManiSkill3任务为96.4%。

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

ARM-Thinker：通过智能体式工具使用和视觉推理强化多模态生成式奖励模型

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.05111
Pdf link: https://arxiv.org/pdf/2512.05111
Abstract Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
中文摘要 奖励模型对于使视觉语言系统与人类偏好对齐至关重要，但当前方法存在幻觉、视觉定位能力薄弱以及无法使用工具进行验证等问题，限制了其在复杂多模态推理任务中的可靠性。我们提出了ARM-Thinker，一种智能体式（Agentic）多模态奖励模型，能够自主调用外部工具（如图像裁剪、文档页面检索），使判断建立在可验证的证据之上，取代静态、非交互式的奖励评分。这使得模型能够验证细粒度的视觉细节、交叉比对多页证据，并验证推理主张，这些能力在现有奖励模型中是缺失的。我们通过多阶段强化学习训练ARM-Thinker，联合优化工具调用决策和判断准确性。为了评估智能体式奖励建模，我们引入了ARMBench-VL，包含三个基准，分别评估细粒度视觉定位（图像级工具）、多页文档理解（检索工具）和指令遵循（文本级验证）。ARM-Thinker在奖励建模基准上平均提升16.2%，在工具使用任务上提升9.6%，并在多模态数学和逻辑推理基准上优于基线。我们的结果表明，智能体能力显著提升了奖励模型的准确性和可解释性。

Keyword: diffusion policy

Bridging Simulation and Reality: Cross-Domain Transfer with Semantic 2D Gaussian Splatting

连接仿真与现实：基于语义二维高斯泼溅的跨域迁移

Authors: Jian Tang, Pu Pang, Haowen Sun, Chengzhong Ma, Xingyu Chen, Hua Huang, Xuguang Lan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.04731
Pdf link: https://arxiv.org/pdf/2512.04731
Abstract Cross-domain transfer in robotic manipulation remains a longstanding challenge due to the significant domain gap between simulated and real-world environments. Existing methods such as domain randomization, adaptation, and sim-real calibration often require extensive tuning or fail to generalize to unseen scenarios. To address this issue, we observe that if domain-invariant features are utilized during policy training in simulation, and the same features can be extracted and provided as the input to policy during real-world deployment, the domain gap can be effectively bridged, leading to significantly improved policy generalization. Accordingly, we propose Semantic 2D Gaussian Splatting (S2GS), a novel representation method that extracts object-centric, domain-invariant spatial features. S2GS constructs multi-view 2D semantic fields and projects them into a unified 3D space via feature-level Gaussian splatting. A semantic filtering mechanism removes irrelevant background content, ensuring clean and consistent inputs for policy learning. To evaluate the effectiveness of S2GS, we adopt Diffusion Policy as the downstream learning algorithm and conduct experiments in the ManiSkill simulation environment, followed by real-world deployment. Results demonstrate that S2GS significantly improves sim-to-real transferability, maintaining high and stable task performance in real-world scenarios.
中文摘要 由于仿真环境与现实环境之间存在显著的域差距，机器人操作中的跨域迁移仍是一个长期存在的挑战。现有方法如域随机化、域自适应和仿真-现实校准通常需要大量调参，或无法泛化到未见场景。为解决这一问题，我们观察到，如果在仿真中训练策略时利用域不变特征，并且在现实部署时能够提取相同特征并作为策略输入，则可以有效弥合域差距，从而显著提升策略泛化能力。因此，我们提出了语义二维高斯泼溅（S2GS），这是一种新颖的表示方法，能够提取以物体为中心、域不变的空间特征。S2GS构建多视角二维语义场，并通过特征级高斯泼溅将其投影到统一的三维空间中。语义过滤机制去除无关的背景内容，确保策略学习的输入干净且一致。为评估S2GS的有效性，我们采用扩散策略（Diffusion Policy）作为下游学习算法，在ManiSkill仿真环境中进行实验，随后进行现实部署。结果表明，S2GS显著提升了仿真到现实的可迁移性，在现实场景中保持高且稳定的任务性能。