Arxiv Papers of Today

Generated: 2025-11-19 16:30:31 (UTC+8); Arxiv release time: 2025-11-19 20:00 EST (2025-11-20 09:00 UTC+8)

25 related papers today

Keyword: reinforcement learning

Deep reinforcement learning-based spacecraft attitude control with pointing keep-out constraint

Authors: Juntang Yang, Mohamed Khalil Ben-Larbi
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.13746
Pdf link: https://arxiv.org/pdf/2511.13746
Abstract This paper implements deep reinforcement learning (DRL) for spacecraft reorientation control with a single pointing keep-out zone. The Soft Actor-Critic (SAC) algorithm is adopted to handle continuous state and action space. A new state representation is designed to explicitly include a compact representation of the attitude constraint zone. The reward function is formulated to achieve the control objective while enforcing the attitude constraint. A curriculum learning approach is used for the agent training. Simulation results demonstrate the effectiveness of the proposed DRL-based method for spacecraft pointing-constrained attitude control.
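
The pointing keep-out zone in the abstract above is commonly modeled as a cone around a forbidden direction (for example, toward a bright object). The sketch below checks such a constraint under that assumption. The cone model and the names `angle_between` and `keep_out_margin` are illustrative choices, not the paper's state representation or reward function.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero 3-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def keep_out_margin(boresight, forbidden_dir, half_angle_rad):
    """Angular margin to the cone boundary (assumed cone-shaped zone).

    Positive: the boresight is outside the keep-out cone (constraint met).
    Negative: the boresight is inside the cone (constraint violated).
    """
    return angle_between(boresight, forbidden_dir) - half_angle_rad
```

A margin of this kind could feed a penalty term or a state feature, but how the paper actually encodes the zone is not stated in the abstract.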

Quantifying Distribution Shift in Traffic Signal Control with Histogram-Based GEH Distance

Authors: Federico Taschin, Ozan K. Tonguz
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13785
Pdf link: https://arxiv.org/pdf/2511.13785
Abstract Traffic signal control algorithms are vulnerable to distribution shift, where performance degrades under traffic conditions that differ from those seen during design or training. This paper introduces a principled approach to quantify distribution shift by representing traffic scenarios as demand histograms and comparing them with a GEH-based distance function. The method is policy-independent, interpretable, and leverages a widely used traffic engineering statistic. We validate the approach on 20 simulated scenarios using both a NEMA actuated controller and a reinforcement learning controller (FRAP++). Results show that larger scenario distances consistently correspond to increased travel time and reduced throughput, with particularly strong explanatory power for learning-based control. Overall, this method can predict performance degradation under distribution shift better than previously published techniques. These findings highlight the utility of the proposed framework for benchmarking, training regime design, and monitoring in adaptive traffic signal control.
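
For readers unfamiliar with the GEH statistic named in the abstract above, here is a minimal sketch. The per-pair formula is the standard traffic-engineering one, GEH = sqrt(2(M - C)^2 / (M + C)) for two hourly volumes M and C. Averaging it over histogram bins (`histogram_geh_distance`) is an assumption made for illustration; the abstract does not say how the paper aggregates bins.

```python
import math

def geh(m, c):
    """Standard GEH statistic between two hourly traffic volumes."""
    if m + c == 0:
        return 0.0  # two empty counts are treated as identical
    return math.sqrt(2.0 * (m - c) ** 2 / (m + c))

def histogram_geh_distance(hist_a, hist_b):
    """Mean GEH over matching bins (hypothetical aggregation rule)."""
    if len(hist_a) != len(hist_b) or not hist_a:
        raise ValueError("histograms must be non-empty and equally long")
    return sum(geh(a, b) for a, b in zip(hist_a, hist_b)) / len(hist_a)
```

Because GEH scales the difference by the combined volume, a gap of 20 vehicles per hour counts for more on a quiet approach than on a busy one.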

Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

Authors: Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.13841
Pdf link: https://arxiv.org/pdf/2511.13841
Abstract Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck, the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
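
The idea of drafting from past rollouts and verifying against the target model can be illustrated with a toy sketch. Everything here is a simplifying assumption: a linear scan over stored rollouts stands in for the suffix tree, verification is greedy and emulated token by token, and `target_next` is a hypothetical stand-in for the model's next-token function. It is not the DAS implementation.

```python
def draft_from_history(context, history, max_draft, max_suffix=4):
    """Propose tokens that followed the longest matching suffix in past rollouts."""
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-k:]
        for past in history:
            # Stop early enough that at least one token follows the match.
            for i in range(len(past) - k):
                if past[i:i + k] == suffix:
                    return past[i + k:i + k + max_draft]
    return []

def speculative_generate(target_next, prompt, history, n_tokens, max_draft=4):
    """Greedy speculative decoding; output equals plain greedy decoding."""
    out = list(prompt)
    verify_rounds = 0
    while len(out) - len(prompt) < n_tokens:
        draft = draft_from_history(out, history, max_draft)
        verify_rounds += 1  # a real system scores the whole draft in one pass
        for tok in draft:
            if target_next(out) != tok:
                break  # first mismatch: discard the rest of the draft
            out.append(tok)
        # The target model always contributes one token of its own.
        out.append(target_next(out))
    return out[len(prompt):len(prompt) + n_tokens], verify_rounds
```

Since a draft token is kept only when it matches what the target model would have produced, the generated sequence is unchanged and only the number of verification rounds drops. A length-aware policy as described in the abstract would vary `max_draft` per trajectory, which this sketch leaves fixed.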

TaoSearchEmb: A Multi-Objective Reinforcement Learning Framework for Dense Retrieval in Taobao Search

Authors: Xingxian Liu, Dongshuai Li, Tao Wen, Jiahui Wan, Gui Ling, Fuyu Lv, Dan Ou, Haihong Tang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.13885
Pdf link: https://arxiv.org/pdf/2511.13885
Abstract Dense retrieval, as the core component of e-commerce search engines, maps user queries and items into a unified semantic space through pre-trained embedding models to enable large-scale real-time semantic retrieval. Despite the rapid advancement of LLMs gradually replacing traditional BERT architectures for embedding, their training paradigms still adhere to BERT-like supervised fine-tuning and hard negative mining strategies. This approach relies on complex offline hard negative sample construction pipelines, which constrain model iteration efficiency and hinder the evolutionary potential of semantic representation capabilities. Besides, existing multi-task learning frameworks face the seesaw effect when simultaneously optimizing semantic relevance and non-relevance objectives. In this paper, we propose Retrieval-GRPO, a multi-objective reinforcement learning-based dense retrieval framework designed to address these challenges. The method eliminates offline hard negative sample construction by dynamically retrieving Top-K candidate products for each query during training, while introducing a relevance LLM as a reward model to generate real-time feedback. Specifically, the retrieval model dynamically optimizes embedding representations through reinforcement learning, with reward signals combining LLM-generated relevance scores, product quality scores, and multi-way exclusivity metrics to achieve multi-objective user preference alignment and real-time error correction. This mechanism not only removes dependency on hard negatives but also mitigates the seesaw effect through collaborative multi-objective optimization, significantly enhancing the model's semantic generalization capability for complex long-tail queries. Extensive offline and online experiments validate the effectiveness of Retrieval-GRPO, which has been deployed on China's largest e-commerce platform.

Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

Authors: Qingyang Yan, Guangyao Chen, Yixiong Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13924
Pdf link: https://arxiv.org/pdf/2511.13924
Abstract Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate text. The code is publicly released; see the arXiv page for the link.
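
The gIoU reward mentioned in the abstract above builds on the generalized Intersection over Union between a predicted and a reference box. Below is a minimal sketch of the standard formula, gIoU = IoU - (|C| - |A ∪ B|) / |C|, where C is the smallest box enclosing both. The `(x1, y1, x2, y2)` box layout is an assumption, and how CuRPO turns this value into a reward or complexity indicator is not shown.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Overlap is zero when the boxes are disjoint along either axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0 or hull <= 0:
        raise ValueError("boxes must have positive area")
    return inter / union - (hull - union) / hull
```

Unlike plain IoU, which is 0 for every pair of disjoint boxes, gIoU goes negative as the boxes move apart, so it still gives a useful signal when a prediction misses the target entirely.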

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

Authors: Yule Liu, Heyi Zhang, Jinyi Zheng, Zhen Sun, Zifan Peng, Tianshuo Cong, Yilong Yang, Xinlei He, Zhuo Ma
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.14045
Pdf link: https://arxiv.org/pdf/2511.14045
Abstract Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat where leakage arises not from answer memorization. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher true-positive rate at a fixed low false-positive rate. We validate DIBA's superiority across multiple settings, including in-distribution, cross-dataset, cross-algorithm, and black-box scenarios, as well as extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
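
The AUC figure quoted in the abstract above measures how well an attack score separates training prompts (members) from unseen prompts (non-members). A minimal sketch of that metric follows, computed as the fraction of member and non-member pairs that the score ranks correctly. The function name and inputs are hypothetical, and the attack score itself (DIBA's behavioral signals) is not modeled.

```python
def membership_auc(member_scores, nonmember_scores):
    """AUC as the probability that a member outscores a non-member.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a value around 0.8 indicates a strong but imperfect membership signal.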

Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

Authors: Yiqing Shen, Chenjia Li, Mathias Unberath
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.14100
Pdf link: https://arxiv.org/pdf/2511.14100
Abstract Text-driven video editing enables users to modify video content only using text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine modifications, then outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity specifically for reasoning video editing. RIVER demonstrates best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.

Fair-GNE: Generalized Nash Equilibrium-Seeking Fairness in Multiagent Healthcare Automation

Authors: Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.14135
Pdf link: https://arxiv.org/pdf/2511.14135
Abstract Enforcing a fair workload allocation among multiple agents tasked to achieve an objective in learning-enabled demand-side healthcare worker settings is crucial for consistent and reliable performance at runtime. Existing multi-agent reinforcement learning (MARL) approaches steer fairness by shaping reward through post hoc orchestrations, leaving no certifiable self-enforceable fairness that is immutable by individual agents at runtime. Contextualized within a setting where each agent shares resources with others, we address this shortcoming with a learning-enabled optimization scheme among self-interested decision makers whose individual actions affect those of other agents. This extends the problem to a generalized Nash equilibrium (GNE) game-theoretic framework where we steer group policy to a safe and locally efficient equilibrium, so that no agent can improve its utility function by unilaterally changing its decisions. Fair-GNE models MARL as a constrained generalized Nash equilibrium-seeking (GNE) game, prescribing an ideal equitable collective equilibrium within the problem's natural fabric. Our hypothesis is rigorously evaluated in our custom-designed high-fidelity resuscitation simulator. Across all our numerical experiments, Fair-GNE achieves significant improvement in workload balance over fixed-penalty baselines (0.89 vs. 0.33 JFI, p < 0.01) while maintaining 86% task success, demonstrating statistically significant fairness gains through adaptive constraint enforcement. Our results communicate our formulations, evaluation metrics, and equilibrium-seeking innovations in large multi-agent learning-based healthcare systems with clarity and principled fairness enforcement.
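
The workload-balance figures above are reported as JFI. Assuming this denotes Jain's fairness index (the abstract does not expand the acronym), the metric is J = (Σx)^2 / (n Σx^2) over the per-agent workloads x. A minimal sketch with a hypothetical function name:

```python
def jain_fairness_index(workloads):
    """Jain's fairness index of non-negative per-agent workloads.

    Ranges from 1/n (one agent carries everything) to 1.0 (perfectly even).
    """
    if not workloads:
        raise ValueError("need at least one workload value")
    total = sum(workloads)
    sum_sq = sum(x * x for x in workloads)
    if sum_sq == 0:
        return 1.0  # all agents idle: treated as perfectly even
    return total * total / (len(workloads) * sum_sq)
```

Read this way, 0.33 means the work is concentrated on a minority of agents, while 0.89 is close to an even split.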

A Receding Horizon Reinforcement Learning Framework for Campus Chiller Energy Management - A case study from an Australian University

Authors: Laura Musgrave, Arnab Bhattacharjee, Tapan Kumar Saha
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.14160
Pdf link: https://arxiv.org/pdf/2511.14160
Abstract This work presents a case study of optimal energy management of a large Heating Ventilation and Cooling (HVAC) system within a university campus in Australia using Reinforcement Learning (RL). The HVAC system supplies nine university buildings with an annual average electricity consumption of about 2 GWh. Updated chiller Coefficient of Performance (COP) curves are identified, and a predictive building cooling demand model is developed using historical data from the HVAC system. Based on these inputs, a Proximal Policy Optimization based RL model is trained to optimally schedule the chillers in a receding horizon control framework with a priority reward function for constraint satisfaction. Compared to the traditional way of controlling the HVAC system based on a reactive rule-based method, the proposed controller saves up to 28% of the electricity consumed by simply controlling the mass flow rates of the chiller banks and with minimal constraint violations.

FreeMusco: Motion-Free Learning of Latent Control for Morphology-Adaptive Locomotion in Musculoskeletal Characters

Authors: Minkwan Kim, Yoonsang Lee
Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2511.14205
Pdf link: https://arxiv.org/pdf/2511.14205
Abstract We propose FreeMusco, a motion-free framework that jointly learns latent representations and control policies for musculoskeletal characters. By leveraging the musculoskeletal model as a strong prior, our method enables energy-aware and morphology-adaptive locomotion to emerge without motion data. The framework generalizes across human, non-human, and synthetic morphologies, where distinct energy-efficient strategies naturally appear, for example quadrupedal gaits in Chimanoid versus bipedal gaits in Humanoid. The latent space and corresponding control policy are constructed from scratch, without demonstration, and enable downstream tasks such as goal navigation and path following; to our knowledge, this is the first motion-free method to provide such capabilities. FreeMusco learns diverse and physically plausible locomotion behaviors through model-based reinforcement learning, guided by the locomotion objective that combines control, balancing, and biomechanical terms. To better capture the periodic structure of natural gait, we introduce the temporally averaged loss formulation, which compares simulated and target states over a time window rather than on a per-frame basis. We further encourage behavioral diversity by randomizing target poses and energy levels during training, enabling locomotion to be flexibly modulated in both form and intensity at runtime. Together, these results demonstrate that versatile and adaptive locomotion control can emerge without motion capture, offering a new direction for simulating movement in characters where data collection is impractical or impossible.

Parallelizing Tree Search with Twice Sequential Monte Carlo

Authors: Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14220
Pdf link: https://arxiv.org/pdf/2511.14220
Abstract Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
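
The variance and path-degeneracy issues mentioned in the abstract above are generic to Sequential Monte Carlo, and two textbook building blocks make them concrete: the effective sample size (ESS) of the particle weights and systematic resampling. The sketch below shows plain SMC machinery only, with hypothetical function names; it is not the TSMCTS algorithm.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; low ESS signals degeneracy."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

def systematic_resample(weights, u):
    """Systematic resampling with a single offset u in [0, 1).

    Returns the indices of the particles that survive, one per slot.
    """
    n = len(weights)
    total = sum(weights)
    indices = []
    j = 0
    cumulative = weights[0] / total
    for i in range(n):
        position = (u + i) / n  # evenly spaced pointers sharing one offset
        while position > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices
```

Each resampling step duplicates high-weight particles and drops low-weight ones, so after many steps the surviving particles tend to share a common ancestor. That is the path degeneracy that limits deeper search.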

Object-Centric World Models for Causality-Aware Reinforcement Learning

Authors: Yosuke Nishimoto, Takashi Matsubara
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14262
Pdf link: https://arxiv.org/pdf/2511.14262
Abstract World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose Slot Transformer Imagination with CAusality-aware reinforcement learning (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause-effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.

Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

Authors: Ante Wang, Weizhi Ma, Yang Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.14275
Pdf link: https://arxiv.org/pdf/2511.14275
Abstract Knowing the reliability of a model's response is essential in applications. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining it with chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.

MA-SLAM: Active SLAM in Large-Scale Unknown Environment using Map Aware Deep Reinforcement Learning

Authors: Yizhen Yin, Yuhua Qi, Dapeng Feng, Hongbo Chen, Hongjun Ma, Jin Wu, Yi Jiang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.14330
Pdf link: https://arxiv.org/pdf/2511.14330
Abstract Active Simultaneous Localization and Mapping (Active SLAM) involves the strategic planning and precise control of a robotic system's movement in order to construct a highly accurate and comprehensive representation of its surrounding environment, and it has garnered significant attention within the research community. While the current methods demonstrate efficacy in small and controlled settings, they face challenges when applied to large-scale and diverse environments, marked by extended periods of exploration and suboptimal paths of discovery. In this paper, we propose MA-SLAM, a Map-Aware Active SLAM system based on Deep Reinforcement Learning (DRL), designed to address the challenge of efficient exploration in large-scale environments. In pursuit of this objective, we put forward a novel structured map representation. By discretizing the spatial data and integrating the boundary points and the historical trajectory, the structured map succinctly and effectively encapsulates the visited regions, thereby serving as input for the deep reinforcement learning based decision module. Instead of sequentially predicting the next action step within the decision module, we have implemented an advanced global planner to optimize the exploration path by leveraging long-range target points. We conducted experiments in three simulation environments and deployed the system on a real unmanned ground vehicle (UGV). The results demonstrate that our approach significantly reduces both the duration and distance of exploration compared with state-of-the-art methods.

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Authors: Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.14427
Pdf link: https://arxiv.org/pdf/2511.14427
Abstract Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.

Achieving Safe Control Online through Integration of Harmonic Control Lyapunov-Barrier Functions with Unsafe Object-Centric Action Policies

Authors: Marlow Fawn (Tufts University), Matthias Scheutz (Tufts University)
Subjects: Robotics (cs.RO); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2511.14434
Pdf link: https://arxiv.org/pdf/2511.14434
Abstract We propose a method for combining Harmonic Control Lyapunov-Barrier Functions (HCLBFs) derived from Signal Temporal Logic (STL) specifications with any given robot policy to turn an unsafe policy into a safe one with formal guarantees. The two components are combined via HCLBF-derived safety certificates, thus producing commands that preserve both safety and task-driven behavior. We demonstrate the method with a simple proof-of-concept implementation: an object-centric, force-based policy, trained through reinforcement learning for a movement task on a stationary robot arm, avoids colliding with obstacles on a tabletop once it is combined with the safety constraints. The proposed method can be generalized to more complex specifications and dynamic task settings.
中文摘要 我们提出一种方法，将由信号时序逻辑（STL）规范导出的谐波控制李雅普诺夫-障碍函数（HCLBF）与任意给定的机器人策略结合起来，把不安全的策略转变为具有形式化保证的安全策略。这两个组件通过由HCLBF导出的安全证书相结合，从而生成既保持安全性又保留任务驱动行为的指令。我们通过一个简单的概念验证实现进行了演示：一个以对象为中心、基于力的策略经强化学习训练，用于固定机械臂的移动任务；在将该策略与安全约束结合后，机械臂能够避免与桌面上的障碍物碰撞。所提出的方法可以推广到更复杂的规范和动态任务设置。

Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

Agent-R1：通过端到端强化学习训练强大的LLM代理

Authors: Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, Enhong Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.14460
Pdf link: https://arxiv.org/pdf/2511.14460
Abstract Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
中文摘要 大型语言模型（LLMs）正日益被探索用于构建能够主动与环境交互（例如通过工具使用）以解决复杂问题的代理。强化学习（RL）被认为是训练此类代理具有巨大潜力的关键技术;然而，强化学习在LLM代理上的有效应用仍处于起步阶段，面临诸多挑战。目前，这一新兴领域缺乏针对LLM代理环境专门设计的强化学习方法的深入探索，同时也缺乏为此目的设计的灵活且易于扩展的训练框架。为推动该领域发展，本文首先通过系统扩展马尔可夫决策过程（MDP）框架，全面定义LLM代理的关键组成部分，重新审视并澄清了LLM代理的强化学习方法论。其次，我们介绍Agent-R1，一个模块化、灵活且用户友好的基于强化学习的大型语言模型代理训练框架，旨在简单地适应多样化的任务场景和交互环境中。我们对多跳QA基准任务进行了实验，初步验证了我们提出的方法和框架的有效性。

Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Masked IRL：基于演示和语言的LLM引导奖励消歧

Authors: Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14565
Pdf link: https://arxiv.org/pdf/2511.14565
Abstract Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. A project page and code are linked from the arXiv listing.
中文摘要 机器人可以通过从演示中学习奖励函数来适应用户偏好，但在数据有限时，奖励模型常常对虚假相关过拟合，难以泛化。这是因为演示向机器人展示了如何完成任务，却没有说明该任务中什么是重要的，导致模型关注无关的状态细节。自然语言可以更直接地指定机器人应关注的内容，原则上还能在与演示一致的多种奖励函数之间消除歧义。然而，现有的语言条件奖励学习方法通常将指令视为简单的条件信号，未能充分发挥其消除歧义的潜力。此外，真实指令本身往往存在歧义，因此朴素的条件化并不可靠。我们的关键见解是，这两种输入类型携带互补的信息：演示展示如何行动，而语言则指定什么是重要的。我们提出了掩码逆强化学习（Masked Inverse Reinforcement Learning，简称Masked IRL），这是一种利用大型语言模型（LLMs）结合两种输入类型优势的框架。Masked IRL从语言指令中推断状态相关性掩码，并强制对无关状态分量保持不变性。当指令含糊时，它会利用LLM推理在演示的语境中澄清指令。在仿真和真实机器人上，Masked IRL的性能比以往的语言条件IRL方法高出多达15%，同时使用的数据减少多达4.7倍，展示了更好的样本效率、泛化能力以及对歧义语言的鲁棒性。项目页面和代码的链接见arXiv页面。
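
To make the masking idea concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes a linear reward over a small state vector and a hard-coded binary relevance mask, whereas the abstract says the masks are inferred from language instructions. The names `masked_reward`, `weights`, and `mask`, and the four state components, are hypothetical.

```python
def masked_reward(state, weights, mask):
    """Linear reward that only sees state components whose mask entry is 1."""
    return sum(w * s * m for w, s, m in zip(weights, state, mask))

# Hypothetical state layout: [distance_to_cup, gripper_height, table_color, lighting]
weights = [-1.0, 0.5, 2.0, -3.0]
mask = [1, 1, 0, 0]  # assumed: the instruction marks only the first two components as relevant

state_a = [0.2, 0.4, 0.9, 0.1]
state_b = [0.2, 0.4, 0.1, 0.8]  # differs from state_a only in masked-out components

# The rewards agree because masked-out components cannot influence the output.
print(masked_reward(state_a, weights, mask) == masked_reward(state_b, weights, mask))  # True
```

With the mask applied, changing the table color or lighting leaves the reward unchanged, which is the invariance to irrelevant state components that the abstract describes.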

ReflexGrad: Three-Way Synergistic Architecture for Zero-Shot Generalization in LLM Agents

ReflexGrad：用于LLM智能体零样本泛化的三路协同架构

Authors: Ankush Kadu, Ashwanth Krishnan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14584
Pdf link: https://arxiv.org/pdf/2511.14584
Abstract Enabling agents to learn from experience and generalize across diverse tasks without task-specific training remains a fundamental challenge in reinforcement learning and decision-making. While recent approaches have explored episodic memory (Reflexion), gradient-based prompt optimization (TextGrad), and hierarchical task decomposition independently, their potential for synergistic integration remains unexplored. We introduce ReflexGrad, a novel architecture that tightly couples three complementary mechanisms: (1) LLM-based hierarchical TODO decomposition for strategic planning, (2) history-aware causal reflection that analyzes recent action patterns to identify failure root causes and enable within-trial learning, and (3) gradient-based optimization for systematic improvement. Unlike prior work relying on few-shot demonstrations, our system achieves true zero-shot generalization through pure LLM semantic reasoning, requiring no task-specific examples, fine-tuning, or hardcoded similarity metrics. Evaluated on ALFWorld benchmark tasks, ReflexGrad demonstrates 67% zero-shot success rate on Trial 0 without any prior task experience or demonstrations, establishing effective performance on first exposure. Through empirical analysis, we identify the architectural mechanisms underlying stable convergence (zero action loops) and effective cross-task transfer (67% to 78% improvement). Our work demonstrates that synergistic integration of complementary learning mechanisms enables robust zero-shot generalization that approaches few-shot baselines from prior work.
中文摘要 使智能体能够从经验中学习，并在无需针对特定任务训练的情况下跨多样化任务泛化，仍然是强化学习和决策中的根本挑战。虽然近期方法分别探索了情节记忆（Reflexion）、基于梯度的提示优化（TextGrad）和层级任务分解，但它们协同整合的潜力尚未被探索。我们提出了ReflexGrad，一种新颖架构，紧密耦合了三种互补机制：（1）基于LLM的层级TODO分解，用于战略规划；（2）历史感知的因果反思，分析近期动作模式以识别失败根源并实现试验内学习；（3）基于梯度的优化，用于系统性改进。与以往依赖少样本演示的工作不同，我们的系统通过纯粹的LLM语义推理实现了真正的零样本泛化，无需任务特定示例、微调或硬编码的相似度指标。在ALFWorld基准任务评估中，ReflexGrad在没有任何先前任务经验或演示的情况下，在试验0上达到67%的零样本成功率，表明其在首次接触任务时即有有效表现。通过实证分析，我们识别了支撑稳定收敛（零动作循环）和有效跨任务迁移（从67%提升至78%）的架构机制。我们的研究表明，互补学习机制的协同整合能够实现稳健的零样本泛化，其性能接近以往工作的少样本基线。

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Seer：在线上下文学习，用于快速同步LLM强化学习

Authors: Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, Mingxing Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.14617
Pdf link: https://arxiv.org/pdf/2511.14617
Abstract Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.
中文摘要 强化学习（RL）已成为推动现代大型语言模型（LLM）发展的关键，但现有的同步强化学习系统仍面临严重的性能瓶颈。rollout（轨迹生成）阶段主导了端到端迭代时间，但由于固有的工作负载不平衡，存在显著的长尾延迟和较低的资源利用率。我们提出了 Seer，一种新型在线上下文学习系统，通过利用共享同一提示的请求之间此前被忽视的输出长度和生成模式相似性，解决了这些挑战。Seer 引入了三种关键技术：用于动态负载均衡的分段 rollout、上下文感知调度和自适应分组推测解码。这些机制共同显著降低了长尾延迟，并提高了 rollout 过程中的资源效率。对生产级强化学习工作负载的评估显示，与最先进的同步强化学习系统相比，Seer 将端到端 rollout 吞吐量提升了74%至97%，将长尾延迟降低了75%至93%，显著加快了强化学习训练的迭代速度。

Failure to Mix: Large language models struggle to answer according to desired probability distributions

混合失败：大型语言模型难以根据期望的概率分布回答

Authors: Ivy Yuqian Yang, David Yu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14630
Pdf link: https://arxiv.org/pdf/2511.14630
Abstract Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step function-like behavior of near-exclusively generating the output with marginally highest probability overrules even strong built-in LLM biases.
中文摘要 科学思想的生成和选择需要按照目标概率分布进行探索。相比之下，当前的人工智能基准测试有客观正确的答案，而针对这些基准通过强化学习训练大型语言模型（LLMs）会抑制概率性探索。我们进行了系统实验，要求LLM按照简单的概率分布产生输出，结果发现所有被测试的现代LLM都严重偏离这些分布。例如，要求以49%的概率输出二值结果“1”时，模型几乎100%的回答是“0”。这种近乎只生成概率略高的输出的阶跃函数式行为，甚至会压倒LLM很强的内置偏差。
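
As a rough illustration of the gap the abstract describes, the Python sketch below contrasts a calibrated sampler, which returns "1" about 49% of the time, with a step-function responder that always returns the marginally more likely answer. The function names and the 10,000-draw sample size are assumptions made for this example; nothing here is taken from the paper's experimental setup, and no LLM is queried.

```python
import random

def calibrated_answer(p_one: float, rng: random.Random) -> str:
    """Return "1" with probability p_one, otherwise "0"."""
    return "1" if rng.random() < p_one else "0"

def step_function_answer(p_one: float) -> str:
    """Always return the more probable answer, mimicking the reported failure mode."""
    return "1" if p_one > 0.5 else "0"

def fraction_of_ones(answers):
    """Share of answers equal to "1"."""
    return sum(a == "1" for a in answers) / len(answers)

rng = random.Random(0)  # fixed seed so repeated runs match
n = 10_000              # assumed sample size
calibrated = [calibrated_answer(0.49, rng) for _ in range(n)]
collapsed = [step_function_answer(0.49) for _ in range(n)]

print(fraction_of_ones(calibrated))  # close to 0.49
print(fraction_of_ones(collapsed))   # 0.0
```

The second list never contains "1", even though the requested rate is only slightly below one half, which is the step-function behavior the abstract reports.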

Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

配电系统恢复的异构多智能体近端策略优化

Authors: Parya Dolatyabi, Mahdi Khodayar
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14730
Pdf link: https://arxiv.org/pdf/2511.14730
Abstract Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.
中文摘要 大规模停电后恢复配电系统（PDS）需要顺序开关操作，以重新配置馈线拓扑，并在功率平衡、电压限制和热额定值等非线性约束下协调分布式能源资源（DER）。这些挑战使得传统优化方法和基于价值的强化学习方法计算效率低下且难以扩展。本文采用异构智能体强化学习（HARL）框架，并通过异构智能体近端策略优化（HAPPO）加以实例化，以实现跨互联微电网的协调恢复。每个智能体控制一个独立的微电网，各微电网具有不同的负载、DER容量和开关数量，从而引入了实际的结构异质性。去中心化的行动者（actor）策略借助中心化的评论家（critic）进行训练，后者计算优势值以实现稳定的同策略更新。基于物理的OpenDSS环境提供完整的潮流反馈，并通过可微分的惩罚信号而非无效动作掩码来强制执行运行限值。DER总发电量上限为2400千瓦，每个微电网必须满足本地供需可行性。在IEEE 123节点和IEEE 8500节点系统上的实验表明，与DQN、PPO、MAES、MAGDPG、MADQN、平均场强化学习和QMIX相比，HAPPO实现了更快的收敛、更高的恢复功率和更平滑的多随机种子训练。结果表明，在HARL框架中融入微电网级异质性，能够为复杂的PDS恢复提供可扩展、稳定且约束感知的解决方案。
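
The contrast between penalty signals and invalid action masking can be sketched in a few lines of Python. Only the 2400 kW cap comes from the abstract; the squared-hinge form of the penalty, the `penalty_weight` value, the reward shape, and all function names are assumptions chosen for illustration, not the paper's formulation.

```python
DER_CAP_KW = 2400.0  # total DER generation cap stated in the abstract

def cap_penalty(der_outputs_kw, penalty_weight=0.01):
    """Squared-hinge penalty: zero within the cap, growing smoothly above it."""
    excess = max(0.0, sum(der_outputs_kw) - DER_CAP_KW)
    return penalty_weight * excess ** 2

def shaped_reward(restored_power_kw, der_outputs_kw):
    """Restored power minus the cap penalty (an assumed reward shape)."""
    return restored_power_kw - cap_penalty(der_outputs_kw)

print(cap_penalty([800.0, 900.0, 600.0]))  # 0.0 (total 2300 kW is within the cap)
print(cap_penalty([900.0, 900.0, 700.0]))  # 100.0 (total 2500 kW is 100 kW over the cap)
```

Under this kind of shaping the agent may still select an over-cap action but receives a lower reward for it, whereas action masking would remove that action from the choice set altogether.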

$π^{*}_{0.6}$: a VLA That Learns From Experience

$π^{*}_{0.6}$：从经验中学习的VLA

Authors: Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.14759
Pdf link: https://arxiv.org/pdf/2511.14759
Abstract We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $\pi^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $\pi^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
中文摘要 我们研究视觉-语言-动作（VLA）模型如何借助强化学习（RL），通过现实世界中的部署不断改进。我们提出了一种通用方法，即通过优势条件化策略利用经验与纠正的强化学习（RECAP），它通过优势条件化实现对VLA的强化学习训练。我们的方法将异构数据纳入自我改进过程，包括演示、同策略采集的数据以及自主执行期间由专家提供的远程操作干预。RECAP首先用离线强化学习预训练一个通用VLA，我们称之为$\pi^{*}_{0.6}$，随后可以通过在机器人上采集数据对其进行专门化，从而在下游任务上获得高性能。我们展示了用完整RECAP方法训练的$\pi^{*}_{0.6}$模型可以在真实家庭中叠衣服、可靠地组装箱子，并使用专业意式咖啡机制作浓缩咖啡饮品。在一些最难的任务上，RECAP使任务吞吐量提升一倍以上，并使任务失败率大致减半。

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

UniGen-1.5：通过奖励统一增强强化学习中的图像生成与编辑

Authors: Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, Afshin Dehghan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.14760
Pdf link: https://arxiv.org/pdf/2511.14760
Abstract We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
中文摘要 我们介绍UniGen-1.5，一个统一的多模态大型语言模型（MLLM），用于高级图像理解、生成和编辑。在UniGen的基础上，我们全面改进了模型架构和训练流程，以增强图像理解和生成能力，同时解锁强大的图像编辑能力。特别地，我们提出了一种统一的强化学习（RL）策略，通过共享奖励模型同时提升图像生成和图像编辑的性能。为了进一步提升图像编辑性能，我们提出了一个轻量级的编辑指令对齐阶段，显著提升了对编辑指令的理解，而这对强化学习训练的成功至关重要。实验结果显示，UniGen-1.5展现出有竞争力的理解和生成性能。具体来说，UniGen-1.5在GenEval和ImgEdit上的总得分分别为0.89和4.31，超越了BAGEL等最先进模型，并达到与GPT-Image-1等专有模型相当的性能。

Keyword: diffusion policy

Coffee: Controllable Diffusion Fine-tuning

Coffee：可控扩散微调

Authors: Ziyao Zeng, Jingcheng Ni, Ruyi Liu, Alex Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.14113
Pdf link: https://arxiv.org/pdf/2511.14113
Abstract Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. Such control is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee, which allows undesired concepts to be specified in language in order to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
中文摘要 文本到图像扩散模型能够根据灵活的提示生成多样化内容，因此非常适合利用用户提供的少量数据通过微调进行定制。然而，如何实现可控微调，即防止模型学习微调数据中存在的不需要的概念，并防止这些概念与用户提示纠缠在一起，仍是一个开放的挑战。这对于偏见缓解、防止恶意适配、属性解耦以及扩散策略的可泛化微调等下游任务至关重要。我们提出了Coffee，它允许使用语言指定不需要的概念，以对适配过程进行正则化。我们方法的核心在于防止用户提示的嵌入与不需要的概念对齐。关键的是，Coffee无需额外训练，并且可以通过修改文本描述灵活地修改不需要的概念。我们通过在与用户提示相关联、且与不需要的概念成对出现的图像上进行微调来评估Coffee。实验结果表明，Coffee能够防止文本到图像模型在微调过程中学习到指定的不需要的概念，并且优于现有方法。代码将在论文被接收后发布。
