Arxiv Papers of Today

Generated: 2025-11-07 16:30:08 (UTC+8); Arxiv release time: 2025-11-07 20:00 EST (2025-11-08 09:00 UTC+8)

27 relevant papers today

Keyword: reinforcement learning

Scaling Agent Learning via Experience Synthesis

Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.03773
Pdf link: https://arxiv.org/pdf/2511.03773
Abstract While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.

From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification

Authors: Lipeng Zu, Hansong Zhou, Xiaonan Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03828
Pdf link: https://arxiv.org/pdf/2511.03828
Abstract Transitioning from offline to online reinforcement learning (RL) poses critical challenges due to distributional shifts between the fixed behavior policy in the offline dataset and the evolving policy during online learning. Although this issue is widely recognized, few methods attempt to explicitly assess or utilize the distributional structure of the offline data itself, leaving a research gap in adapting learning strategies to different types of samples. To address this challenge, we propose an innovative method, Energy-Guided Diffusion Stratification (StratDiff), which facilitates smoother transitions in offline-to-online RL. StratDiff deploys a diffusion model to learn prior knowledge from the offline dataset. It then refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. The KL divergence between the generated action and the corresponding sampled action is computed for each sample and used to stratify the training batch into offline-like and online-like subsets. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies. We demonstrate the effectiveness of StratDiff by integrating it with off-the-shelf methods Cal-QL and IQL. Extensive empirical evaluations on D4RL benchmarks show that StratDiff significantly outperforms existing methods, achieving enhanced adaptability and more stable performance across diverse RL settings.
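The batch stratification step described in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the per-sample divergence scores are assumed to be precomputed, low divergence is assumed to mean "offline-like", and the function name `stratify_batch` and the median threshold are hypothetical choices.

```python
from statistics import median


def stratify_batch(divergences, threshold=None):
    """Split sample indices into offline-like and online-like subsets.

    divergences: one non-negative score per sample (assumed precomputed,
        e.g. a KL-style divergence between generated and sampled actions).
    threshold: hypothetical cut-off; defaults to the batch median.
    """
    if threshold is None:
        threshold = median(divergences)
    # Assumption: a small divergence means the sampled action resembles
    # the offline-like action produced by the generative prior.
    offline_like = [i for i, d in enumerate(divergences) if d <= threshold]
    online_like = [i for i, d in enumerate(divergences) if d > threshold]
    return offline_like, online_like


scores = [0.05, 0.90, 0.10, 1.40, 0.30, 0.70]
offline_idx, online_idx = stratify_batch(scores)
print("offline-like:", offline_idx)
print("online-like:", online_idx)
```

Each subset would then be passed to its own update rule (an offline objective for the first, an online objective for the second); those update rules are outside the scope of this sketch.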

RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods

Authors: Raghav Sharma, Manan Mehta, Sai Tiger Raina
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.03939
Pdf link: https://arxiv.org/pdf/2511.03939
Abstract Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. To systematically explore these domains, we first review foundational algorithms, including PPO, DPO, and GRPO, before presenting a detailed analysis of the latest innovations. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.

Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.03943
Pdf link: https://arxiv.org/pdf/2511.03943
Abstract Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% [email protected] improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% [email protected] at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.

Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots

Authors: Yushi Wang, Changsheng Luo, Penghui Chen, Jianran Liu, Weijian Sun, Tong Guo, Kechang Yang, Biao Hu, Yangang Zhang, Mingguo Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.03996
Pdf link: https://arxiv.org/pdf/2511.03996
Abstract Humanoid soccer poses a representative challenge for embodied intelligence, requiring robots to operate within a tightly coupled perception-action loop. However, existing systems typically rely on decoupled modules, resulting in delayed responses and incoherent behaviors in dynamic environments, while real-world perceptual limitations further exacerbate these issues. In this work, we present a unified reinforcement learning-based controller that enables humanoid robots to acquire reactive soccer skills through the direct integration of visual perception and motion control. Our approach extends Adversarial Motion Priors to perceptual settings in real-world dynamic environments, bridging motion imitation and visually grounded dynamic control. We introduce an encoder-decoder architecture combined with a virtual perception system that models real-world visual characteristics, allowing the policy to recover privileged states from imperfect observations and establish active coordination between perception and action. The resulting controller demonstrates strong reactivity, consistently executing coherent and robust soccer behaviors across various scenarios, including real RoboCup matches.

Necessary and Sufficient Conditions for the Optimization-Based Concurrent Execution of Learned Robotic Tasks

Authors: Sheikh A. Tahmid, Gennaro Notomista
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.04054
Pdf link: https://arxiv.org/pdf/2511.04054
Abstract In this work, we consider the problem of executing multiple tasks encoded by value functions, each learned through Reinforcement Learning, using an optimization-based framework. Prior work developed such a framework but left unanswered the fundamental question of when learned value functions can be concurrently executed. The main contribution of this work is to present theorems which provide necessary and sufficient conditions to concurrently execute sets of learned tasks within subsets of the state space, using a previously proposed min-norm controller. These theorems provide insight into when learned control tasks can be made concurrently executable, when they might already be inherently concurrently executable, and when it is not possible at all to make a set of learned tasks concurrently executable using the previously proposed methods. Additional contributions of this work include extending the optimization-based framework for executing multiple tasks encoded by value functions to also account for value functions trained with a discount factor, making the overall framework more compatible with standard RL practices.

CBMC-V3: A CNS-inspired Control Framework Towards Manipulation Agility with SNN

Authors: Yanbo Pang, Qingkai Li, Mingguo Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04109
Pdf link: https://arxiv.org/pdf/2511.04109
Abstract As robotic arm applications extend beyond industrial settings into healthcare, service, and daily life, existing control algorithms struggle to achieve the agile manipulation required for complex environments with dynamic trajectories, unpredictable interactions, and diverse objects. This paper presents a biomimetic control framework based on Spiking Neural Networks (SNN), inspired by the human Central Nervous System (CNS), to achieve agile control in such environments. The proposed framework features five control modules (cerebral cortex, cerebellum, thalamus, brainstem, spinal cord), three hierarchical control levels (first-order, second-order, third-order), and two information pathways (ascending, descending). Each module is fully implemented using SNN. The spinal cord module uses spike encoding and Leaky Integrate-and-Fire (LIF) neurons for feedback control. The brainstem module employs a network of LIF and non-spiking LIF neurons to dynamically adjust spinal cord parameters via reinforcement learning. The thalamus module similarly adjusts the cerebellum's torque outputs. The cerebellum module uses a recurrent SNN to learn the robotic arm's dynamics through regression, providing feedforward gravity compensation torques. The framework is validated both in simulation and on a real-world robotic arm platform under various loads and trajectories. Results demonstrate that our method outperforms the industrial-grade position control in manipulation agility.
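For readers unfamiliar with the Leaky Integrate-and-Fire (LIF) neuron mentioned in this abstract, the following is a generic textbook-style discrete-time sketch. It is not the paper's spinal cord module: the function name `simulate_lif`, the Euler update, and all parameter values are illustrative assumptions.

```python
def simulate_lif(currents, dt=1.0, tau=10.0, v_rest=0.0,
                 v_th=1.0, v_reset=0.0, resistance=1.0):
    """Simulate a single LIF neuron with a forward-Euler update.

    currents: input current per time step.
    Returns a list of 0/1 spike indicators, one per time step.
    All parameter values are illustrative, not taken from the paper.
    """
    v = v_rest
    spikes = []
    for i_in in currents:
        # Leak toward the resting potential while integrating the input.
        v += (dt / tau) * (-(v - v_rest) + resistance * i_in)
        if v >= v_th:
            spikes.append(1)  # threshold crossed: emit a spike
            v = v_reset       # and reset the membrane potential
        else:
            spikes.append(0)
    return spikes


strong = simulate_lif([2.0] * 20)  # drive above threshold: periodic spiking
weak = simulate_lif([0.5] * 20)    # steady state stays below threshold
print("spikes with strong input:", sum(strong))
print("spikes with weak input:", sum(weak))
```

With these illustrative parameters, a constant input whose steady-state potential exceeds the threshold produces regular spiking, while a weaker input never fires.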

RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Authors: Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.04120
Pdf link: https://arxiv.org/pdf/2511.04120
Abstract Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
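As background for the Item Response Theory (IRT) component, here is a minimal sketch of the standard two-parameter logistic (2PL) item response function. The abstract does not state which IRT variant RIDE uses, so the 2PL form, the function name `p_correct`, and the example values are assumptions for illustration only.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """2PL IRT: probability that a respondent answers an item correctly.

    ability: latent skill of the respondent (here, a simulated student).
    difficulty: latent difficulty of the question.
    discrimination: how sharply the item separates abilities.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical abilities for three simulated students.
abilities = [-1.0, 0.0, 1.5]
easy = [p_correct(a, difficulty=-1.0) for a in abilities]
hard = [p_correct(a, difficulty=2.0) for a in abilities]
print("easy item:", [round(p, 3) for p in easy])
print("hard item:", [round(p, 3) for p in hard])
```

Under this model, a rewritten question counts as harder when its fitted difficulty parameter increases, which lowers the predicted success probability for every respondent.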

BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

Authors: Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto, Andrea Tirinzoni, Haoyang Weng, Kris Kitani, Mateusz Guzek, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta, Guanya Shi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04131
Pdf link: https://arxiv.org/pdf/2511.04131
Abstract Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either exclusively deployed on simulated humanoid characters, or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space in BFM-Zero enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world, via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward optimization, and few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. Those key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control.

Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

Authors: Jiaming Zhang, Yujie Yang, Haoning Wang, Liping Zhang, Shengbo Eben Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.04147
Pdf link: https://arxiv.org/pdf/2511.04147
Abstract Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
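The exchange rule described in this abstract (add constraints whose violation exceeds the tolerance, then drop constraints with zero Lagrange multipliers after the policy update) can be sketched on plain sets. This is only an illustration of the bookkeeping: the helper names `expand` and `prune`, the numerical `eps`, and the example values are hypothetical, and the safe RL subproblem solver itself is not shown.

```python
def expand(active, violations, tolerance):
    """Add every candidate constraint whose violation exceeds the tolerance.

    violations: dict mapping constraint id -> measured violation
        (assumed to come from evaluating the current policy).
    """
    return set(active) | {c for c, v in violations.items() if v > tolerance}


def prune(active, multipliers, eps=1e-8):
    """Keep only constraints with a non-zero Lagrange multiplier.

    multipliers: dict mapping constraint id -> multiplier obtained from
        the policy update; eps is a hypothetical numerical zero.
    """
    return {c for c in active if multipliers.get(c, 0.0) > eps}


working = {"c1", "c2"}
violations = {"c1": 0.0, "c2": 0.0, "c3": 0.30, "c4": 0.01}
working = expand(working, violations, tolerance=0.05)

# Multipliers as they might look after solving the finite subproblem.
multipliers = {"c1": 0.0, "c2": 0.7, "c3": 1.2}
working = prune(working, multipliers)
print(sorted(working))
```

Alternating the two steps keeps the working set finite: it grows only where the current policy is unsafe and shrinks where a constraint has become inactive.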

PUL-SLAM: Path-Uncertainty Co-Optimization with Lightweight Stagnation Detection for Efficient Robotic Exploration

Authors: Yizhen Yin, Dapeng Feng, Hongbo Chen, Yuhua Qi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04180
Pdf link: https://arxiv.org/pdf/2511.04180
Abstract Existing Active SLAM methodologies face issues such as slow exploration speed and suboptimal paths. To address these limitations, we propose a hybrid framework combining a Path-Uncertainty Co-Optimization Deep Reinforcement Learning framework and a Lightweight Stagnation Detection mechanism. The Path-Uncertainty Co-Optimization framework jointly optimizes travel distance and map uncertainty through a dual-objective reward function, balancing exploration and exploitation. The Lightweight Stagnation Detection reduces redundant exploration through Lidar Static Anomaly Detection and Map Update Stagnation Detection, terminating episodes on low expansion rates. Experimental results show that compared with the frontier-based method and RRT method, our approach shortens exploration time by up to 65% and reduces path distance by up to 42%, significantly improving exploration efficiency in complex environments while maintaining reliable map completeness. Ablation studies confirm that the collaborative mechanism accelerates training convergence. Empirical validation on a physical robotic platform demonstrates the algorithm's practical applicability and its successful transferability from simulation to real-world environments.

Black-Box Guardrail Reverse-engineering Attack

Authors: Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.04215
Pdf link: https://arxiv.org/pdf/2511.04215
Abstract Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and demonstrate that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These findings underscore the practical feasibility of guardrail extraction and highlight significant security risks for current LLM safety mechanisms. Our findings expose critical vulnerabilities in current guardrail designs and highlight the urgent need for more robust defense mechanisms in LLM deployment.

Opus: A Quantitative Framework for Workflow Evaluation

Authors: Alan Seroul, Théo Fagnoni, Inès Adnani, Dana O. Mohamed, Phillip Kingston
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2511.04220
Pdf link: https://arxiv.org/pdf/2511.04220
Abstract This paper introduces the Opus Workflow Evaluation Framework, a probabilistic-normative formulation for quantifying Workflow quality and efficiency. It integrates notions of correctness, reliability, and cost into a coherent mathematical model that enables direct comparison, scoring, and optimization of Workflows. The framework combines the Opus Workflow Reward, a probabilistic function estimating expected performance through success likelihood, resource usage, and output gain, with the Opus Workflow Normative Penalties, a set of measurable functions capturing structural and informational quality across Cohesion, Coupling, Observability, and Information Hygiene. It supports automated Workflow assessment, ranking, and optimization within modern automation systems such as Opus and can be integrated into Reinforcement Learning loops to guide Workflow discovery and refinement. In this paper, we introduce the Opus Workflow Reward model that formalizes Workflow success as a probabilistic expectation over costs and outcomes. We define measurable Opus Workflow Normative Penalties capturing structural, semantic, and signal-related properties of Workflows. Finally, we propose a unified optimization formulation for identifying and ranking optimal Workflows under joint Reward-Penalty trade-offs.

Shared Spatial Memory Through Predictive Coding

Authors: Zhengru Fang, Yu Guo, Jingjing Wang, Yuang Zhang, Haonan An, Yinhai Wang, Yuguang Fang
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2511.04235
Pdf link: https://arxiv.org/pdf/2511.04235
Abstract Sharing and reconstructing a consistent spatial memory is a critical challenge in multi-agent systems, where partial observability and limited bandwidth often lead to catastrophic failures in coordination. We introduce a multi-agent predictive coding framework that formulates coordination as the minimization of mutual uncertainty among agents. Instantiated as an information bottleneck objective, it prompts agents to learn not only who and what to communicate but also when. At the foundation of this framework lies a grid-cell-like metric as internal spatial coding for self-localization, emerging spontaneously from self-supervised motion prediction. Building upon this internal spatial code, agents gradually develop a bandwidth-efficient communication mechanism and specialized neural populations that encode partners' locations: an artificial analogue of hippocampal social place cells (SPCs). These social representations are further enacted by a hierarchical reinforcement learning policy that actively explores to reduce joint uncertainty. On the Memory-Maze benchmark, our approach shows exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full-broadcast baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically principled and biologically plausible basis for how complex social representations emerge from a unified predictive drive, leading to social collective intelligence.

Can Context Bridge the Reality Gap? Sim-to-Real Transfer of Context-Aware Policies

Authors: Marco Iannotta, Yuxuan Yang, Johannes A. Stork, Erik Schaffernicht, Todor Stoyanov
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04249
Pdf link: https://arxiv.org/pdf/2511.04249
Abstract Sim-to-real transfer remains a major challenge in reinforcement learning (RL) for robotics, as policies trained in simulation often fail to generalize to the real world due to discrepancies in environment dynamics. Domain Randomization (DR) mitigates this issue by exposing the policy to a wide range of randomized dynamics during training, but at the cost of reduced performance. While standard approaches typically train policies agnostic to these variations, we investigate whether sim-to-real transfer can be improved by conditioning the policy on an estimate of the dynamics parameters -- referred to as context. To this end, we integrate a context estimation module into a DR-based RL framework and systematically compare SOTA supervision strategies. We evaluate the resulting context-aware policies in both a canonical control benchmark and a real-world pushing task using a Franka Emika Panda robot. Results show that context-aware policies outperform the context-agnostic baseline across all settings, although the best supervision strategy depends on the task.

SSPO: Subsentence-level Policy Optimization

Authors: Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.04256
Pdf link: https://arxiv.org/pdf/2511.04256
Abstract As a significant part of the post-training of Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs' reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This makes it sensitive to outliers, which can lead to model training collapse. GSPO instead computes a response-level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded and reducing the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents all tokens of a response from being discarded by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO's effectiveness in leveraging generated data by retaining the strengths of GSPO while avoiding its shortcomings.
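The difference between token-level, response-level, and sentence-level importance ratios can be illustrated with a small sketch. The abstract does not give SSPO's exact formula, so the segment ratio below (the geometric mean of token ratios inside each segment, by analogy with a length-normalized response-level ratio) is an assumption, as are the function name `segment_ratios` and the example log-probabilities.

```python
import math


def segment_ratios(logp_new, logp_old, boundaries):
    """Per-token importance ratios shared within each segment.

    logp_new, logp_old: per-token log-probabilities under the new and
        old policies.
    boundaries: strictly increasing exclusive end indices of segments;
        the last entry must equal the number of tokens.
    Assumption: a segment's ratio is the geometric mean of its token ratios.
    """
    ratios = []
    start = 0
    for end in boundaries:
        diffs = [n - o for n, o in zip(logp_new[start:end], logp_old[start:end])]
        seg_ratio = math.exp(sum(diffs) / len(diffs))
        ratios.extend([seg_ratio] * (end - start))
        start = end
    return ratios


new = [-1.0, -2.0, -0.5, -0.5]
old = [-1.0, -1.0, -1.0, -1.0]
token_level = segment_ratios(new, old, [1, 2, 3, 4])  # one segment per token
sentence_level = segment_ratios(new, old, [2, 4])     # two segments
response_level = segment_ratios(new, old, [4])        # whole response
print([round(r, 3) for r in token_level])
print([round(r, 3) for r in sentence_level])
print([round(r, 3) for r in response_level])
```

In this toy example the outlier token affects only its own segment at the sentence level, whereas at the response level it shifts the single ratio shared by every token.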
中文摘要 作为大型语言模型（LLM）后训练的重要组成部分，可验证奖励强化学习（RLVR）极大地提高了LLM的推理能力。然而，一些RLVR算法，如GRPO（Group Relative Policy Optimization）和GSPO（Group Sequence Policy Optimization），分别存在策略更新不稳定和采样数据使用率低的问题。GRPO 的重要性比是在代币层面计算的，更侧重于优化单个代币。这很容易受到异常值的影响，导致模型训练崩溃。GSPO提出了响应水平重要性比的计算，解决了GRPO重要性比计算中存在的高方差和训练噪声累积的问题。然而，由于所有响应标记都具有共同的重要性比，极值很容易提高或降低总体均值，导致整个响应被错误地丢弃，从而导致采样数据的利用率下降。本文介绍了SSPO，它应用了句子级重要性比，在GRPO和GSPO之间取得了平衡。SSPO 不仅避免了训练崩溃和高方差，而且防止了整个响应 token 被裁剪机制放弃。此外，我们将句子熵应用于 PPO-CLIP 以稳定地调整裁剪边界，鼓励高熵标记探索和缩小低熵标记的裁剪范围。特别是，SSPO 在 5 个数据集中的平均得分为 46.57，超过了 GRPO（43.01）和 GSPO（44.42），并在三个数据集上赢得了最先进的性能。这些结果凸显了 SSPO 通过采用 GSPO 的精髓但拒绝其缺点来利用生成数据的有效性。
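
The difference between token-level, sentence-level, and response-level importance ratios described above can be illustrated with a small sketch. The abstract does not give SSPO's formula, so the sentence-level ratio here is an assumption: the geometric mean of token ratios within each sentence, by analogy with a length-normalized response-level ratio. The function names, the sentence segmentation, and the simplified "outside the clip range" test are all hypothetical.

```python
import math

def importance_ratios(logp_new, logp_old, sentence_ids):
    """Per-token importance ratios at three granularities.

    Assumption (not from the paper): sentence- and response-level ratios
    are geometric means of the token ratios in the sentence / response.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    token_level = [math.exp(d) for d in diffs]

    # Response level: one shared ratio for every token in the response.
    response_ratio = math.exp(sum(diffs) / len(diffs))
    response_level = [response_ratio] * len(diffs)

    # Sentence level: one shared ratio per sentence.
    groups = {}
    for d, s in zip(diffs, sentence_ids):
        groups.setdefault(s, []).append(d)
    sentence_ratio = {s: math.exp(sum(v) / len(v)) for s, v in groups.items()}
    sentence_level = [sentence_ratio[s] for s in sentence_ids]
    return token_level, sentence_level, response_level

def outside_clip_range(ratios, eps=0.2):
    """Simplified check: True where the ratio leaves [1 - eps, 1 + eps]."""
    return [abs(r - 1.0) > eps for r in ratios]

# Toy response: two sentences of three tokens, with one outlier token.
logp_old = [0.0] * 6
logp_new = [0.0, 0.0, 0.0, 1.5, 0.0, 0.0]
sentence_ids = [0, 0, 0, 1, 1, 1]
tok, sent, resp = importance_ratios(logp_new, logp_old, sentence_ids)
dropped = [sum(outside_clip_range(r)) for r in (tok, sent, resp)]
```

In this toy case a single outlier pushes the shared response-level ratio out of range for all six tokens, while the sentence-level ratio only affects the three tokens of the sentence containing the outlier.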

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

RLoop：具有迭代策略初始化的强化学习的自我改进框架

Authors: Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.04285
Pdf link: https://arxiv.org/pdf/2511.04285
Abstract While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
中文摘要 虽然可验证奖励强化学习（RLVR）在训练大型推理模型方面很强大，但其训练动力学面临着一个关键挑战：RL 过拟合，即模型获得训练奖励但失去泛化。我们的分析表明，这是由于政策过度专业化和灾难性地忘记了培训期间产生的各种解决方案。标准优化抛弃了这种宝贵的步骤间策略多样性。为了解决这个问题，我们引入了 RLoop，这是一个基于迭代策略初始化构建的自我改进框架。RLoop 将标准训练过程转变为良性循环：它首先使用 RL 从给定策略中探索解决方案空间，然后过滤成功的轨迹以创建专家数据集。该数据集通过拒绝采样微调（RFT）用于细化初始策略，为下一次迭代创建更好的起点。这种通过迭代重新初始化进行探索和利用的循环有效地将瞬态策略变化转化为稳健的性能提升。我们的实验表明，与普通 RL 相比，RLoop 可以减轻遗忘并显着提高泛化能力，将平均准确率提高 9%，pass@32提高 15% 以上。
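
The explore, filter, and re-initialize cycle described above can be sketched as a short loop. This is only a structural sketch under stated assumptions: `rl_explore`, `is_success`, and `rft_update` are hypothetical callables standing in for the RL phase, the trajectory filter, and Rejection-sampling Fine-Tuning, none of which are specified in the abstract.

```python
def rloop(policy, rl_explore, is_success, rft_update, iterations=3):
    """Iterative policy initialization, as a skeleton.

    rl_explore(policy)      -> trajectories gathered while running RL from policy
    is_success(trajectory)  -> True if the trajectory should enter the expert set
    rft_update(policy, data)-> policy fine-tuned on the expert data
    All three are placeholders supplied by the caller.
    """
    for _ in range(iterations):
        trajectories = rl_explore(policy)            # exploration phase
        expert = [t for t in trajectories if is_success(t)]
        if expert:                                   # skip the update if nothing succeeded
            policy = rft_update(policy, expert)      # refine the starting policy
    return policy

# Toy stand-ins: the "policy" is a number, and trajectories are (value, reward) pairs.
def toy_explore(p):
    return [(p + d, d) for d in (-1, 0, 1, 2)]

def toy_success(t):
    return t[1] > 0

def toy_rft(p, expert):
    return sum(v for v, _ in expert) / len(expert)

final_policy = rloop(0.0, toy_explore, toy_success, toy_rft, iterations=3)
```

Note that the update is applied to the policy each iteration started from, not to the end state of the RL run, which mirrors the abstract's description of refining the initial policy.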

GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

GUI-360：计算机使用代理的综合数据集和基准测试

Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.04307
Pdf link: https://arxiv.org/pdf/2511.04307
Abstract We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public on this https URL.
中文摘要 我们介绍了 GUI-360$^\circ$，这是一个大规模、全面的数据集和基准测试套件，旨在推进计算机使用代理（CUA）。CUA 面临着独特的挑战，并受到三个持续差距的限制：现实世界的 CUA 任务稀缺、缺乏用于多模态轨迹的自动收集和注释管道，以及缺乏联合评估 GUI 接地、屏幕解析和动作预测的统一基准。GUI-360$^\circ$ 通过 LLM 增强的、大部分自动化的管道解决了这些差距，用于查询溯源、环境模板构建、任务实例化、批量执行和 LLM 驱动的质量过滤。发布的语料库包含超过 1.2M 个执行的作步骤，涵盖流行的 Windows Office 应用程序中的数千条轨迹，并包括全分辨率屏幕截图、可访问性元数据（如果可用）、实例化目标、中间推理跟踪以及成功和失败的作轨迹。该数据集支持三个规范任务，即 GUI 接地、屏幕解析和动作预测，以及反映现代智能体设计的混合 GUI+API 动作空间。在 GUI-360$^\circ$ 上对最先进的视觉语言模型进行基准测试，揭示了在基础和动作预测方面的大量开箱即用的缺陷;监督微调和强化学习产生了显着的收益，但并没有缩小与人类水平可靠性的差距。我们发布了 GUI-360$^\circ$ 和随附的代码，以促进可重复的研究并加速健壮桌面 CUA 的进展。完整的数据集已在此 https URL 上公开。

MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments

MacroNav：多任务上下文表示学习实现未知环境中的高效导航

Authors: Kuankuan Sima, Longbin Tang, Haozhe Ma, Lin Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04320
Pdf link: https://arxiv.org/pdf/2511.04320
Abstract Autonomous navigation in unknown environments requires compact yet expressive spatial understanding under partial observability to support high-level decision making. Existing approaches struggle to balance rich contextual representation with navigation efficiency. We present MacroNav, a learning-based navigation framework featuring two key components: (1) a lightweight context encoder trained via multi-task self-supervised learning to capture multi-scale, navigation-centric spatial representations; and (2) a reinforcement learning policy that seamlessly integrates these representations with graph-based reasoning for efficient action selection. Extensive experiments demonstrate the context encoder's efficient and robust environmental understanding. Real-world deployments further validate MacroNav's effectiveness, yielding significant gains over state-of-the-art navigation methods in both Success Rate (SR) and Success weighted by Path Length (SPL), while maintaining low computational cost. Code will be released upon acceptance.
中文摘要 未知环境中的自主导航需要在部分可观测性下紧凑而富有表现力的空间理解，以支持高级决策。现有方法难以平衡丰富的上下文表示与导航效率。我们提出了 MacroNav，这是一个基于学习的导航框架，具有两个关键组件：（1）通过多任务自监督学习训练的轻量级上下文编码器，以捕获多尺度、以导航为中心的空间表示;（2）强化学习策略，将这些表示与基于图的推理无缝集成，以实现有效的行动选择。广泛的实验证明了上下文编码器高效而稳健的环境理解能力。实际部署进一步验证了 MacroNav 的有效性，在成功率（SR）和按路径长度加权的成功率（SPL）方面都比最先进的导航方法取得了显着的收益，同时保持了较低的计算成本。代码将在接受后发布。
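
For readers unfamiliar with the two metrics named above, the following sketch computes Success Rate (SR) and Success weighted by Path Length (SPL) using the standard definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path actually taken. The episode data is made up for illustration and has no connection to the paper's results.

```python
def success_rate(successes):
    """Fraction of episodes that reached the goal."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length over a set of episodes."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)   # failed episodes contribute 0
    return total / len(successes)

# Illustrative episodes: optimal success, failure, success with a 2x detour.
successes = [1, 0, 1]
shortest = [10.0, 5.0, 8.0]
taken = [10.0, 7.0, 16.0]
sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
```

SPL is always at most SR, and the gap between them reflects how much longer successful paths were than the shortest possible ones.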

Temporal Action Selection for Action Chunking

动作分块的时间动作选择

Authors: Yueyang Weng, Xiaopeng Zhang, Yongjin Mu, Yingcong Zhu, Yanjie Li, Qi Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.04421
Pdf link: https://arxiv.org/pdf/2511.04421
Abstract Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks rather than single-step actions, action chunking significantly enhances modeling capabilities for human expert policies. However, the reduced decision frequency restricts the utilization of recent observations, degrading reactivity - particularly evident in the inadequate adaptation to sensor noise and dynamic environmental changes. Existing efforts to address this issue have primarily resorted to trading off reactivity against decision consistency, without achieving both. To address this limitation, we propose a novel algorithm, Temporal Action Selector (TAS), which caches predicted action chunks from multiple timesteps and dynamically selects the optimal action through a lightweight selector network. TAS achieves balanced optimization across three critical dimensions: reactivity, decision consistency, and motion coherence. Experiments across multiple tasks with diverse base policies show that TAS significantly improves success rates - yielding an absolute gain of up to 73.3%. Furthermore, integrating TAS as a base policy with residual reinforcement learning (RL) substantially enhances training efficiency and elevates the performance plateau. Experiments in both simulation and physical robots confirm the method's efficacy.
中文摘要 动作分块是从演示中学习（LfD）中广泛采用的方法。通过对多步骤作块而不是单步作进行建模，作分块显着增强了人类专家策略的建模能力。然而，决策频率的降低限制了最近观测结果的利用，降低了反应性——尤其在对传感器噪声和动态环境变化的适应不足方面。解决这个问题的现有努力主要诉诸于权衡反应性与决策一致性，而没有同时实现两者。为了解决这一限制，我们提出了一种新颖的算法，即时间动作选择器（TAS），它缓存来自多个时间步长的预测动作块，并通过轻量级选择器网络动态选择最佳动作。TAS 在三个关键维度上实现了平衡优化：反应性、决策一致性和运动连贯性。具有不同基本策略的多个任务的实验表明，TAS 显着提高了成功率 - 产生高达 73.3% 的绝对收益。此外，将TAS作为基础策略与残差强化学习（RL）相结合，可以大大提高训练效率并提升性能平台期。模拟和物理机器人的实验证实了该方法的有效性。

The Peril of Preference: Why GRPO fails on Ordinal Rewards

偏好的危险：为什么 GRPO 在序数奖励上失败

Authors: Anisha Garg, Ganesh Venkatesh
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.04439
Pdf link: https://arxiv.org/pdf/2511.04439
Abstract Group-relative Policy Optimization's (GRPO) simplicity makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also makes it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When using ordinal rewards to give partial credit, GRPO's simplicity starts to hurt, as its group-average baseline often assigns a positive advantage to failed trajectories and reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just "acceptable" ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback - progressing from binary to ordinal rewards in this work, and onward to denser, per-step supervision.
中文摘要 群体相关策略优化（GRPO）的简单性使得使 LLM 成为特定任务的专家非常可取。但这种简单性也使其不明确，因为我们寻求通过更丰富的非二元反馈来增强 RL 训练。当使用序数奖励给予部分功劳时，GRPO 的简单性开始受到伤害，因为它的组平均基线通常会为失败的轨迹分配积极的优势并强化不正确的行为。我们引入了正确性相对策略优化（CoRPO），这是一种解决这一缺陷的新公式。CoRPO 使用自适应基线来强制执行最低质量阈值，确保失败的解决方案永远不会得到积极的强化。一旦策略始终满足此阈值，基线就会自动过渡到相对偏好模式，推动模型寻找最佳解决方案，而不仅仅是“可接受”的解决方案。我们在代码验证任务上实证验证了 CoRPO，它展示了更稳定的收敛和更好的域外泛化。这项工作代表了我们更广泛的研究计划的关键一步，该计划使法学硕士能够通过强化学习学习真正的新能力。我们通过使法学硕士能够从丰富的多维反馈中学习来实现这一目标——在这项工作中从二进制奖励到序数奖励，再到更密集的每步监督。
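
The failure mode described above is easy to reproduce numerically. The sketch below first computes group-mean advantages (GRPO additionally divides by the group standard deviation, which does not change the sign) and then a thresholded baseline. The rule `max(group mean, threshold)` is an assumption chosen to match the abstract's description, not CoRPO's published formula, and the reward scale and threshold are hypothetical.

```python
def group_mean_advantages(rewards):
    """Advantage relative to the group average (sign matches GRPO's)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def thresholded_advantages(rewards, min_quality):
    """Assumed adaptive baseline: never lower than the quality threshold."""
    baseline = max(sum(rewards) / len(rewards), min_quality)
    return [r - baseline for r in rewards]

# Ordinal rewards on a hypothetical 0..5 scale; a solution counts as correct at >= 3.
weak_group = [0, 0, 2, 4]
grpo_adv = group_mean_advantages(weak_group)          # the reward-2 failure gets +0.5
corpo_adv = thresholded_advantages(weak_group, 3)     # every failure is negative

# Once the whole group clears the threshold, the baseline is the group mean again.
strong_group = [3, 4, 4, 5]
relative_adv = thresholded_advantages(strong_group, 3)
```

With this choice the switch to a relative preference mode falls out of the `max`: when the group mean exceeds the threshold, the baseline coincides with the plain group-mean baseline.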

Fitting Reinforcement Learning Model to Behavioral Data under Bandits

强化学习模型拟合强盗行为数据

Authors: Hao Zhu, Jasper Hoffmann, Baohe Zhang, Joschka Boedecker
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2511.04454
Pdf link: https://arxiv.org/pdf/2511.04454
Abstract We consider the problem of fitting a reinforcement learning (RL) model to some given behavioral data under a multi-armed bandit environment. These models have received much attention in recent years for characterizing human and animal decision making behavior. We provide a generic mathematical optimization problem formulation for the fitting problem of a wide range of RL models that appear frequently in scientific research applications, followed by a detailed theoretical analysis of its convexity properties. Based on the theoretical results, we introduce a novel solution method for the fitting problem of RL models based on convex relaxation and optimization. Our method is then evaluated in several simulated bandit environments to compare with some benchmark methods that appear in the literature. Numerical results indicate that our method achieves comparable performance to the state-of-the-art, while significantly reducing computation time. We also provide an open-source Python package for our proposed method to empower researchers to apply it in the analysis of their datasets directly, without prior knowledge of convex optimization.
中文摘要 我们考虑了在多臂强盗环境中将强化学习（RL）模型拟合到一些给定行为数据的问题。近年来，这些模型因表征人类和动物决策行为而受到广泛关注。针对科研应用中经常出现的多种RL模型的拟合问题，提供了通用的数学优化问题公式，并对其凸性特性进行了详细的理论分析。基于理论结果，提出了一种基于凸松弛和优化的RL模型拟合问题求解方法。然后在几个模拟的强盗环境中评估我们的方法，以与文献中出现的一些基准方法进行比较。数值结果表明，我们的方法实现了与最先进的方法相当的性能，同时显着缩短了计算时间。我们还为我们提出的方法提供了一个开源的 Python 包，使研究人员能够直接将其应用于数据集的分析中，而无需事先了解凸优化。
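
To make the fitting problem concrete, the sketch below writes down the negative log-likelihood of a simple Q-learning model with a softmax choice rule and fits it by grid search. Both the model and the grid search are illustrative assumptions: the abstract does not say which RL models are covered, and the paper's method is based on convex relaxation, which this sketch does not implement. The parameter names `alpha` (learning rate) and `beta` (inverse temperature) are conventional but not taken from the source.

```python
import math

def negative_log_likelihood(alpha, beta, choices, rewards, n_arms=2):
    """NLL of observed choices under Q-learning with softmax action selection."""
    q = [0.0] * n_arms
    nll = 0.0
    for a, r in zip(choices, rewards):
        # Numerically stable log-sum-exp of beta * q.
        m = max(beta * v for v in q)
        log_z = m + math.log(sum(math.exp(beta * v - m) for v in q))
        nll -= beta * q[a] - log_z          # log-probability of the chosen arm
        q[a] += alpha * (r - q[a])          # delta-rule value update
    return nll

def grid_fit(choices, rewards, alphas, betas, n_arms=2):
    """Brute-force baseline: return (nll, alpha, beta) with the lowest NLL."""
    return min(
        (negative_log_likelihood(a, b, choices, rewards, n_arms), a, b)
        for a in alphas
        for b in betas
    )

# Made-up behavioral data for a two-armed bandit.
choices = [0, 1, 0, 0, 1, 0]
rewards = [1, 0, 1, 1, 0, 1]
best_nll, best_alpha, best_beta = grid_fit(
    choices, rewards, alphas=[0.1, 0.5, 0.9], betas=[0.0, 1.0, 5.0]
)
```

Grid search scales poorly with the number of parameters, which is one motivation for optimization-based fitting methods of the kind the abstract describes.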

V-Thinker: Interactive Thinking with Images

V-Thinker：图像交互式思维

Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.04460
Pdf link: https://arxiv.org/pdf/2511.04460
Abstract Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
中文摘要 使大型多模态模型（LMM）能够将图像交互与长视界推理能力深度集成，仍然是该领域长期面临的挑战。以视觉为中心的推理的最新进展探索了 LMM 的一种有前途的“图像思维”范式，标志着从图像辅助推理到图像交互思维的转变。虽然这一里程碑使模型能够专注于细粒度的图像区域，但进展仍然受到有限的可视化工具空间和特定于任务的工作流程设计的限制。为了弥合这一差距，我们推出了 V-Thinker，这是一种通用多模态推理助手，通过端到端强化学习实现交互式、以视觉为中心的思维。V-Thinker 由两个关键组件组成：（1）数据演化飞轮，自动合成、演化和验证交互式推理数据集，涵盖多样性、质量和难度三个维度;（2）视觉渐进式培训课程，首先通过点级监督调整感知，然后通过两阶段强化学习框架整合交互式推理。此外，我们还介绍了 VTBench，这是一个经过专家验证的基准测试，针对以视觉为中心的交互式推理任务。广泛的实验表明，V-Thinker 在一般和交互式推理场景中始终优于基于 LMM 的强大基线，为推进图像交互推理应用提供了宝贵的见解。

End-to-End Reinforcement Learning of Koopman Models for eNMPC of an Air Separation Unit

空分装置eNMPC的Koopman模型的端到端强化学习

Authors: Daniel Mayfrank, Kayra Dernek, Laura Lang, Alexander Mitsos, Manuel Dahmen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.04522
Pdf link: https://arxiv.org/pdf/2511.04522
Abstract With our recently proposed method based on reinforcement learning (Mayfrank et al. (2024), Comput. Chem. Eng. 190), Koopman surrogate models can be trained for optimal performance in specific (economic) nonlinear model predictive control ((e)NMPC) applications. So far, our method has exclusively been demonstrated on a small-scale case study. Herein, we show that our method scales well to a more challenging demand response case study built on a large-scale model of a single-product (nitrogen) air separation unit. Across all numerical experiments, we assume observability of only a few realistically measurable plant variables. Compared to a purely system identification-based Koopman eNMPC, which generates small economic savings but frequently violates constraints, our method delivers similar economic performance while avoiding constraint violations.
中文摘要 通过我们最近提出的基于强化学习的方法（Mayfrank et al. （2024）， Comput. Chem. Eng. 190），可以训练库夫曼代理模型，以在特定的（经济）非线性模型预测控制（（e）NMPC）应用程序中实现最佳性能。到目前为止，我们的方法仅在小规模案例研究中进行了演示。在此，我们表明我们的方法可以很好地扩展到基于单一产品（氮气）空气分离装置的大规模模型构建的更具挑战性的需求响应案例研究。在所有数值实验中，我们假设只有少数几个现实可测量的植物变量具有可观测性。与纯粹基于系统识别的 Koopman eNMPC 相比，后者产生少量经济节约但经常违反约束，我们的方法提供了类似的经济性能，同时避免了约束违规。

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

与环境无关的目标条件反射，无奖励自主学习研究

Authors: Hampus Åström, Elin Anna Topp, Jacek Malec
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.04598
Pdf link: https://arxiv.org/pdf/2511.04598
Abstract In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.
中文摘要 在本文中，我们研究了如何将常规强化学习环境转变为目标条件环境，让智能体学会自主且无奖励地解决任务。我们表明，智能体可以通过以与环境无关的方式选择自己的目标来学习解决任务，其训练时间与外部引导的强化学习相当。我们的方法独立于底层的策略外学习算法。由于我们的方法与环境无关，因此代理不会将任何目标视为高于其他目标的价值，从而导致单个目标的绩效不稳定。然而，在我们的实验中，我们表明平均进球成功率有所提高并趋于稳定。可以指示使用此方法训练的代理寻求在环境中进行的任何观察，从而在特定用例之前对代理进行通用训练。
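
One way to picture the transformation described above is a wrapper that ignores the environment's own reward, samples a goal uniformly from observations seen so far, and reports only whether the goal was reached. This is a minimal sketch under stated assumptions: the `ChainEnv` toy environment, the `reset`/`step` interface, and the uniform goal sampling are all hypothetical and are not taken from the paper.

```python
import random

class ChainEnv:
    """Toy environment (hypothetical): an integer position on the line 0..4."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 or +1; the position is kept inside [0, 4].
        self.pos = min(4, max(0, self.pos + action))
        return self.pos

class GoalConditioned:
    """Wraps an environment so each episode targets a previously seen observation."""

    def __init__(self, env, seed=0):
        self.env = env
        self.seen = []                      # observations encountered so far
        self.rng = random.Random(seed)
        self.goal = None

    def _remember(self, obs):
        if obs not in self.seen:
            self.seen.append(obs)

    def reset(self):
        obs = self.env.reset()
        self._remember(obs)
        # Uniform sampling: no goal is valued more highly than another.
        self.goal = self.rng.choice(self.seen)
        return obs, self.goal

    def step(self, action):
        obs = self.env.step(action)
        self._remember(obs)
        reached = obs == self.goal
        # The goal-reached indicator stands in for an external reward.
        return (obs, self.goal), float(reached), reached

wrapped = GoalConditioned(ChainEnv())
first_obs, first_goal = wrapped.reset()
(next_obs, _), signal, done = wrapped.step(+1)
```

Any off-policy learner that accepts (observation, goal) pairs could be trained on top of such a wrapper, which is consistent with the abstract's claim that the method is independent of the underlying algorithm.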

Forgetting is Everywhere

遗忘无处不在

Authors: Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.04666
Pdf link: https://arxiv.org/pdf/2511.04666
Abstract A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.
中文摘要 开发通用学习算法的一个基本挑战是它们在适应新数据时倾向于忘记过去的知识。解决这个问题需要对遗忘有原则性的理解;然而，尽管进行了数十年的研究，但尚未出现统一的定义来深入了解学习的潜在动态。我们提出了一种与算法和任务无关的理论，该理论将遗忘描述为学习者对未来经验的预测分布缺乏自洽性，表现为预测信息的丢失。我们的理论自然会产生算法遗忘倾向的一般衡量标准。为了验证该理论，我们设计了一套全面的实验，涵盖分类、回归、生成建模和强化学习。我们实证地证明了遗忘在所有学习环境中都是如何存在的，并在决定学习效率方面发挥着重要作用。这些结果共同建立了对遗忘的原则性理解，并为分析和提高通用学习算法的信息保留能力奠定了基础。

GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

GentleHumanoid：学习上半身顺应性，实现接触丰富的人与物交互

Authors: Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao, C. Karen Liu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2511.04679
Pdf link: https://arxiv.org/pdf/2511.04679
Abstract Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.
中文摘要 人形机器人有望在以人为本的环境中运行，在这些环境中，安全和自然的物理交互至关重要。然而，最近的强化学习（RL）政策强调严格跟踪并抑制外力。现有的阻抗增强方法通常仅限于基极或末端执行器控制，并专注于抵抗极端力而不是实现顺应性。我们介绍了 GentleHumanoid，这是一个将阻抗控制集成到全身运动跟踪策略中以实现上半身顺应性的框架。其核心是基于弹簧的统一配方，该配方对电阻触点（按压表面时的恢复力）和导向触点（从人体运动数据中采样的推拉）进行建模。这种公式确保肩部、肘部和手腕的运动学一致的力，同时使策略暴露于不同的交互场景中。通过任务可调整的部队阈值进一步支持安全。我们在模拟和 Unitree G1 人形机器人上评估了我们的方法，这些任务需要不同程度的合规性，包括轻柔的拥抱、从坐到站的辅助和安全的物体作。与基线相比，我们的政策始终减少峰值接触力，同时保持任务成功，从而实现更顺畅、更自然的交互。这些结果凸显了人形机器人朝着安全有效地与人类协作并在现实环境中处理物体的方向迈出了一步。
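
As a rough illustration of a spring-based contact force with an adjustable safety threshold, the sketch below computes a linear spring force toward an anchor point and rescales it if its magnitude exceeds a limit. The abstract does not give the paper's formulation, so the linear spring law, the magnitude clipping, and all names and values here are assumptions.

```python
import math

def spring_force(position, anchor, stiffness, max_force):
    """Linear spring force pulling `position` toward `anchor`, capped in magnitude.

    Assumed model: F = stiffness * (anchor - position), rescaled so that
    its norm never exceeds max_force (a task-adjustable threshold).
    """
    force = [stiffness * (a - p) for a, p in zip(anchor, position)]
    magnitude = math.sqrt(sum(c * c for c in force))
    if magnitude > max_force:
        scale = max_force / magnitude
        force = [c * scale for c in force]
    return force

# Hypothetical numbers: a 0.5 m displacement, 100 N/m stiffness, 10 N force limit.
capped = spring_force([0.0, 0.0, 0.0], [0.3, 0.4, 0.0], stiffness=100.0, max_force=10.0)
gentle = spring_force([0.0, 0.0, 0.0], [0.03, 0.04, 0.0], stiffness=100.0, max_force=10.0)
```

Lowering `max_force` makes the interaction gentler at the cost of weaker restoring behavior, which is the kind of trade-off a task-adjustable threshold exposes.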

Keyword: diffusion policy

There is no result