Arxiv Papers of Today

Generated: 2026-02-02 16:50:00 (UTC+8); arXiv release time: 2026-02-02 20:00 EST (2026-02-03 09:00 UTC+8)

53 relevant papers today

Keyword: reinforcement learning

ShellForge: Adversarial Co-Evolution of Webshell Generation and Multi-View Detection for Robust Webshell Defense

ShellForge：Webshell生成与多视角检测的对抗性共进，实现稳健Webshell防御

Authors: Yizhong Ding
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22182
Pdf link: https://arxiv.org/pdf/2601.22182
Abstract Webshells remain a primary foothold for attackers to compromise servers, particularly within PHP ecosystems. However, existing detection mechanisms often struggle to keep pace with rapid variant evolution and sophisticated obfuscation techniques that camouflage malicious intent. Furthermore, many current defenses suffer from high false-alarm rates when encountering benign administrative scripts that employ heavy obfuscation for intellectual property protection. To address these challenges, we present ShellForge, an adversarial co-evolution framework that couples automated webshell generation with multi-view detection to continuously harden defensive boundaries. The framework operates through an iterative co-training loop where a generator and a detector mutually reinforce each other via the exchange of hard samples. The generator is optimized through supervised fine-tuning and preference-based reinforcement learning to synthesize functional, highly evasive variants. Simultaneously, we develop a multi-view fusion detector that integrates semantic features from long-string compression, structural features from pruned abstract syntax trees, and global statistical indicators such as Shannon entropy. To minimize false positives, ShellForge utilizes an LLM-based transformation to create de-malicious samples--scripts that retain complex obfuscation patterns but lack harmful payloads--serving as high-quality hard negatives during training. Evaluations on the public FWOID benchmark demonstrate that ShellForge significantly enhances defensive robustness. Upon convergence, the detector maintains a 0.981 F1-score while the generator achieves a 0.939 evasion rate against commercial engines on VirusTotal.
中文摘要 Webshell依然是攻击者攻破服务器的主要据点，尤其是在PHP生态系统中。然而，现有的检测机制往往难以跟上快速变异和复杂混淆技术的步伐，这些技术掩盖了恶意意图。此外，许多现有防御措施在遇到利用大量混淆技术保护知识产权的良性行政脚本时，存在高误报率。为应对这些挑战，我们提出了ShellForge，一个对抗性共进化框架，将自动网页壳生成与多视角检测相结合，持续加固防御边界。该框架通过迭代共训练循环运行，生成器和探测器通过交换硬样本相互强化。生成器通过监督微调和基于偏好的强化学习进行优化，以合成功能性且高度回避的变体。同时，我们开发了一种多视角融合检测器，整合了长字符串压缩的语义特征、修剪抽象语法树的结构特征以及如香农熵等全局统计指标。为了减少误报，ShellForge 利用基于大型语言模型的转换来创建去恶意样本——这些脚本保留了复杂的混淆模式，但没有有害负载——在训练过程中作为高质量的硬负片。公开FWID基准测试的评估显示，ShellForge显著提升了防御稳健性。收敛后，探测器保持0.981的F1分数，而发生器在VirusTotal上对商用发动机的规避率为0.939。
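The ShellForge abstract lists Shannon entropy among the detector's global statistical indicators. As a minimal sketch of that one indicator only (the function name `shannon_entropy` and the byte-level granularity are assumptions for illustration, not details taken from the paper), entropy can be computed over the raw bytes of a script:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the observed byte values
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical inputs: a repetitive string versus a maximally varied one.
low = shannon_entropy(b"aaaaaaaa")
high = shannon_entropy(bytes(range(256)))
```

Heavily encoded or compressed payloads tend to score closer to the 8-bit maximum than ordinary source text, which is why a feature like this can be informative, although on its own it cannot separate malicious obfuscation from benign obfuscation.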

Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

基于组合行动的强化学习潜在球形流策略

Authors: Lingkai Kong, Anagha Satish, Hezi Jiang, Akseli Kangaslahti, Andrew Ma, Wenbo Chen, Mingxiao Song, Lily Xu, Milind Tambe
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22211
Pdf link: https://arxiv.org/pdf/2601.22211
Abstract Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced latent spherical flow policy that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a stochastic policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.
中文摘要 结合组合动作空间的强化学习（RL）依然具有挑战性，因为可行动作集呈指数级巨大，且受复杂可行性约束约束，使得直接策略参数化不切实际。现有方法将任务特定的价值函数嵌入受限优化程序，或学习确定性结构化策略，牺牲了普遍性和策略表达性。我们提出了一种由求解器诱导的\emph{潜在球面流策略}，将现代生成策略的表达性带入组合强化学习，同时通过设计保证可行性。我们的方法LSFlow通过球面流匹配在紧凑连续潜空间中学习\emph{随机}策略，并将可行性委托给组合优化求解器，该求解器将每个潜在样本映射到有效的结构化作用。为提高效率，我们直接在潜在空间训练价值网络，避免在策略优化过程中重复求解器调用。为解决基于求解器的动作选择引起的分段常数和不连续值景观，我们引入了平滑化的贝尔曼算子，能够获得稳定且明确的学习目标。从实证来看，我们的方法在一系列具有挑战性的组合强化学习任务中，平均比最先进的基线高出20.6%。
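The LSFlow abstract describes sampling a point in a latent space on a sphere and letting a combinatorial solver map it to a feasible action. The sketch below illustrates only that division of labor, under loud assumptions: the learned flow-matching policy is replaced by plain normalized Gaussian noise, and the "solver" is a toy top-k selection (maximizing a linear objective under a cardinality constraint). The names `sample_unit_vector` and `topk_solver` are hypothetical.

```python
import math
import random


def sample_unit_vector(dim, rng):
    """Draw a point on the unit sphere by normalizing Gaussian noise.
    Stand-in for a learned spherical flow policy (assumption)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def topk_solver(latent, k):
    """Toy solver: maximize sum(latent[i] * a[i]) over binary actions
    with exactly k ones, so every output is feasible by construction."""
    chosen = sorted(range(len(latent)), key=lambda i: latent[i], reverse=True)[:k]
    action = [0] * len(latent)
    for i in chosen:
        action[i] = 1
    return action


rng = random.Random(0)
z = sample_unit_vector(6, rng)   # stochastic latent sample
action = topk_solver(z, k=2)     # feasible structured action
```

Because the solver output changes only when the ranking of latent coordinates changes, the map from latent to action is piecewise constant, which is the discontinuity the abstract says its smoothed Bellman operator is meant to address.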

Aligning Microscopic Vehicle and Macroscopic Traffic Statistics: Reconstructing Driving Behavior from Partial Data

微观车辆与宏观交通统计的对齐：从部分数据重建驾驶行为

Authors: Zhihao Zhang, Keith Redmill, Chengyang Peng, Bowen Weng
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.22242
Pdf link: https://arxiv.org/pdf/2601.22242
Abstract A driving algorithm that aligns with good human driving practices, or at the very least collaborates effectively with human drivers, is crucial for developing safe and efficient autonomous vehicles. In practice, two main approaches are commonly adopted: (i) supervised or imitation learning, which requires comprehensive naturalistic driving data capturing all states that influence a vehicle's decisions and corresponding actions, and (ii) reinforcement learning (RL), where the simulated driving environment either matches or is intentionally more challenging than real-world conditions. Both methods depend on high-quality observations of real-world driving behavior, which are often difficult and costly to obtain. State-of-the-art sensors on individual vehicles can gather microscopic data, but they lack context about the surrounding conditions. Conversely, roadside sensors can capture traffic flow and other macroscopic characteristics, but they cannot associate this information with individual vehicles on a microscopic level. Motivated by this complementarity, we propose a framework that reconstructs unobserved microscopic states from macroscopic observations, using microscopic data to anchor observed vehicle behaviors, and learns a shared policy whose behavior is microscopically consistent with the partially observed trajectories and actions and macroscopically aligned with target traffic statistics when deployed population-wide. Such constrained and regularized policies promote realistic flow patterns and safe coordination with human drivers at scale.
中文摘要 一个符合良好人类驾驶习惯，或至少与人类驾驶员有效协作的驾驶算法，对于开发安全高效的自动驾驶车辆至关重要。在实践中，通常采用两种主要方法：（i）监督式或模仿式学习，需要全面自然的驾驶数据，捕捉所有影响车辆决策和相应动作的状态;（ii）强化学习（RL），其中模拟驾驶环境要么匹配，要么有意比现实世界条件更具挑战性。这两种方法都依赖于对真实驾驶行为的高质量观察，而这些观测往往既困难又昂贵。单车上的先进传感器可以收集微观数据，但缺乏对周围环境的背景信息。相反，路边传感器可以捕捉交通流量和其他宏观特征，但无法将这些信息与单个车辆在微观层面关联起来。基于这种互补性，我们提出了一个框架，利用微观数据锚定观察到的车辆行为，从宏观观测中重建未观测到的微观状态，并学习一套共享政策，其行为在微观上与部分观测的轨迹和行为一致，并在全人群部署时与目标交通统计数据宏观对齐。这种受限且规范化的政策促进了真实的流动模式和大规模的人工司机安全协调。

Learning Reward Functions for Cooperative Resilience in Multi-Agent Systems

多智能体系统中合作弹性的奖励函数学习

Authors: Manuela Chacon-Chamorro, Luis Felipe Giraldo, Nicanor Quijano
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22292
Pdf link: https://arxiv.org/pdf/2601.22292
Abstract Multi-agent systems often operate in dynamic and uncertain environments, where agents must not only pursue individual goals but also safeguard collective functionality. This challenge is especially acute in mixed-motive multi-agent systems. This work focuses on cooperative resilience, the ability of agents to anticipate, resist, recover, and transform in the face of disruptions, a critical yet underexplored property in Multi-Agent Reinforcement Learning. We study how reward function design influences resilience in mixed-motive settings and introduce a novel framework that learns reward functions from ranked trajectories, guided by a cooperative resilience metric. Agents are trained in a suite of social dilemma environments using three reward strategies: i) a traditional individual reward; ii) a resilience-inferred reward; and iii) a hybrid that balances both. We explore three reward parameterizations (linear models, hand-crafted features, and neural networks) and employ two preference-based learning algorithms to infer rewards from behavioral rankings. Our results demonstrate that the hybrid strategy significantly improves robustness under disruptions without degrading task performance and reduces catastrophic outcomes such as resource overuse. These findings underscore the importance of reward design in fostering resilient cooperation, and represent a step toward developing robust multi-agent systems capable of sustaining cooperation in uncertain environments.
中文摘要 多智能体系统通常在动态且不确定的环境中运行，代理不仅要追求个人目标，还要保障集体功能。这一挑战在混合动机多智能体系统中尤为突出。本研究聚焦于合作韧性，即代理在面对中断时预判、抵抗、恢复和转化的能力，这是多代理强化学习中一个关键但尚未被充分探讨的特性。我们研究了奖励函数设计如何影响混合动机环境中的韧性，并引入了一个新框架，从排序轨迹学习奖励函数，辅以合作韧性指标。代理在一套社会困境环境中接受培训，采用三种奖励策略：i）传统个人奖励;ii）韧性推断奖励;以及三）实现平衡的混合体。我们探索了三种奖励参数化——线性模型、手工设计特征和神经网络，并采用两种基于偏好的学习算法从行为排名中推断奖励。我们的结果表明，混合策略在中断下显著提升了鲁棒性，同时不降低任务表现，并减少资源过度使用等灾难性后果。这些发现强调了奖励设计在促进韧性合作中的重要性，并代表了开发能够在不确定环境中维持合作的稳健多智能体系统迈出的一步。
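The abstract above mentions inferring rewards from ranked trajectories with preference-based learning, but does not name the algorithms. As a hedged illustration of the general idea only, the sketch below uses a Bradley-Terry preference model over a linear reward; the choice of Bradley-Terry, the feature vectors, and the names `trajectory_return` and `preference_prob` are all assumptions, not the authors' method.

```python
import math


def trajectory_return(weights, trajectory):
    """Sum of linear rewards w . phi(s) over the feature vectors of a trajectory."""
    return sum(sum(w * f for w, f in zip(weights, feats)) for feats in trajectory)


def preference_prob(weights, traj_a, traj_b):
    """Bradley-Terry probability that traj_a is ranked above traj_b."""
    diff = trajectory_return(weights, traj_a) - trajectory_return(weights, traj_b)
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical two-step trajectories with two features per step.
weights = [1.0, -0.5]
better = [[1.0, 0.0], [1.0, 0.0]]
worse = [[0.0, 1.0], [0.0, 1.0]]
p = preference_prob(weights, better, worse)
```

Fitting the weights would then amount to maximizing the log of this probability over all ranked pairs, so that trajectories ranked higher by the resilience metric receive higher inferred return.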

Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

为多智能体辩论准备推理语言模型，配合自我辩论强化学习

Authors: Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22297
Pdf link: https://arxiv.org/pdf/2601.22297
Abstract The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.
中文摘要 大型语言模型（LLMs）的推理能力通过可验证奖励强化学习（RLVR）得到了显著提升。在测试阶段，通过多智能体辩论（MAD）进行协作推理已成为提升LLM性能的有前景方法。然而，当前的RLVR方法通常训练LLM以单独解决问题，而未明确准备它们综合并受益于辩论中出现的不同理据。在本研究中，我们提出了自我辩论强化学习（SDRL）的培训框架，该框架为单个大型语言模型提供强大的独立问题解决能力，并具备从MAD中多样推理轨迹中学习的能力。给定提示时，SDRL首先采样多个候选解答，然后构建一个具有多样推理路径的辩论语境，并基于该情境生成第二回合的回应。最后，SDRL联合优化了初始和辩论条件反应，形成了一个既能作为独立求解器又能作为辩论参与者有效的模型。跨多个基模型和推理基准的实验表明，SDRL不仅提升了整体MAD性能，还能增强单一模型推理能力。

Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning

范畴模型：通过预置推理实现可扩展且可控的路由

Authors: Qi Cao, Shuhao Zhang, Ruizhe Zhou, Ruiyi Zhang, Peijia Qin, Pengtao Xie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22323
Pdf link: https://arxiv.org/pdf/2601.22323
Abstract Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting their cost and performance. Trained with reinforcement learning, SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
中文摘要 模型路由为每个查询选择使用哪种语言模型。通过将简单的查询发送给更便宜的模型，将困难的查询发送给更强的模型，可以显著降低推理成本，同时保持高准确性。然而，大多数现有路由器将其视为少数型号中的固定选择，这使得它们难以适应新型号或预算变化。本文提出了SCOPE（可扩展且可控的结果性能估计器），这是一种超越模型选择、通过预测成本和性能的路由框架。通过强化学习训练，SCOPE通过检索模型在类似问题上的行为，而非依赖固定模型名称，从而基于推理进行预测，使其能够处理新的、未被发现的模型。此外，通过明确预测模型的准确性和成本，它将路由变成一个动态决策问题，使用户能够轻松控制准确性与成本之间的权衡。实验表明，SCOPE不仅仅是一个节约成本的工具。它灵活适应用户需求：当性能优先时，准确率可提升多达25.7%，效率最关键时可降低多达95.1%的成本。
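The SCOPE abstract frames routing as a decision over predicted accuracy and predicted cost, with a user-controlled trade-off. A minimal sketch of that final decision step follows, assuming the predictions are already available; the scoring rule (accuracy minus weighted cost), the `route` function, and the numbers are illustrative assumptions rather than the paper's formulation.

```python
def route(predictions, cost_weight):
    """Pick the model name maximizing predicted accuracy minus weighted cost.

    predictions: dict mapping model name -> (predicted_accuracy, predicted_cost)
    cost_weight: 0.0 means accuracy only; larger values favor cheaper models
    """
    return max(
        predictions,
        key=lambda name: predictions[name][0] - cost_weight * predictions[name][1],
    )


# Hypothetical per-query predictions for two candidate models.
predictions = {
    "small": (0.70, 0.1),
    "large": (0.90, 1.0),
}
accuracy_first = route(predictions, cost_weight=0.0)
budget_first = route(predictions, cost_weight=1.0)
```

Because the decision depends only on predicted numbers and not on a fixed list of model names, a new model can be added by supplying its predictions, which matches the adaptability the abstract claims.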

Quantum-Inspired Reinforcement Learning for Secure and Sustainable AIoT-Driven Supply Chain Systems

量子启发强化学习，实现安全且可持续的AIoT驱动供应链系统

Authors: Muhammad Bilal Akram Dastagir, Omer Tariq, Shahid Mumtaz, Saif Al-Kuwari, Ahmed Farouk
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2601.22339
Pdf link: https://arxiv.org/pdf/2601.22339
Abstract Modern supply chains must balance high-speed logistics with environmental impact and security constraints, prompting a surge of interest in AI-enabled Internet of Things (AIoT) solutions for global commerce. However, conventional supply chain optimization models often overlook crucial sustainability goals and cyber vulnerabilities, leaving systems susceptible to both ecological harm and malicious attacks. To tackle these challenges simultaneously, this work integrates a quantum-inspired reinforcement learning framework that unifies carbon footprint reduction, inventory management, and cryptographic-like security measures. We design a quantum-inspired reinforcement learning framework that couples a controllable spin-chain analogy with real-time AIoT signals and optimizes a multi-objective reward unifying fidelity, security, and carbon costs. The approach learns robust policies with stabilized training via value-based and ensemble updates, supported by window-normalized reward components to ensure commensurate scaling. In simulation, the method exhibits smooth convergence, strong late-episode performance, and graceful degradation under representative noise channels, outperforming standard learned and model-based references, highlighting its robust handling of real-time sustainability and risk demands. These findings reinforce the potential for quantum-inspired AIoT frameworks to drive secure, eco-conscious supply chain operations at scale, laying the groundwork for globally connected infrastructures that responsibly meet both consumer and environmental needs.
中文摘要 现代供应链必须在高速物流与环境影响和安全约束之间取得平衡，这促使全球商业对基于人工智能的物联网（AIoT）解决方案的兴趣激增。然而，传统的供应链优化模型常常忽视关键的可持续发展目标和网络漏洞，使系统容易受到生态破坏和恶意攻击的影响。为同时应对这些挑战，本研究整合了量子启发的强化学习框架，整合碳足迹减少、库存管理和类似密码学的安全措施。我们设计了一个量子启发的强化学习框架，将可控自旋链类比与实时AIoT信号结合，优化多目标奖励，统一忠实度、安全性和碳成本。该方法通过基于价值和集成更新的稳定训练学习稳健的策略，辅以窗口归一化的奖励组件，确保相应的规模化。在仿真中，该方法展现出平滑收敛、强劲的后期表现以及在代表性噪声信道下的优雅降级，优于标准的学习和基于模型的参考，凸显了其对实时可持续性和风险需求的稳健处理能力。这些发现强化了量子启发的AIoT框架推动大规模安全、环保供应链运营的潜力，为全球互联基础设施奠定基础，负责任地满足消费者和环境需求。

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

SAIR：通过上下文强化学习实现的多阶段高效机器学习流水线自动扩展

Authors: Jianchang Su, Yifan Zhang, Shengkai Lin, Shizhen Zhao, Yusheng Zheng, Yiwei Yang, Wei Zhang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2601.22397
Pdf link: https://arxiv.org/pdf/2601.22397
Abstract Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.
中文摘要 多阶段机器学习推断流水线由于资源异构、跨阶段耦合和动态瓶颈迁移，难以实现自动扩展。我们介绍SAIR，一个自扩展框架，利用LLM作为上下文内强化学习控制器，改进其在线策略，从无梯度更新的奖励标记交互历史改进。SAIR 结合了帕累托优势奖励塑造、可证明的分离裕度、惊喜引导的体验检索以提升上下文效率，以及通过用户空间 CUDA 拦截实现细粒度 GPU 速率控制。我们提供遗憾分析，将误差分解为检索覆盖和LLM选择组件。在三种工作负载模式下的四条机器学习服务流水线上，SAIR 在部署基线中实现了最佳或并列最佳的 P99 延迟和有效资源成本，在 GPU 速率控制假设下提升 P99 最多 50%，有效成本降低高达 97%，瓶颈检测准确率为 86%，且无离线训练。
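The SAIR abstract refers to Pareto-dominance reward shaping over latency and cost. The shaping rule itself is not given in the abstract, so the sketch below shows only the underlying dominance relation for objectives where lower is better; the function names and the sample (P99 latency, cost) tuples are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (P99 latency in ms, cost) outcomes of three scaling actions.
outcomes = [(100, 3.0), (200, 2.0), (300, 3.0)]
front = pareto_front(outcomes)
```

A reward built on this relation could, for example, score an action by whether its outcome dominates the previous one, but that specific design is an assumption here.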

Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology

在大型语言模型中无偿探索揭示了心理学的潜在学习

Authors: Jian Xiong, Jingbo Zhou, Zihan Zhou, Yixiong Xiao, Le Zhang, Jingyong Ye, Rui Qian, Yang Zhou, Dejing Dou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22474
Pdf link: https://arxiv.org/pdf/2601.22474
Abstract Latent learning, classically theorized by Tolman, shows that biological agents (e.g., rats) can acquire internal representations of their environment without rewards, enabling rapid adaptation once rewards are introduced. In contrast, from a cognitive science perspective, reward learning remains overly dependent on external feedback, limiting flexibility and generalization. Although recent advances in the reasoning capabilities of large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, mark a significant breakthrough, these models still rely primarily on reward-centric reinforcement learning paradigms. Whether and how the well-established phenomenon of latent learning in psychology can inform or emerge within LLMs' training remains largely unexplored. In this work, we present novel findings from our experiments that LLMs also exhibit the latent learning dynamics. During an initial phase of unrewarded exploration, LLMs display modest performance improvements, as this phase allows LLMs to organize task-relevant knowledge without being constrained by reward-driven biases, and performance is further enhanced once rewards are introduced. LLMs post-trained under this two-stage exploration regime ultimately achieve higher competence than those post-trained with reward-based reinforcement learning throughout. Beyond these empirical observations, we also provide theoretical analyses for our experiments explaining why unrewarded exploration yields performance gains, offering a mechanistic account of these dynamics. Specifically, we conducted extensive experiments across multiple model families and diverse task domains to establish the existence of the latent learning dynamics in LLMs.
中文摘要 潜学习，由托尔曼经典理论提出，表明生物制剂（如大鼠）可以在没有奖励的情况下获得其环境的内部表征，从而在引入奖励后实现快速适应。相比之下，从认知科学的角度来看，奖励学习仍然过度依赖外部反馈，限制了灵活性和泛化性。尽管大型语言模型（LLMs）推理能力（如OpenAI-o1和DeepSeek-R1）的最新进展标志着重大突破，但这些模型仍主要依赖以奖励为中心的强化学习范式。心理学中已确立的潜在学习现象是否以及如何能够在LLMs的训练中提供指导或出现，至今仍大多未被充分探讨。本研究中，我们展示了实验中新发现，即大型语言模型也表现出潜在学习动态。在无奖励探索的初始阶段，LLM表现适度提升，因为该阶段允许LLM组织任务相关知识而不受奖励驱动偏见的限制，且引入奖励后性能进一步提升。在这两阶段探索体系下进行后培训的LLM最终比那些通过奖励型强化学习后培训的学生达到更高的能力。除了这些实证观察外，我们还为实验提供了理论分析，解释为何未获回报的探索会带来性能提升，并对这些动态进行了机械解释。具体来说，我们对多个模型家族和多样任务领域进行了广泛实验，以确立LLMs中潜在学习动态的存在。

Continual Policy Distillation from Distributed Reinforcement Learning Teachers

分布式强化学习教师的持续政策提炼

Authors: Yuxuan Li, Qijun He, Mingqi Yuan, Wen-Tse Chen, Jeff Schneider, Jiayu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22475
Pdf link: https://arxiv.org/pdf/2601.22475
Abstract Continual Reinforcement Learning (CRL) aims to develop lifelong learning agents to continuously acquire knowledge across diverse tasks while mitigating catastrophic forgetting. This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks. While various enhancement strategies for both aspects have been proposed, achieving scalable performance by directly applying RL to sequential task streams remains challenging. In this paper, we propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model. This design is motivated by the observation that RL excels at solving single tasks, while policy distillation -- a relatively stable supervised learning process -- is well aligned with large foundation models and multi-task learning. Moreover, a mixture-of-experts (MoE) architecture and a replay-based approach are employed to enhance the plasticity and stability of the continual policy distillation process. Extensive experiments on the Meta-World benchmark demonstrate that our framework enables efficient continual RL, recovering over 85% of teacher performance while constraining task-wise forgetting to within 10%.
中文摘要 持续强化学习（CRL）旨在开发终身学习主体，使其能够在多样化任务中持续获取知识，同时减轻灾难性遗忘。这需要高效管理稳定性与可塑性困境，并利用已有经验快速推广到新颖任务。尽管针对这两个方面已有多种增强策略，但通过直接将强化学习应用于顺序任务流实现可扩展性能仍具挑战性。本文提出了一种新颖的师生框架，将CRL解耦为两个独立过程：通过分布式强化学习训练单任务教师模型，并不断提炼为一个中心的通用模型。这一设计的动机源于这样一个观察：强化学习在解决单一任务方面表现出色，而策略提炼——一种相对稳定的监督学习过程——与大型基础模型和多任务学习高度契合。此外，采用专家混合架构和基于重放的方法，以增强持续政策提炼过程的可塑性和稳定性。Meta-World基准测试的大量实验表明，我们的框架能够高效地实现持续强化学习，在任务限制下，将遗忘率限制在10%以内，恢复了超过85%的教师表现。
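The abstract above describes policy distillation as a supervised process that transfers single-task teachers into one generalist student. The abstract does not state the loss, so the following is a sketch under the common assumption that the student is trained to minimize the KL divergence from the teacher's action distribution; `kl_divergence` and the example distributions are illustrative.

```python
import math


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete action distributions.
    Assumes student probabilities are positive wherever teacher's are."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


# Hypothetical action probabilities in one state.
teacher_probs = [0.9, 0.1]
untrained_student = [0.5, 0.5]
matched_student = [0.9, 0.1]

loss_before = kl_divergence(teacher_probs, untrained_student)
loss_after = kl_divergence(teacher_probs, matched_student)
```

Averaging this quantity over states visited by each teacher gives an ordinary supervised objective, which is the sense in which distillation is more stable than running RL directly on a task stream.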

RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

RulePlanner：用于统一3D平面规划设计规则的一体化强化学习器

Authors: Ruizhe Zhong, Xingbo Du, Junchi Yan
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22476
Pdf link: https://arxiv.org/pdf/2601.22476
Abstract Floorplanning determines the coordinates and shape of each module in integrated circuits. With the scaling of technology nodes, adhering to complex hardware design rules during the floorplanning stage has become increasingly challenging, especially in 3D scenarios with multiple stacked layers. Current methods are only capable of handling specific and limited design rules, while violations of other rules require manual and meticulous adjustment. This leads to labor-intensive and time-consuming post-processing for expert engineers. In this paper, we propose an all-in-one deep reinforcement learning-based approach to tackle these challenges, and design novel representations for real-world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability is well demonstrated on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design. Code will be available at: this https URL
中文摘要 楼层规划决定了集成电路中每个模块的坐标和形状。随着技术节点的扩展，尤其是在层楼规划阶段，尤其是多层叠加的三维场景中，遵守复杂的硬件设计规则变得越来越具有挑战性。现有方法只能处理特定且有限的设计规则，而其他规则的违反则需要人工且细致的调整。这导致专家工程师的后期处理既劳动密集又耗时。本文提出了一种基于深度强化学习的全合方法来应对这些挑战，并设计了以往方法未曾解决的真实IC设计规则的新颖表示方式。具体来说，各种硬件设计规则的处理被统一为一个框架，包含三个关键组成部分：1）新颖的矩阵表示以建模设计规则，2）对动作空间的约束以过滤导致规则违规的无效动作，3）作为奖励信号对约束满足的定量分析。公开基准测试展示了我们方法的有效性和有效性。此外，可转移性在未见电路上得到了很好的证明。我们的框架可扩展以适应新的设计规则，从而为应对未来芯片设计中新兴挑战提供灵活性。代码可在以下链接获取：此 https URL

SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

SSL：智能体优化中差异化指导的甜点学习

Authors: Jinyang Wu, Changpeng Yang, Yuhao Shen, Fangzhi Xu, Bolin Ni, Chonghua Liao, Yuchen Liu, Hongzhen Wang, Shuai Nie, Shuai Zhang, Haoran Luo, Jiaming Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22491
Pdf link: https://arxiv.org/pdf/2601.22491
Abstract Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the "sweet spot" concept in tennis, the racket's core region that produces optimal hitting effects, we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5x sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.
中文摘要 带有可验证奖励的强化学习已成为训练智能体的强大范式。然而，现有方法通常采用二元奖励，未能捕捉实现相同结果的轨迹间的质量差异，从而忽视了解决方案空间内潜在的多样性。受网球“甜蜜点”概念启发——网球核心区域产生最佳击球效果，我们引入了 \textbf{S}pot \textbf{L}earning （\textbf{SSL}），这是一个为代理优化提供差异化指导的新框架。SSL遵循一个简单而有效的原则：逐步放大、分层奖励引导政策进入解决方案领域的甜蜜区间。这一原则自然适用于各种任务：视觉感知任务利用距离分层建模来奖励接近，而复杂推理任务则奖励朝着有前景解决方案的渐进式进展。我们理论上证明SSL保持了最优解的排序，并提升了梯度信噪比，从而促进了更有针对性的优化。涵盖图形用户界面感知、短期/长期规划和复杂推理任务的广泛实验显示，在12个基准测试中相较强基线持续提升，样本效率提升高达2.5倍，且跨任务可有效转移。我们的工作确立了SSL作为培训有能力且稳健代理的通用原则。
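The SSL abstract describes distance-tiered rewards that grow as a prediction gets closer to the target. A minimal sketch of such a tiered reward follows; the tier radii, the reward values, and the name `tiered_reward` are hypothetical choices for illustration and are not taken from the paper.

```python
def tiered_reward(distance, tiers):
    """Return the reward of the tightest tier whose radius contains `distance`.

    tiers: list of (radius, reward) pairs sorted by increasing radius,
    with rewards decreasing as the radius grows.
    """
    for radius, reward in tiers:
        if distance <= radius:
            return reward
    return 0.0  # outside every tier


# Hypothetical tiers for a GUI click task, with distance in pixels.
TIERS = [(5.0, 1.0), (20.0, 0.5), (50.0, 0.2)]

bullseye = tiered_reward(3.0, TIERS)
near_miss = tiered_reward(10.0, TIERS)
far_miss = tiered_reward(100.0, TIERS)
```

Compared with a binary hit-or-miss reward, the intermediate tiers let two unsuccessful attempts be ranked against each other, which is the differentiated guidance the abstract refers to.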

Action-Sufficient Goal Representations

动作充分的目标表示

Authors: Jinu Hyeon, Woobin Park, Hongjoon Ahn, Taesup Moon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22496
Pdf link: https://arxiv.org/pdf/2601.22496
Abstract Hierarchical policies in offline goal-conditioned reinforcement learning (GCRL) address long-horizon tasks by decomposing control into high-level subgoal planning and low-level action execution. A critical design choice in such architectures is the goal representation, the compressed encoding of goals that serves as the interface between these levels. Existing approaches commonly derive goal representations while learning value functions, implicitly assuming that preserving information sufficient for value estimation is adequate for optimal control. We show that this assumption can fail, even when the value estimation is exact, as such representations may collapse goal states that need to be differentiated for action learning. To address this, we introduce an information-theoretic framework that defines action sufficiency, a condition on goal representations necessary for optimal action selection. We prove that value sufficiency does not imply action sufficiency and empirically verify that the latter is more strongly associated with control success in a discrete environment. We further demonstrate that standard log-loss training of low-level policies naturally induces action-sufficient representations. Our experimental results on a popular benchmark demonstrate that our actor-derived representations consistently outperform representations learned via value estimation.
中文摘要 离线目标条件强化学习（GCRL）中的分层策略通过将控制分解为高层次子目标规划和低层次行动执行，来解决长期任务。在这类架构中，一个关键的设计选择是目标表示——即目标的压缩编码，作为这些层级之间的接口。现有方法通常在学习价值函数时推导目标表示，隐含假设保持足够信息以进行价值估计即可实现最佳控制。我们证明，即使值估计精确，这一假设也可能失效，因为此类表示可能会崩溃需要为行动学习区分的目标状态。为此，我们引入了一个信息论框架，定义了行动充分性，这是目标表示的条件，是最佳动作选择所需的条件。我们证明价值充分性并不意味着行动充分性，并通过实证验证后者在离散环境中与控制成功更为相关。我们进一步证明，低层次策略的标准日志丢失训练自然地诱导了动作充分的表示。我们的实验结果是一项广受欢迎的基准测试，表明我们的演员推导表示始终优于通过价值估计学到的表示。

DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation

DreamVAR：驯服强化视觉自回归模型以实现高保真主体驱动图像生成

Authors: Xin Jiang, Jingwen Chen, Yehao Li, Yingwei Pan, Kezhou Chen, Zechao Li, Ting Yao, Tao Mei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.22507
Pdf link: https://arxiv.org/pdf/2601.22507
Abstract Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, DreamVAR pre-fills the full subject feature sequence prior to predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy in the multi-scale conditioning scenario within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
中文摘要 近年来，利用扩散模型进行主体驱动图像生成的进展因其在生成高质量图像方面的卓越能力而备受关注。尽管视觉自回归（VAR）模型具有统一的架构和高效的推理能力，其潜力仍未被充分开发。在本研究中，我们介绍了DreamVAR，一种基于VAR模型、采用次尺度预测的主体驱动图像合成新框架。从技术上讲，参考对象的多尺度特征首先通过可视化分词器提取。我们的DreamVAR在预测目标图像符号之前，不会将这些条件特征与目标图像符号交错交错，而是预填充完整的受试者特征序列，然后再预测目标图像符号。该设计简化了自回归依赖关系，并减轻了VAR范式中多尺度条件场景中的列车-测试差异。DreamVAR还结合强化学习，共同提升语义对齐和主语一致性。大量实验表明，DreamVAR相比主流基于扩散的方法实现了更优越的外观保存。

Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

模拟世界，真实技能：构建小型智能语言模型，采用合成任务、模拟环境和基于评分标准的奖励

Authors: Yuan-Jay Lü, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22511
Pdf link: https://arxiv.org/pdf/2601.22511
Abstract Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
中文摘要 小型LLM往往难以匹敌大型、昂贵模型的智能体能力。虽然强化学习有所帮助，但进展受限于两个结构性瓶颈：现有开源的智能体训练数据任务种类狭窄且容易解决；现实中的API缺乏多样性，且不够稳定，难以支撑大规模强化学习的rollout过程。我们通过SYNTHAGENT来应对这些挑战，该框架同时合成多样化的工具使用训练数据并模拟完整环境。具体来说，强大的教师模型会创造新的任务和工具生态系统，然后将它们改写成有意欠明确的指令。这迫使智能体主动向用户询问缺失的细节。在处理合成任务时，基于LLM的用户模拟器提供用户私有信息，而模拟工具系统则提供稳定的工具响应。对于奖励，任务级评分标准基于必要的子目标、用户-智能体交互和禁止行为构建。在14个具有挑战性的数学、搜索和工具使用数据集上，基于我们合成数据训练的模型取得了显著提升，小模型的表现优于更大的基线模型。

RoboStriker: Hierarchical Decision-Making for Autonomous Humanoid Boxing

RoboStriker：自主人形拳击的层级决策

Authors: Kangning Yin, Zhe Cao, Wentao Dong, Weishuai Zeng, Tianyi Zhang, Qiang Zhang, Jingbo Wang, Jiangmiao Pang, Ming Zhou, Weinan Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.22517
Pdf link: https://arxiv.org/pdf/2601.22517
Abstract Achieving human-level competitive intelligence and physical agility in humanoid robots remains a major challenge, particularly in contact-rich and highly dynamic tasks such as boxing. While Multi-Agent Reinforcement Learning (MARL) offers a principled framework for strategic interaction, its direct application to humanoid control is hindered by high-dimensional contact dynamics and the absence of strong physical motion priors. We propose RoboStriker, a hierarchical three-stage framework that enables fully autonomous humanoid boxing by decoupling high-level strategic reasoning from low-level physical execution. The framework first learns a comprehensive repertoire of boxing skills by training a single-agent motion tracker on human motion capture data. These skills are subsequently distilled into a structured latent manifold, regularized by projecting the Gaussian-parameterized distribution onto a unit hypersphere. This topological constraint effectively confines exploration to the subspace of physically plausible motions. In the final stage, we introduce Latent-Space Neural Fictitious Self-Play (LS-NFSP), where competing agents learn competitive tactics by interacting within the latent action space rather than the raw motor space, significantly stabilizing multi-agent training. Experimental results demonstrate that RoboStriker achieves superior competitive performance in simulation and exhibits sim-to-real transfer. Our website is available at RoboStriker.
中文摘要 在人形机器人中实现人类水平的竞技智能和身体敏捷性仍是一大挑战，尤其是在拳击这类接触密集且高度动态的任务中。虽然多智能体强化学习（MARL）为策略交互提供了有原则的框架，但其直接应用于人形机器人控制受到高维接触动力学和缺乏强物理运动先验的阻碍。我们提出了RoboStriker，这是一个分层的三阶段框架，通过将高层策略推理与低层物理执行解耦，实现完全自主的人形机器人拳击。该框架首先在人类动作捕捉数据上训练单智能体动作跟踪器，学习一套全面的拳击技能。这些技能随后被蒸馏到一个结构化的潜在流形中，并通过将高斯参数化分布投影到单位超球面上进行正则化。这一拓扑约束有效地将探索限制在物理上合理的运动子空间内。在最后阶段，我们引入了潜在空间神经虚拟自我博弈（LS-NFSP），相互竞争的智能体在潜在动作空间而非原始电机空间中交互来学习对抗战术，显著稳定了多智能体训练。实验结果表明，RoboStriker在仿真中取得了更优的竞技表现，并展现出仿真到现实的迁移能力。我们的网站可通过RoboStriker访问。
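
The RoboStriker abstract mentions regularizing the latent skill space by projecting onto a unit hypersphere. A minimal sketch of that idea follows, assuming the projection is applied to a sampled latent vector by L2 normalization; the function name `project_to_unit_sphere` and the per-sample treatment are illustrative assumptions, not the authors' implementation.

```python
import math

def project_to_unit_sphere(latent, eps=1e-8):
    """L2-normalize a latent vector so it lies on the unit hypersphere.

    Assumption: the projection acts on a sampled latent code; the paper's
    exact treatment of the Gaussian parameters is not given in the abstract.
    """
    norm = math.sqrt(sum(x * x for x in latent))
    if norm < eps:
        raise ValueError("cannot project a (near-)zero vector onto the sphere")
    return [x / norm for x in latent]

# A high-level policy acting in this space can only pick directions,
# never arbitrarily large latent codes.
z = project_to_unit_sphere([3.0, 4.0])
```

Bounding the latent norm this way keeps the high-level action space compact, which is one plausible reading of how exploration stays near motions the low-level tracker has already learned.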

One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry

一环统御全部：通过动态幂均几何统一基于组的强化学习

Authors: Weisong Zhao, Tong Wang, Zichang Tan, Te Yang, Siran Peng, Haoyuan Zhang, Tianshuo Zhang, Haichao Shi, Meng Meng, Yang Yang, Xiangyu Zhu, Zhen Lei, Xiao-Yu Zhang, Xu Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22521
Pdf link: https://arxiv.org/pdf/2601.22521
Abstract Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory clipping fraction to a target ESS. Then, we solve for the specific p to align the trajectory induced ESS with this target one. This allows PMPO to dynamically transition between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
中文摘要 基于组的强化学习已从GRPO的算术平均演变到GMPO的几何平均。虽然GMPO通过约束一个保守目标来提升稳定性，但它与GRPO有一个共同的根本局限：依赖固定的聚合几何，忽视了每条轨迹不断演变且异质的特性。在本研究中，我们将这些方法统一到幂均策略优化（PMPO）之下，这是一个通过幂均几何指数p参数化聚合几何的通用框架。在此框架下，GRPO和GMPO可作为特例恢复。理论上，我们证明调整p可以调节梯度更新的集中度，从而根据token的优势贡献对其重新加权。为了自适应地确定p，我们引入了一种裁剪感知的有效样本量（ESS）机制。具体来说，我们提出了一种确定性规则，将轨迹的裁剪比例映射到目标ESS。然后，我们求解特定的p，使轨迹诱导的ESS与该目标一致。这使得PMPO能够在用于可靠轨迹的激进算术平均和用于不稳定轨迹的保守几何平均之间动态切换。在多个数学推理基准上的实验表明，PMPO优于强基线。
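
The PMPO abstract rests on the fact that the power mean with exponent p recovers the arithmetic mean at p = 1 and the geometric mean as p approaches 0. The sketch below illustrates only that standard identity on a list of positive per-token values; the function name `power_mean` and the sample values are illustrative, and the paper's actual objective and ESS-based rule for choosing p are not reproduced here.

```python
import math

def power_mean(values, p, eps=1e-12):
    """Power mean of strictly positive values.

    p = 1 gives the arithmetic mean; the limit p -> 0 gives the geometric mean.
    """
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    n = len(values)
    if abs(p) < eps:
        # Limit case: exp of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

ratios = [0.5, 1.0, 2.0]                 # hypothetical per-token values
arithmetic = power_mean(ratios, 1.0)     # aggressive end of the family
geometric = power_mean(ratios, 0.0)      # conservative end of the family
in_between = power_mean(ratios, 0.5)     # an intermediate geometry
```

Because the power mean is non-decreasing in p, moving p between 0 and 1 interpolates smoothly between the two aggregation rules, which is the degree of freedom the abstract describes tuning per trajectory.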

Detect and Act: Automated Dynamic Optimizer through Meta-Black-Box Optimization

检测与行动：通过元黑盒优化实现自动化动态优化器

Authors: Zijian Gao, Yuanting Zhong, Zeyuan Ma, Yue-Jiao Gong, Hongshu Guo
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22542
Pdf link: https://arxiv.org/pdf/2601.22542
Abstract Dynamic Optimization Problems (DOPs) are challenging to address due to their complex nature, i.e., dynamic environment variation. Evolutionary Computation methods are generally advantaged in solving DOPs since they resemble dynamic biological evolution. However, existing evolutionary dynamic optimization methods rely heavily on human-crafted adaptive strategy to detect environment variation in DOPs, and then adapt the searching strategy accordingly. These hand-crafted strategies may perform ineffectively at out-of-box scenarios. In this paper, we propose a reinforcement learning-assisted approach to enable automated variation detection and self-adaption in evolutionary algorithms. This is achieved by borrowing the bi-level learning-to-optimize idea from recent Meta-Black-Box Optimization works. We use a deep Q-network as optimization dynamics detector and searching strategy adapter: It is fed as input with current-step optimization state and then dictates desired control parameters to underlying evolutionary algorithms for next-step optimization. The learning objective is to maximize the expected performance gain across a problem distribution. Once trained, our approach could generalize toward unseen DOPs with automated environment variation detection and self-adaption. To facilitate comprehensive validation, we further construct an easy-to-difficult DOPs testbed with diverse synthetic instances. Extensive benchmark results demonstrate flexible searching behavior and superior performance of our approach in solving DOPs, compared to state-of-the-art baselines.
中文摘要 动态优化问题（DOP）因其复杂性质（即动态环境变化）而难以解决。进化计算方法通常在解决DOP时具有优势，因为它们类似于动态的生物进化。然而，现有的进化动态优化方法高度依赖人工设计的自适应策略来检测DOP中的环境变化，并据此调整搜索策略。这些手工设计的策略在超出预设范围的场景中可能效果不佳。本文提出一种强化学习辅助方法，以实现进化算法中的自动变化检测和自适应。这是通过借鉴近期元黑盒优化（Meta-Black-Box Optimization）研究中的双层learning-to-optimize思想实现的。我们使用深度Q网络作为优化动态检测器和搜索策略适配器：它以当前步的优化状态为输入，然后为底层进化算法指定下一步优化所需的控制参数。学习目标是最大化在问题分布上的期望性能增益。一旦训练完成，我们的方法可以泛化到未见过的DOP，并实现自动化的环境变化检测和自适应。为便于全面验证，我们进一步构建了一个由易到难的DOP测试平台，包含多样化的合成实例。大量基准测试结果表明，与最先进的基线相比，我们的方法在解决DOP时具有灵活的搜索行为和更优的性能。

Adapting Reinforcement Learning for Path Planning in Constrained Parking Scenarios

在受限停车场景下调整强化学习以实现路径规划

Authors: Feng Tao, Luca Paparusso, Chenyi Gu, Robin Koehler, Chenxu Wu, Xinyu Huang, Christian Juette, David Paz, Ren Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22545
Pdf link: https://arxiv.org/pdf/2601.22545
Abstract Real-time path planning in constrained environments remains a fundamental challenge for autonomous systems. Traditional classical planners, while effective under perfect perception assumptions, are often sensitive to real-world perception constraints and rely on online search procedures that incur high computational costs. In complex surroundings, this renders real-time deployment prohibitive. To overcome these limitations, we introduce a Deep Reinforcement Learning (DRL) framework for real-time path planning in parking scenarios. In particular, we focus on challenging scenes with tight spaces that require a high number of reversal maneuvers and adjustments. Unlike classical planners, our solution does not require ideal and structured perception, and in principle, could avoid the need for additional modules such as localization and tracking, resulting in a simpler and more practical implementation. Also, at test time, the policy generates actions through a single forward pass at each step, which is lightweight enough for real-time deployment. The task is formulated as a sequential decision-making problem grounded in a bicycle model dynamics, enabling the agent to directly learn navigation policies that respect vehicle kinematics and environmental constraints in the closed-loop setting. A new benchmark is developed to support both training and evaluation, capturing diverse and challenging scenarios. Our approach achieves state-of-the-art success rates and efficiency, surpassing classical planner baselines by +96% in success rate and +52% in efficiency. Furthermore, we release our benchmark as an open-source resource for the community to foster future research in autonomous systems. The benchmark and accompanying tools are available at this https URL.
中文摘要 在受限环境中进行实时路径规划仍然是自主系统面临的根本挑战。传统的经典规划器虽然在完美感知假设下有效，但通常对现实世界的感知约束敏感，并且依赖计算成本高昂的在线搜索过程。在复杂环境中，这使得实时部署难以实现。为克服这些局限，我们引入了一个深度强化学习（DRL）框架，用于泊车场景中的实时路径规划。特别是，我们关注空间狭窄、需要大量倒车操作和调整的挑战性场景。与经典规划器不同，我们的方案不需要理想且结构化的感知，原则上可以免去定位和跟踪等额外模块，从而使实现更简单、更实用。此外，在测试阶段，策略在每一步通过一次前向传播生成动作，足够轻量，可用于实时部署。该任务被表述为基于自行车模型动力学的序贯决策问题，使智能体能够在闭环设置下直接学习符合车辆运动学和环境约束的导航策略。我们开发了一个新的基准来支持训练和评估，涵盖多样且具有挑战性的场景。我们的方法实现了最先进的成功率和效率，成功率比经典规划器基线高出96%，效率高出52%。此外，我们将基准作为开源资源发布给社区，以促进未来的自主系统研究。基准及配套工具可在该 https URL 获取。
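
The parking abstract grounds its decision problem in bicycle model dynamics. A minimal sketch of one Euler step of the standard kinematic bicycle model (rear-axle reference point) is shown below; the wheelbase, time step, and function name `bicycle_step` are illustrative assumptions, and the paper's exact state and action definitions are not stated in the abstract.

```python
import math

def bicycle_step(x, y, heading, speed, steer, wheelbase=2.7, dt=0.1):
    """Advance the kinematic bicycle model by one Euler step.

    x, y      : rear-axle position in metres
    heading   : yaw angle in radians
    speed     : signed longitudinal speed (negative means reversing)
    steer     : front-wheel steering angle in radians
    wheelbase : distance between axles (assumed value)
    """
    x_next = x + speed * math.cos(heading) * dt
    y_next = y + speed * math.sin(heading) * dt
    heading_next = heading + (speed / wheelbase) * math.tan(steer) * dt
    return x_next, y_next, heading_next

# Reverse in a straight line for one second (10 steps of 0.1 s).
state = (0.0, 0.0, 0.0)
for _ in range(10):
    state = bicycle_step(*state, speed=-1.0, steer=0.0)
```

A signed speed lets the same transition cover the reversal maneuvers the abstract highlights, so a policy only has to output speed and steering at each step.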

PersonaAct: Simulating Short-Video Users with Personalized Agents for Counterfactual Filter Bubble Auditing

PersonaAct：用个性化代理模拟短视频用户进行反事实过滤气泡审计

Authors: Shilong Zhao, Qinggang Yang, Zhiyi Yin, Xiaoshi Wang, Zhenxing Chen, Du Su, Xueqi Cheng
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.22547
Pdf link: https://arxiv.org/pdf/2601.22547
Abstract Short-video platforms rely on personalized recommendation, raising concerns about filter bubbles that narrow content exposure. Auditing such phenomena at scale is challenging because real user studies are costly and privacy-sensitive, and existing simulators fail to reproduce realistic behaviors due to their reliance on textual signals and weak personalization. We propose PersonaAct, a framework for simulating short-video users with persona-conditioned multimodal agents trained on real behavioral traces for auditing filter bubbles in breadth and depth. PersonaAct synthesizes interpretable personas through automated interviews combining behavioral analysis with structured questioning, then trains agents on multimodal observations using supervised fine-tuning and reinforcement learning. We deploy trained agents for filter bubble auditing and evaluate bubble breadth via content diversity and bubble depth via escape potential. The evaluation demonstrates substantial improvements in fidelity over generic LLM baselines, enabling realistic behavior reproduction. Results reveal significant content narrowing over interaction. However, we find that Bilibili demonstrates the strongest escape potential. We release the first open multimodal short-video dataset and code to support reproducible auditing of recommender systems.
中文摘要 短视频平台依赖个性化推荐，这引发了人们对过滤气泡收窄内容曝光范围的担忧。大规模审计此类现象具有挑战性，因为真实用户研究成本高昂且涉及隐私，而现有模拟器由于依赖文本信号且个性化较弱，无法重现真实行为。我们提出了PersonaAct，这是一个使用在真实行为轨迹上训练的、以人物画像为条件的多模态智能体来模拟短视频用户的框架，用于在广度和深度上审计过滤气泡。PersonaAct通过结合行为分析与结构化提问的自动化访谈合成可解释的人物画像，然后使用监督微调和强化学习在多模态观测上训练智能体。我们部署训练好的智能体进行过滤气泡审计，并通过内容多样性评估气泡广度，通过逃逸潜力评估气泡深度。评估显示，相较于通用LLM基线，保真度显著提升，实现了真实的行为再现。结果显示内容随交互进行显著收窄。不过，我们发现Bilibili展现出最强的逃逸潜力。我们发布了首个开放的多模态短视频数据集和代码，以支持推荐系统的可复现审计。

Exo-Plore: Exploring Exoskeleton Control Space through Human-aligned Simulation

Exo-Plore：通过人类对齐模拟探索外骨骼控制空间

Authors: Geonho Leem, Jaedong Lee, Jehee Lee, Seungmoon Song, Jungdam Won
Subjects: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22550
Pdf link: https://arxiv.org/pdf/2601.22550
Abstract Exoskeletons show great promise for enhancing mobility, but providing appropriate assistance remains challenging due to the complexity of human adaptation to external forces. Current state-of-the-art approaches for optimizing exoskeleton controllers require extensive human experiments in which participants must walk for hours, creating a paradox: those who could benefit most from exoskeleton assistance, such as individuals with mobility impairments, are rarely able to participate in such demanding procedures. We present Exo-plore, a simulation framework that combines neuromechanical simulation with deep reinforcement learning to optimize hip exoskeleton assistance without requiring real human experiments. Exo-plore can (1) generate realistic gait data that captures human adaptation to assistive forces, (2) produce reliable optimization results despite the stochastic nature of human gait, and (3) generalize to pathological gaits, showing strong linear relationships between pathology severity and optimal assistance.
中文摘要 外骨骼在增强行动能力方面展现出巨大潜力，但由于人体对外力的适应十分复杂，提供恰当的辅助仍具挑战性。目前优化外骨骼控制器的最先进方法需要大量人体实验，参与者必须步行数小时，这造成了一个悖论：那些最能从外骨骼辅助中受益的人，如行动障碍者，很少能参与如此高强度的实验流程。我们提出了Exo-plore，一种结合神经力学仿真与深度强化学习的仿真框架，无需真实人体实验即可优化髋部外骨骼辅助。Exo-plore可以（1）生成能够捕捉人体对辅助力适应的真实步态数据，（2）尽管人类步态具有随机性，仍能产生可靠的优化结果，（3）泛化到病理步态，并显示病理严重程度与最佳辅助之间存在强线性关系。

Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

以更少的资源了解更多：RLVR的不确定性一致性引导查询选择

Authors: Hao Yi, Yulan Hu, Xin Li, Sheng Ouyang, Lizhong Ding, Yong Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22595
Pdf link: https://arxiv.org/pdf/2601.22595
Abstract Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring objective uncertainty when only selecting by subjective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks.
中文摘要 大型语言模型（LLMs）最近通过可验证奖励强化学习（RLVR）提升了数学推理能力。然而，现有的RLVR算法需要庞大的查询预算，导致注释成本较高。我们研究了更少但更具信息量的查询是否能带来类似或更优的性能，并将主动学习（AL）引入RLVR。我们发现，经典的AL抽样策略在这种情境下无法超越随机选择，因为在仅通过主观不确定性进行选择时忽略了客观不确定性。本研究提出了一个不确定性一致性指标，以评估主观不确定性与客观不确定性的契合度。在离线环境中，这种比对通过点-双序列相关系数（PBC）进行测量。对于在线训练来说，由于抽样有限且输出分布动态变化，PBC估计较为困难。因此，我们引入一种新的在线变体，基于归一化优势和主观不确定性计算。理论上，我们证明在线变体与离线PBC严格负相关，支持更好的样本选择。实验显示，我们的方法持续优于随机和经典AL基线，在仅30%的数据上训练即可实现全数据集性能，有效降低了推理任务中RLVR的成本。
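
The RLVR query-selection abstract measures offline uncertainty consistency with the Point-Biserial Correlation Coefficient. A minimal sketch of that standard statistic follows, assuming the binary variable is per-response correctness and the continuous variable is a subjective uncertainty score; that pairing, the function name `point_biserial`, and the sample data are illustrative assumptions, and the paper's online variant is not reproduced.

```python
import math

def point_biserial(binary, scores):
    """Point-biserial correlation between a 0/1 variable and a continuous one.

    Uses r = (M1 - M0) / s * sqrt(n1 * n0 / n^2), where s is the population
    standard deviation of all scores.
    """
    n = len(binary)
    ones = [s for b, s in zip(binary, scores) if b == 1]
    zeros = [s for b, s in zip(binary, scores) if b == 0]
    if not ones or not zeros:
        raise ValueError("both groups must be non-empty")
    mean_all = sum(scores) / n
    std = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std == 0:
        raise ValueError("scores have zero variance")
    mean_one = sum(ones) / len(ones)
    mean_zero = sum(zeros) / len(zeros)
    return (mean_one - mean_zero) / std * math.sqrt(len(ones) * len(zeros) / n ** 2)

# Hypothetical rollouts for one query: 1 = verified correct, 0 = incorrect.
correct = [1, 1, 0, 0]
uncertainty = [0.1, 0.2, 0.8, 0.9]
r = point_biserial(correct, uncertainty)
```

In this toy data the model is less uncertain on its correct responses, so the coefficient is strongly negative; a value near zero would mean the model's self-reported uncertainty carries little information about actual correctness.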

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

从自我演化的合成数据到可验证奖励的强化学习：训练后多回合交互工具使用代理

Authors: Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, Di Jin, Yi Wu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22607
Pdf link: https://arxiv.org/pdf/2601.22607
Abstract Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
中文摘要 交互式工具使用智能体必须通过与人类和外部环境的多轮交互来解决现实任务，这需要对话状态跟踪、多步工具执行，同时遵循复杂的指令。对此类智能体进行后训练具有挑战性，因为高质量多轮工具使用数据的合成难以规模化，且强化学习（RL）可能面临用户模拟带来的噪声信号，导致训练效率下降。我们提出了一个统一框架，将自我演化的数据智能体与基于验证器的强化学习相结合。我们的系统EigenData是一个分层多智能体引擎，它合成基于工具的对话以及可执行的逐实例检查器，并通过更新提示词和工作流的闭环自我演化过程提升生成可靠性。基于合成数据，我们开发了一套强化学习方案，先微调用户模型，然后应用带有轨迹级组相对优势和动态过滤的GRPO式训练，带来超越SFT的稳定提升。在tau^2-bench上的评估中，我们的最佳模型在Airline上达到73.0%的pass^1，在Telecom上达到98.3%的pass^1，达到或超过前沿模型。总体而言，我们的结果为在无需昂贵人工标注的情况下引导出复杂的工具使用行为提供了一条可扩展的路径。

COBRA++: Enhanced COBRA Optimizer with Augmented Surrogate Pool and Reinforced Surrogate Selection

COBRA++：增强型COBRA优化器，配备增强替代池和强化代理选择

Authors: Zepei Yu, Zhiyang Huang, Hongshu Guo, Yue-Jiao Gong, Zeyuan Ma
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2601.22624
Pdf link: https://arxiv.org/pdf/2601.22624
Abstract Real-world optimization problems pose significant challenges to optimization algorithms, such as expensive evaluations and complex constraints. The COBRA optimizer (including its up-to-date variants) is a representative and effective tool for such problems; it introduces 1) an RBF surrogate to reduce online evaluations and 2) a bi-stage optimization process that alternates between searching for a feasible solution and searching for an optimal one. Though promising, its design space, i.e., the surrogate model pool and selection standard, is still decided manually by human experts, resulting in labor-intensive fine-tuning for novel tasks. In this paper, we propose a learning-based adaptive strategy (COBRA++) that enhances COBRA in two aspects: 1) an augmented surrogate pool that breaks the tie with RBF-like surrogates and hence enhances model diversity and approximation capability; 2) a reinforcement learning-based online model selection policy that enables an efficient and accurate optimization process. The model selection policy is trained to maximize the overall performance of COBRA++ across a distribution of constrained optimization problems with diverse properties. We have conducted multi-dimensional validation experiments and demonstrate that COBRA++ achieves substantial performance improvement over vanilla COBRA and its adaptive variant. Ablation studies are provided to support the correctness of each design component in COBRA++.
中文摘要 现实世界中的优化问题对优化算法构成了重大挑战，如昂贵的评估问题和复杂的约束条件。COBRA优化器（包括其最新变体）是一个代表性且有效的工具，用于解决此类优化问题，它引入了1）RBF替代以减少在线评估，2）双阶段优化过程，交替寻找可行解和最优解。尽管前景看好，其设计空间，即替代模型池和选择标准，仍由人类专家手动决定，导致为新颖任务进行劳动密集型微调。本文提出了一种基于学习的自适应策略（COBRA++），在两个方面增强COBRA的使用：1）增强替代池，打破与类似RBF代理的联系，从而提升模型多样性和近似能力;2）基于强化学习的在线模型选择策略，赋能高效且准确的优化过程。模型选择策略训练以最大化COBRA++在具有不同性质的受约束优化问题分布中的整体性能。我们进行了多维验证实验，证明COBRA++在性能上相比原生COBRA及其自适应变体实现了显著提升。为支持COBRA++中每个设计组件的正确性，还提供了消融研究。

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

通过平均延续对数概率评估和奖励表达性角色扮演TTS的LALMs

Authors: Yong Ren, Jingbei Li, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2601.22661
Pdf link: https://arxiv.org/pdf/2601.22661
Abstract Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and Role-Play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
中文摘要 大型音频语言模型（LALMs）的最新进展将文本转语音（TTS）扩展到交互式角色扮演场景，这些场景要求高度的表现力和对角色扮演指令的严格遵循。然而，现有模型难以在多轮对话中保持与角色设定和场景描述一致的风格。一个关键瓶颈是缺乏量化说话风格的客观指标。为弥合这一差距，我们提出了平均续写对数概率（MCLP），既作为评估指标又作为奖励信号，并在基于LALM的角色扮演TTS（RP-TTS）任务上得到验证。关键在于，我们利用预训练LALMs的上下文学习能力，通过续写对数概率预测来构建MCLP。该指标通过衡量以生成语音为条件的真实语音的似然，量化风格一致性。此外，我们将MCLP用作强化学习奖励，以增强生成语音与角色扮演指令之间的风格对齐。为便于评估，我们构建了一个包含丰富场景和角色标注的RP-TTS数据集。实验结果表明，我们的方法在客观和主观指标上均显著优于强LALM基线。

Real-Time Aligned Reward Model beyond Semantics

超越语义的实时对齐奖励模型

Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22664
Pdf link: https://arxiv.org/pdf/2601.22664
Abstract Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model and exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily rely on surface semantic information and fail to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.
中文摘要 基于人类反馈的强化学习（RLHF）是将大型语言模型（LLMs）与人类偏好对齐的关键技术，但它容易出现奖励过度优化，即策略模型过度拟合奖励模型，利用虚假的奖励模式，而非忠实捕捉人类意图。以往的缓解措施主要依赖表层语义信息，未能有效解决由持续的策略分布偏移导致的奖励模型（RM）与策略模型之间的错位。这不可避免地导致奖励偏差不断增大，加剧了奖励过度优化。为解决这些局限，我们引入了R2M（实时对齐奖励模型），这是一种新型轻量级RLHF框架。R2M超越了仅依赖预训练LLM语义表示的普通奖励模型。相反，它利用策略不断演变的隐藏状态（即策略反馈），在强化学习过程中与策略的实时分布偏移保持对齐。这项工作为通过实时利用策略模型的反馈来提升奖励模型性能指出了一个有前景的新方向。

A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

退一步：前缀重要性比稳定策略优化

Authors: Shiye Lei, Zhihao Cheng, Dacheng Tao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22718
Pdf link: https://arxiv.org/pdf/2601.22718
Abstract Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
中文摘要 强化学习（RL）后训练日益展现出在大型语言模型（LLMs）中激发推理行为的强大能力。为了提高训练效率，rollout通常以离策略（off-policy）方式由较旧的采样策略生成，然后用于更新当前的目标策略。为了纠正由此产生的采样策略与目标策略之间的差异，大多数现有的强化学习目标依赖于token级重要性采样比率，主要是因为其计算简单且数值稳定。然而，我们观察到，当离策略程度较大时，token级校正常常导致训练动态不稳定。本文重新审视了离策略条件下的LLM策略优化，并表明理论上严谨的校正项是前缀重要性比率，而将其放宽为token级近似可能会在RL后训练中引发不稳定。为了在较大的离策略漂移下稳定LLM优化，我们提出了一个简单而有效的目标：最小前缀比率（MinPRO）。MinPRO用一个基于此前前缀中观察到的最小token级比率的非累积替代项，取代不稳定的累积前缀比率。在多个数学推理基准上，对稠密型和专家混合型LLM进行的大量实验表明，MinPRO在离策略情形下显著提升了训练稳定性和峰值性能。
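
The MinPRO abstract contrasts three correction terms: the token-level importance ratio, the cumulative prefix ratio, and a non-cumulative surrogate built from the minimum token-level ratio in the prefix. The sketch below computes all three from per-token log-probabilities. Whether the prefix minimum includes the current token is not stated in the abstract, so including it here is an assumption, and the helper names are hypothetical.

```python
import math

def token_ratios(logp_target, logp_sampler):
    """Per-token importance ratios pi_target / pi_sampler."""
    return [math.exp(t - s) for t, s in zip(logp_target, logp_sampler)]

def prefix_ratios(ratios):
    """Cumulative product of token ratios up to each position."""
    out, running = [], 1.0
    for r in ratios:
        running *= r
        out.append(running)
    return out

def min_prefix_ratios(ratios):
    """Running minimum of token ratios (assumed to include the current token)."""
    out, running = [], math.inf
    for r in ratios:
        running = min(running, r)
        out.append(running)
    return out

# Hypothetical token ratios for a four-token response.
ratios = [1.2, 0.5, 1.5, 1.5]
cumulative = prefix_ratios(ratios)      # compounds multiplicatively with length
surrogate = min_prefix_ratios(ratios)   # bounded by the smallest ratio seen so far
```

The cumulative product can grow or shrink geometrically with sequence length, while the running minimum never exceeds any single token ratio, which gives some intuition for why a non-cumulative surrogate might be easier to keep stable.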

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

TSPO：打破多回合搜索策略优化中的双重同化困境

Authors: Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22776
Pdf link: https://arxiv.org/pdf/2601.22776
Abstract Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) Process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored. (2) Intra-group homogenization, coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
中文摘要 多回合工具集成推理使大型语言模型（LLMs）能够通过迭代信息检索解决复杂任务。然而，当前用于搜索增强推理的强化学习（RL）框架主要依赖于结果层面的稀疏奖励，导致了“双重同质化困境”。这表现为（1）过程同质化，即忽视生成过程中的思维、推理和工具。（2）组内同质化、粗粒度结果奖励常导致组内优势估计效率低下，采用组相对策略优化（GRPO）等方法。为此，我们提出了回合级阶段感知策略优化（TSPO）。TSPO引入了首次出现潜在奖励（FOLR）机制，将部分奖励分配给首次出现真实答案的步骤，从而保留过程级信号并增加组内奖励方差，而无需外部奖励模型或任何注释。大量实验表明，TSPO在Qwen2.5-3B和7B模型上分别实现了24%和13.6%的平均性能提升，优于最先进的基线模型。
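
The TSPO abstract describes a First-Occurrence Latent Reward that gives partial credit to the step where the ground-truth answer first appears. A minimal sketch of that allocation follows, assuming case-insensitive substring matching, a fixed partial reward of 0.5, and an outcome reward of 1.0 on the final turn; all three choices and the function name are illustrative assumptions rather than the paper's settings.

```python
def first_occurrence_rewards(turn_texts, answer, final_correct,
                             partial=0.5, outcome=1.0):
    """Assign a partial reward to the first turn containing the answer.

    turn_texts    : text observed at each turn (e.g. retrieved passages)
    answer        : ground-truth answer string
    final_correct : whether the final response was judged correct
    """
    rewards = [0.0] * len(turn_texts)
    needle = answer.lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            rewards[i] += partial   # credit only the first occurrence
            break
    if final_correct and rewards:
        rewards[-1] += outcome      # sparse outcome-level reward
    return rewards

turns = [
    "search: capital of France",
    "result: Paris is the capital of France",
    "answer: Paris",
]
rewards = first_occurrence_rewards(turns, "Paris", final_correct=True)
```

Two rollouts with the same final outcome can now receive different reward vectors depending on when the answer surfaced, which is how a scheme like this would raise within-group reward variance without an external reward model.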

Clipping-Free Policy Optimization for Large Language Models

大型语言模型的无裁剪策略优化

Authors: Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, Xuandong Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22801
Pdf link: https://arxiv.org/pdf/2601.22801
Abstract Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
中文摘要 强化学习已成为大型语言模型后训练的核心，但主流算法依赖裁剪机制，而这类机制在大规模训练时会带来优化问题，包括零梯度区域、奖励投机（reward hacking）和训练不稳定。我们提出了无裁剪策略优化（CFPO），它用由全变差散度约束推导出的凸二次惩罚取代启发式裁剪，得到一个处处可微的目标，在没有硬边界的情况下保证稳定的策略更新。我们在推理和对齐两种设置下评估CFPO。在推理方面，CFPO在下游基准上与基于裁剪的方法表现相当，同时扩大了稳定训练的范围。在对齐方面，CFPO缓解了对冗长输出的利用并减少了能力退化，同时取得了有竞争力的指令遵循性能。CFPO只需改动一行代码，无需额外的超参数。我们的结果表明，CFPO是LLM后训练中可直接替换基于裁剪方法的一种有前景的选择。

CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

CVeDRL：通过难度感知强化学习实现高效的代码验证器

Authors: Ji Shi, Peiming Guo, Meishan Zhang, Miao Zhang, Xuebo Liu, Min Zhang, Weili Guan
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.22803
Pdf link: https://arxiv.org/pdf/2601.22803
Abstract Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first show theoretically that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, where optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty-aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over 20× faster inference than competitive baselines. Code is available at this https URL
中文摘要 代码验证器在基于LLM的代码生成的后验证中起着关键作用，但现有的监督微调方法存在数据稀缺、失败率高和推理效率低的问题。虽然强化学习（RL）通过执行驱动的奖励在无需标注监督的情况下优化模型，提供了有前景的替代方案，但我们的初步结果显示，仅使用功能性奖励的朴素强化学习无法为困难的分支和样本生成有效的单元测试。我们首先从理论上分析并表明，分支覆盖率、样本难度、语法和功能正确性可以联合建模为强化学习奖励，优化这些信号可以提升基于单元测试的验证的可靠性。在该分析的指导下，我们设计了语法和功能感知的奖励，并进一步提出了利用指数奖励塑形和静态分析指标实现的分支和样本难度感知强化学习。通过这种形式化，CVeDRL仅用0.6B参数就实现了最先进的性能，通过率比GPT-3.5最多高出28.97%，分支覆盖率最多高出15.08%，同时推理速度比有竞争力的基线快20倍以上。代码可在此 https URL 获取

Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

在稳健风格对齐下高质量行为的离线强化学习

Authors: Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.22823
Pdf link: https://arxiv.org/pdf/2601.22823
Abstract We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned RL techniques, such as hindsight relabeling and value learning, and combines them with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods. Code, datasets and visuals are available at this https URL.
中文摘要 我们通过子轨迹标记函数，利用显式风格监督，研究了风格条件策略的离线强化学习。在这种环境下，将风格与高任务绩效对齐尤其具有挑战性，因为分布转移以及风格与奖励之间的固有冲突。现有方法尽管引入了许多风格定义，但往往未能有效调和这些目标。为应对这些挑战，我们提出了行为风格的统一定义，并将其实例化为实用框架。在此基础上，我们引入了风格条件隐含Q学习（SCIQL），它利用离线目标条件强化学习技术，如事后重标和价值学习，并结合新的门控优势加权回归机制，高效优化任务表现，同时保持风格对齐。实验表明，SCIQL在这两个目标上都优于以往的离线方法。代码、数据集和可视化内容可在此 https URL 中获取。

Robust Rigid Body Assembly via Contact-Implicit Optimal Control with Exact Second-Order Derivatives

通过接触隐式最优控制实现的稳健刚体组装，具有精确二阶导数

Authors: Christian Dietz, Sebastian Albrecht, Gianluca Frison, Moritz Diehl, Armin Nurkanović
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.22849
Pdf link: https://arxiv.org/pdf/2601.22849
Abstract Efficient planning of assembly motions is a long-standing challenge in the field of robotics that has been primarily tackled with reinforcement learning and sampling-based methods by using extensive physics simulations. This paper proposes a sample-efficient robust optimal control approach for the determination of assembly motions, which requires significantly fewer physics simulation steps during planning through the efficient use of derivative information. To this end, a differentiable physics simulation is constructed that provides second-order analytic derivatives to the numerical solver and allows one to traverse seamlessly from informative derivatives to accurate contact simulation. The solution of the physics simulation problem is made differentiable by using smoothing inspired by interior-point methods applied to both the collision detection and the contact resolution problems. We propose a modified variant of an optimization-based formulation of collision detection formulated as a linear program and present an efficient implementation for the nominal evaluation and corresponding first- and second-order derivatives. Moreover, a multi-scenario-based trajectory optimization problem that ensures robustness with respect to sim-to-real mismatches is derived. The capability of the considered formulation is illustrated by results where over 99% successful executions are achieved in real-world experiments. We also carefully investigate the effect of smooth approximations of the contact dynamics and robust modeling on the success rates. Furthermore, the method's capability is tested on different peg-in-hole problems in simulation to show the benefit of using exact Hessians over commonly used Hessian approximations.
中文摘要 装配运动的高效规划是机器人领域长期存在的挑战，此前主要通过强化学习和基于采样的方法，借助大量物理仿真来解决。本文提出了一种样本高效的鲁棒最优控制方法来确定装配运动，通过高效利用导数信息，显著减少了规划过程中所需的物理仿真步数。为此，我们构建了一个可微物理仿真，它为数值求解器提供二阶解析导数，并允许从信息丰富的导数无缝过渡到精确的接触仿真。通过对碰撞检测和接触求解问题同时应用受内点法启发的平滑化，使物理仿真问题的解可微。我们提出了一种表述为线性规划的、基于优化的碰撞检测方法的改进变体，并给出了其标称求值及相应一阶和二阶导数的高效实现。此外，还推导了一个基于多场景的轨迹优化问题，以确保对仿真与现实之间失配的鲁棒性。真实世界实验中超过99%的执行成功率展示了所考虑表述的能力。我们还仔细研究了接触动力学的平滑近似和鲁棒建模对成功率的影响。此外，该方法的能力还在仿真中的不同轴孔装配（peg-in-hole）问题上进行了测试，展示了使用精确海森矩阵相较于常用海森矩阵近似的优势。
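As a rough illustration of the interior-point-inspired smoothing mentioned in the abstract, the generic textbook relaxation of a contact complementarity condition is shown below. The abstract does not give the paper's equations, so the symbols are assumptions: $\lambda$ is a contact force, $\phi(q)$ a signed gap function of the configuration $q$, and $\mu$ a smoothing (barrier) parameter.

```latex
0 \le \lambda \;\perp\; \phi(q) \ge 0
\quad\longrightarrow\quad
\lambda\,\phi(q) = \mu, \qquad \lambda > 0,\ \phi(q) > 0,\ \mu > 0 .
```

For fixed $\mu > 0$ the force $\lambda = \mu / \phi(q)$ is a smooth function of the gap, which is what makes derivatives informative; letting $\mu \to 0$ recovers the exact, non-smooth contact condition.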

Degradation-Aware Frequency Regulation of a Heterogeneous Battery Fleet via Reinforcement Learning

通过强化学习对异构电池集群进行考虑老化的频率调节

Authors: Tanay Raghunandan Srinivasa, Vivek Deulkar, Jia Bhargava, Mohammad Hajiesmaili, Prashant Shenoy
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22865
Pdf link: https://arxiv.org/pdf/2601.22865
Abstract Battery energy storage systems are increasingly deployed as fast-responding resources for grid balancing services such as frequency regulation and for mitigating renewable generation uncertainty. However, repeated charging and discharging induces cycling degradation and reduces battery lifetime. This paper studies the real-time scheduling of a heterogeneous battery fleet that collectively tracks a stochastic balancing signal subject to per-battery ramp-rate and capacity constraints, while minimizing long-term cycling degradation. Cycling degradation is fundamentally path-dependent: it is determined by charge-discharge cycles formed by the state-of-charge (SoC) trajectory and is commonly quantified via rainflow cycle counting. This non-Markovian structure makes it difficult to express degradation as an additive per-time-step cost, complicating classical dynamic programming approaches. We address this challenge by formulating the fleet scheduling problem as a Markov decision process (MDP) with constrained action space and designing a dense proxy reward that provides informative feedback at each time step while remaining aligned with long-term cycle-depth reduction. To scale learning to large state-action spaces induced by fine-grained SoC discretization and asymmetric per-battery constraints, we develop a function-approximation reinforcement learning method using an Extreme Learning Machine (ELM) as a random nonlinear feature map combined with linear temporal-difference learning. We evaluate the proposed approach on a toy Markovian signal model and on a Markovian model trained from real-world regulation signal traces obtained from the University of Delaware, and demonstrate consistent reductions in cycle-depth occurrence and degradation metrics compared to baseline scheduling policies.
中文摘要 电池储能系统正日益被部署为快速响应资源，用于频率调节等电网平衡服务，并用于缓解可再生能源发电的不确定性。然而，反复充放电会导致循环老化，缩短电池寿命。本文研究异构电池集群的实时调度：该集群在每块电池的爬坡速率和容量约束下共同跟踪随机平衡信号，同时最小化长期循环老化。循环老化本质上依赖于路径：它由荷电状态（SoC）轨迹形成的充放电循环决定，通常通过雨流循环计数来量化。这种非马尔可夫结构使得难以将老化表示为按时间步累加的成本，从而使经典动态规划方法变得复杂。为应对这一挑战，我们将集群调度问题建模为具有受约束动作空间的马尔可夫决策过程（MDP），并设计了一个稠密的代理奖励，它在每个时间步提供有信息量的反馈，同时与降低长期循环深度的目标保持一致。为了将学习扩展到由细粒度SoC离散化和非对称的单电池约束所导致的大型状态-动作空间，我们开发了一种函数近似强化学习方法，利用极限学习机（ELM）作为随机非线性特征映射，并结合线性时序差分学习。我们在一个玩具马尔可夫信号模型以及一个由特拉华大学真实调频信号轨迹训练得到的马尔可夫模型上评估了该方法，结果表明，与基线调度策略相比，循环深度出现次数和老化指标均持续降低。
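The abstract pairs an Extreme Learning Machine (a fixed random nonlinear feature map) with linear temporal-difference learning. The sketch below shows that combination in its simplest form, a state-value TD(0) update; the state layout, layer size, `tanh` activation, and step sizes are assumptions, and the paper's constrained action space and proxy reward are not modelled here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ELMFeatures:
    """Random hidden layer whose weights are drawn once and never trained."""

    def __init__(self, in_dim, hidden, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                        for _ in range(hidden)]
        self.biases = [rng.gauss(0.0, 1.0) for _ in range(hidden)]

    def __call__(self, state):
        return [math.tanh(dot(row, state) + b)
                for row, b in zip(self.weights, self.biases)]


def td0_update(theta, phi, reward, phi_next, gamma=0.95, lr=0.05):
    """One linear TD(0) step on the output weights theta only."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    new_theta = [t + lr * delta * p for t, p in zip(theta, phi)]
    return new_theta, delta


# Hypothetical state: SoC of two batteries plus the current regulation signal.
features = ELMFeatures(in_dim=3, hidden=8, seed=1)
theta = [0.0] * 8
phi = features([0.4, 0.7, -0.2])
phi_next = features([0.5, 0.6, 0.1])
theta, td_error = td0_update(theta, phi, reward=-0.3, phi_next=phi_next)
print(td_error)
```

Because only `theta` is learned, each update is a cheap linear step even when the SoC discretisation makes the underlying state space large.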

Reinforcement Learning-Based Co-Design and Operation of Chiller and Thermal Energy Storage for Cost-Optimal HVAC Systems

基于强化学习的冷水机组和热能储存的协同设计与运行，以实现成本效益最高的暖通空调系统

Authors: Tanay Raghunandan Srinivasa, Vivek Deulkar, Aviruch Bhatia, Vishal Garg
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22880
Pdf link: https://arxiv.org/pdf/2601.22880
Abstract We study the joint operation and sizing of cooling infrastructure for commercial HVAC systems using reinforcement learning, with the objective of minimizing life-cycle cost over a 30-year horizon. The cooling system consists of a fixed-capacity electric chiller and a thermal energy storage (TES) unit, jointly operated to meet stochastic hourly cooling demands under time-varying electricity prices. The life-cycle cost accounts for both capital expenditure and discounted operating cost, including electricity consumption and maintenance. A key challenge arises from the strong asymmetry in capital costs: increasing chiller capacity by one unit is far more expensive than an equivalent increase in TES capacity. As a result, identifying the right combination of chiller and TES sizes, while ensuring zero loss-of-cooling-load under optimal operation, is a non-trivial co-design problem. To address this, we formulate the chiller operation problem for a fixed infrastructure configuration as a finite-horizon Markov Decision Process (MDP), in which the control action is the chiller part-load ratio (PLR). The MDP is solved using a Deep Q Network (DQN) with a constrained action space. The learned DQN RL policy minimizes electricity cost over historical traces of cooling demand and electricity prices. For each candidate chiller-TES sizing configuration, the trained policy is evaluated. We then restrict attention to configurations that fully satisfy the cooling demand and perform a life-cycle cost minimization over this feasible set to identify the cost-optimal infrastructure design. Using this approach, we determine the optimal chiller and thermal energy storage capacities to be 700 and 1500, respectively.
中文摘要 我们利用强化学习研究商用暖通空调系统冷却基础设施的联合运行与容量配置，目标是最小化30年期内的生命周期成本。冷却系统由固定容量的电冷水机组和热能储存（TES）单元组成，二者联合运行，以在时变电价下满足随机的逐小时冷却需求。生命周期成本同时计入资本支出和折现后的运营成本，后者包括电力消耗和维护。一个关键挑战来自资本成本的强烈不对称性：冷水机组容量每增加一个单位，其成本远高于同等增加TES容量。因此，在确保最优运行下冷负荷零损失的同时，确定合适的冷水机组与TES容量组合，是一个非平凡的协同设计问题。为此，我们将固定基础设施配置下的冷水机组运行问题建模为有限时域马尔可夫决策过程（MDP），其中控制动作为冷水机组的部分负荷率（PLR）。该MDP通过具有受约束动作空间的深度Q网络（DQN）求解。学到的DQN强化学习策略在冷却需求和电价的历史轨迹上最小化电力成本。对于每个候选的冷水机组-TES容量配置，都会评估训练好的策略。然后，我们只考虑完全满足冷却需求的配置，并在该可行集合上进行生命周期成本最小化，以确定成本最优的基础设施设计。通过这种方法，我们确定最优的冷水机组和热能储存容量分别为700和1500。
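The outer sizing step described in the abstract (discard configurations with unmet cooling load, then minimise capital cost plus discounted operating cost) can be sketched as follows. All numbers, unit costs, the discount rate, and the dictionary keys are hypothetical; in the paper the annual operating cost would come from evaluating the trained DQN policy, which is simply supplied as an input here.

```python
def life_cycle_cost(capex, annual_opex, discount_rate=0.05, years=30):
    """Capital expenditure plus operating cost discounted over the horizon."""
    discounted = sum(annual_opex / (1.0 + discount_rate) ** t
                     for t in range(1, years + 1))
    return capex + discounted


def cost_optimal_design(candidates, chiller_unit_cost, tes_unit_cost,
                        discount_rate=0.05):
    """Return (design, cost) with the lowest life-cycle cost among feasible designs."""
    best = None
    for c in candidates:
        if c["unmet_load"] > 0:  # infeasible: some cooling demand was not met
            continue
        capex = c["chiller"] * chiller_unit_cost + c["tes"] * tes_unit_cost
        cost = life_cycle_cost(capex, c["annual_opex"], discount_rate)
        if best is None or cost < best[1]:
            best = (c, cost)
    return best


# Made-up candidate configurations for illustration only.
candidates = [
    {"chiller": 400, "tes": 800, "annual_opex": 50.0, "unmet_load": 3.0},
    {"chiller": 600, "tes": 1200, "annual_opex": 60.0, "unmet_load": 0.0},
    {"chiller": 800, "tes": 400, "annual_opex": 55.0, "unmet_load": 0.0},
]
print(cost_optimal_design(candidates, chiller_unit_cost=10.0, tes_unit_cost=1.0))
```

With a chiller unit cost ten times the TES unit cost, the search favours the design that shifts capacity into storage, which mirrors the capital-cost asymmetry the abstract highlights.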

PlatoLTL: Learning to Generalize Across Symbols in LTL Instructions for Multi-Task RL

PlatoLTL：面向多任务强化学习，学习在LTL指令中跨符号泛化

Authors: Jacques Cloete, Mathias Jackermeier, Ioannis Havoutis, Alessandro Abate
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22891
Pdf link: https://arxiv.org/pdf/2601.22891
Abstract A central challenge in multi-task reinforcement learning (RL) is to train generalist policies capable of performing tasks not seen during training. To facilitate such generalization, linear temporal logic (LTL) has recently emerged as a powerful formalism for specifying structured, temporally extended tasks to RL agents. While existing approaches to LTL-guided multi-task RL demonstrate successful generalization across LTL specifications, they are unable to generalize to unseen vocabularies of propositions (or "symbols"), which describe high-level events in LTL. We present PlatoLTL, a novel approach that enables policies to zero-shot generalize not only compositionally across LTL formula structures, but also parametrically across propositions. We achieve this by treating propositions as instances of parameterized predicates rather than discrete symbols, allowing policies to learn shared structure across related propositions. We propose a novel architecture that embeds and composes predicates to represent LTL specifications, and demonstrate successful zero-shot generalization to novel propositions and tasks across challenging environments.
中文摘要 多任务强化学习（RL）中一个核心挑战是训练能够执行培训中未见任务的通用策略。为了促进这种推广，线性时间逻辑（LTL）最近成为一种强大的形式主义，用于为强化学习代理指定结构化、时间扩展的任务。虽然现有的LTL引导多任务强化学习方法在LTL规范中成功泛化，但它们无法推广到描述LTL中高层事件的未见命题（或“符号”）词汇。我们提出了PlatoLTL，一种新颖的方法，使策略不仅能在LTL公式结构上进行零样本推广，还能在命题中参数化。我们通过将命题视为参数化谓词的实例而非离散符号来实现这一点，使策略能够学习跨相关命题的共享结构。我们提出了一种新颖架构，能够嵌入并组合谓词以表示LTL规范，并成功地将零样本推广应用于挑战性环境中的新命题和任务。

MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

MulFeRL：在多回合循环中通过语言反馈增强强化学习

Authors: Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22900
Pdf link: https://arxiv.org/pdf/2601.22900
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure and provide no insight into why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model's reasoning process. Trained on sampled OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
中文摘要 带可验证奖励的强化学习（RLVR）被广泛用于多个领域提升推理能力，但仅结果的标量奖励往往稀少且信息不足，尤其是在失败样本中，它们仅表示失败，无法解释推理失败的原因。本文探讨如何利用更丰富的语言反馈来指导对失败样本进行RLVR训练，以及如何将此类反馈转化为可训练的学习信号。具体来说，我们提出了一种多回合反馈引导的强化学习框架。它基于三种机制：（1）由反馈引导的动态多回合再生，仅在失败样本时触发;（2）两种互补的学习信号用于轮内和跨轮优化;（3）向模型推理过程注入结构化反馈。该方法基于采样的OpenR1-Math进行训练，在领域内优于监督微调和RLVR基线，并在域外推广效果良好。

MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving

MTDrive：面向自动驾驶的多回合交互式强化学习

Authors: Xidong Li, Mingyu Guo, Chenchao Xu, Bailin Li, Wenjing Zhu, Yangang Zou, Rui Chen, Zehuan Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.22930
Pdf link: https://arxiv.org/pdf/2601.22930
Abstract Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing "long-tail" scenarios. However, existing methods are constrained to single-turn reasoning, limiting their ability to handle complex tasks requiring iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce data transfer overhead caused by high-resolution images and multi-turn sequences, achieving 2.5x training throughput. Our data, models, and code will be made available soon.
中文摘要 轨迹规划是自动驾驶的核心任务，需要预测在多样场景下安全舒适的路径。将多模态大型语言模型（MLLM）与强化学习（RL）整合，在解决“长尾”场景方面展现出潜力。然而，现有方法受限于单回合推理，限制了其处理需要迭代细化的复杂任务的能力。为克服这一限制，我们提出了MTDrive，一个多回合框架，使MLLM能够基于环境反馈迭代优化路径。MTDrive 引入了多回合组相对策略优化（mtGRPO），通过计算跨回合的相对优势来缓解奖励稀疏性。我们进一步构建了闭环模拟的交互式轨迹理解数据集，以支持多回合训练。NAVSIM基准测试的实验显示其性能优于现有方法，验证了我们多回合推理范式的有效性。此外，我们还实施系统级优化，以减少高分辨率图像和多回合序列带来的数据传输开销，实现2.5倍的训练吞吐量。我们的数据、模型和代码将很快公开。

SWE-Manager: Selecting and Synthesizing Golden Proposals Before Coding

SWE-Manager：编码前选择和综合黄金提案

Authors: Boyin Tan, Haoning Deng, Junyuan Zhang, Junjielong Xu, Pinjia He, Youcheng Sun
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.22956
Pdf link: https://arxiv.org/pdf/2601.22956
Abstract Large language model (LLM) research in software engineering has largely focused on tasks such as code generation and bug repair. In practice, teams often draft multiple candidate proposals for fixing an issue and then deliberate on one golden proposal for implementation. This selection requires not only assessing the issue's scope, impact, and urgency, but also a clear understanding of each proposal's strengths and weaknesses. A good selection could make issue resolution more reliable while reducing regression and operational risk, whereas a poor choice can increase risk and even cause unpredictable failures. We first conduct a manual study of real-world issues to characterize the rationales maintainers use when selecting among competing proposals. Motivated by these findings, we introduce SWE-Manager, a joint selection and synthesis approach that selects the best proposal and synthesizes a golden proposal. SWE-Manager is an 8B model trained via reinforcement learning (RL) to compare proposals, justify its choice, and synthesize a golden proposal for implementation. We view proposal selection as a reasoning task, mirroring how technical managers review competing proposals by weighing issue context and each proposal's solution without executing code or running tests. On the SWE-Lancer Manager benchmark, SWE-Manager achieves 53.21 selection accuracy and 57.75 earn rate, earning 152,750 dollars and outperforming strong baselines including GPT-5. To further evaluate the effectiveness of SWE-Manager in real-world issue resolution, we design the P2A framework, which simulates a real-world workflow where multiple proposals are drafted, reviewed, and a golden proposal is selected for implementation ...
中文摘要 软件工程中的大型语言模型（LLM）研究主要集中在代码生成和漏洞修复等任务上。实际上，团队通常会起草多个候选方案来解决问题，然后再讨论一个黄金方案以实施。这一选择不仅需要评估议题的范围、影响和紧迫性，还要清楚理解每个提案的优缺点。一个好的选择可以使问题解决更可靠，同时降低回归和运营风险，而错误的选择则可能增加风险，甚至导致不可预测的失败。我们首先对现实世界的问题进行人工研究，以描述维护者在选择竞争提案时所使用的理由。基于这些发现，我们引入了SWE-Manager，这是一种联合选择与综合方法，能够选择最佳提案并综合黄金提案。SWE-Manager 是一个通过强化学习（RL）训练的 8B 模型，用于比较提案、证明其选择的合理性，并综合出一个用于实施的黄金提案。我们将提案选择视为一项推理任务，类似于技术经理通过权衡问题背景和每个提案的解决方案来审查竞争提案，而无需执行代码或运行测试。在SWE-Lancer Manager基准测试中，SWE-Manager实现了53.21的选择准确率和57.75的获益率，收入152,750美元，并优于包括GPT-5在内的强力基线。为了进一步评估SWE-Manager在现实问题解决中的有效性，我们设计了P2A框架，模拟了一个真实工作流程，在该流程中多个提案被起草、审查，并选出一个黄金提案进行实施......

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

金鹅：从无法验证的互联网文本中合成无限RLVR任务的简单技巧

Authors: Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, Yejin Choi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22975
Pdf link: https://arxiv.org/pdf/2601.22975
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
中文摘要 带可验证奖励的强化学习（RLVR）已成为在大型语言模型（LLM）中解锁复杂推理的基石。然而，强化学习的规模化受限于现有可验证数据的不足，随着训练的延长，性能提升日趋饱和。为克服这一问题，我们提出了Golden Goose（金鹅），这是一种简单的技巧：通过构建中间填充（fill-in-the-middle）任务的多项选择问答版本，从无法验证的互联网文本中合成无限的RLVR任务。给定源文本，我们提示LLM识别并掩盖关键推理步骤，然后生成一组多样且合理的干扰项。这使我们能够利用以往RLVR数据构建中通常被排除的、推理丰富但不可验证的语料库（例如科学教材），合成GooseReason-0.7M，这是一个包含超过70万个任务的大规模RLVR数据集，涵盖数学、编程及一般科学领域。实验表明，GooseReason能有效重新激活在现有RLVR数据上已饱和的模型，在持续强化学习下取得稳健且持续的提升，并使1.5B和4B-Instruct模型在15个多样化基准测试上取得新的最先进结果。最后，我们将Golden Goose部署到真实场景中，针对此前没有任何RLVR数据的网络安全领域，从FineWeb原始抓取数据中合成RLVR任务。在所得数据GooseReason-Cyber上训练Qwen3-4B-Instruct，在网络安全领域取得了新的最先进水平，超越了一个经过大量领域专项预训练和后训练的7B领域专用模型。这凸显了通过利用大量、推理丰富且无法验证的互联网文本来自动扩展RLVR数据的潜力。
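The Golden Goose construction turns free text into a checkable task by masking one reasoning step and hiding it among distractors. A minimal sketch of that data structure and its reward is given below; in the paper an LLM chooses the step to mask and writes the distractors, whereas here both are supplied by hand, and the names `MCQTask`, `build_task`, and `verifiable_reward` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MCQTask:
    prompt: str          # source text with one step replaced by [MASK]
    options: List[str]   # the true step shuffled among distractors
    answer_index: int    # position of the true step in options


def build_task(steps, mask_index, distractors, seed=0):
    """Build a multiple-choice fill-in-the-middle task from a list of steps."""
    rng = random.Random(seed)
    truth = steps[mask_index]
    masked = steps[:mask_index] + ["[MASK]"] + steps[mask_index + 1:]
    options = [truth] + list(distractors)
    rng.shuffle(options)
    return MCQTask("\n".join(masked), options, options.index(truth))


def verifiable_reward(task, chosen_index):
    """Binary reward: checking the answer needs only an index comparison."""
    return 1.0 if chosen_index == task.answer_index else 0.0
```

The point of the multiple-choice form is that the reward becomes exactly verifiable even though the source text itself comes with no ground-truth answer.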

Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

基于连续约束插值框架的自动约束策略优化，用于离线强化学习

Authors: Xinchen Han, Qiuyang Fang, Hossam Afifi, Michel Marot
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23010
Pdf link: https://arxiv.org/pdf/2601.23010
Abstract Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal-dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
中文摘要 离线强化学习（RL）依赖策略约束来减少外推误差，约束形式和约束强度都关键地影响性能。然而，大多数现有方法只遵循单一约束族：加权行为克隆、密度正则化或支持约束，缺乏统一的原则来解释它们的联系或权衡。在本研究中，我们提出了连续约束插值（CCI）这一统一的优化框架，其中这三种约束族作为共同约束谱上的特例出现。CCI框架引入了一个单一插值参数，使得在约束类型间实现平滑过渡和原则性组合。基于CCI，我们开发了自动约束策略优化（ACPO），这是一种实用的原始-对偶算法，通过拉格朗日对偶更新调整插值参数。此外，我们建立了最大熵性能差引理，并推导了闭式最优策略及其参数投影的性能下界。D4RL和NeoRL2的实验在多个领域展现了强劲的提升，整体性能达到最前沿。

Mem-T: Densifying Rewards for Long-Horizon Memory Agents

Mem-T：为长时程记忆智能体稠密化奖励

Authors: Yanwei Yue, Guibin Zhang, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, Yan Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.23014
Pdf link: https://arxiv.org/pdf/2601.23014
Abstract Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is (1) high-performing, surpassing frameworks such as A-Mem and Mem0 by up to 14.92%, and (2) economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by ~24.45% relative to GAM without sacrificing performance.
中文摘要 记忆智能体通过内生地管理记忆的处理、存储和检索，摆脱了预定义的记忆处理流程，因其自主性和适应性而受到越来越多的关注。然而，现有的训练范式仍然受限：智能体往往要经历长时程的记忆操作序列之后，才能收到稀疏且延迟的奖励，这阻碍了对记忆管理策略进行真正端到端的优化。为解决这一限制，我们引入了Mem-T，一种自主记忆智能体，它与轻量级分层记忆数据库对接，对流式输入进行动态更新和多回合检索。为有效训练长时程记忆管理能力，我们进一步提出了MoT-GRPO，一种树引导的强化学习框架，通过记忆操作树反向传播和事后信用分配，将稀疏的终端反馈转化为稠密的逐步监督，从而实现记忆构建与检索的联合优化。大量实验表明，Mem-T（1）性能优异，比A-Mem和Mem0等框架最多高出14.92%；（2）经济高效，处于有利的准确率-效率帕累托前沿上，且每次查询的推理token数相较GAM减少约24.45%，同时不牺牲性能。

Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

以轨迹为导引：修复并奖励工具使用轨迹以实现工具整合推理

Authors: Siyu Gong, Linan Yue, Weibo Gao, Fangzhou Yao, Shimin Di, Lei Feng, Min-Ling Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23032
Pdf link: https://arxiv.org/pdf/2601.23032
Abstract Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and sparse outcome-based rewards, providing limited and biased supervision for learning TIR. To address these challenges, in this paper, we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are directly retained, while low-quality ones are repaired using an LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory paired with its original low-quality counterpart constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding the optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.
中文摘要 工具集成推理（TIR）使大型语言模型（LLMs）能够通过与外部工具交互来解决复杂任务，但现有方法依赖于通过评分函数和稀疏的结果奖励选择的高质量综合轨迹，提供了有限且有偏见的TIR学习监督。为应对这些挑战，本文提出了AutoTraj，这是一个两阶段框架，通过修复和奖励工具使用轨迹自动学习TIR。具体来说，在监督微调（SFT）阶段，AutoTraj 为每个查询生成多个候选工具使用轨迹，并沿多个维度进行评估。高质量轨迹会被直接保留，而低质量轨迹则通过大型语言模型（即作为修复器的LLM）进行修复。修复后且高质量的轨迹构成合成SFT数据集，而每个修复轨迹与其原始低质量轨迹配对构成轨迹偏好建模数据集。在强化学习（RL）阶段，基于偏好数据集，我们训练轨迹级奖励模型，评估推理路径的质量，并将其与结果和格式奖励结合，明确引导优化，朝向可靠的TIR行为。基于现实世界基准测试的实验证明了AutoTraj在TIR中的有效性。

From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning

从绝对到相对：重新思考基于群体强化学习中的奖励塑造

Authors: Wenzhe Niu, Wei He, Zongxia Xie, Jinpeng Ou, Huichuan Fan, Yuchen Ge, Yanru Sun, Ziyin Wang, Yizhao Sun, Chengshun Shi, Jiuchong Gao, Jinghua Hao, Renqing He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.23058
Pdf link: https://arxiv.org/pdf/2601.23058
Abstract Reinforcement learning has become a cornerstone for enhancing the reasoning capabilities of Large Language Models, where group-based approaches such as GRPO have emerged as efficient paradigms that optimize policies by leveraging intra-group performance differences. However, these methods typically rely on absolute numerical rewards, introducing intrinsic limitations. In verifiable tasks, identical group evaluations often result in sparse supervision, while in open-ended scenarios, the score range instability of reward models undermines advantage estimation based on group means. To address these limitations, we propose Reinforcement Learning with Relative Rewards (RLRR), a framework that shifts reward shaping from absolute scoring to relative ranking. Complementing this framework, we introduce the Ranking Reward Model, a listwise preference model tailored for group-based optimization to directly generate relative rankings. By transforming raw evaluations into robust relative signals, RLRR effectively mitigates signal sparsity and reward instability. Experimental results demonstrate that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
中文摘要 强化学习已成为提升大型语言模型推理能力的基石，基于群体的方法如GRPO已成为高效范式，通过利用群体内性能差异优化策略。然而，这些方法通常依赖绝对数值奖励，带来了内在的限制。在可验证任务中，相同的群体评估通常导致监督稀疏，而在开放式情景中，奖励模型的得分区间不稳定性削弱了基于群体均值的优势估计。为解决这些局限性，我们提出了带有相对奖励的强化学习（RLRR）框架，将奖励塑造从绝对评分转向相对排名。在该框架的补充下，我们引入了排名奖励模型，这是一种针对基于群体优化的列表偏好模型，旨在直接生成相对排名。通过将原始评估转化为稳健的相对信号，RLRR有效减轻了信号稀疏性和奖励不稳定性。实验结果表明，RLRR在推理基准和开放式生成任务中，相较于标准基于小组的基线实现了持续的性能提升。
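The RLRR abstract argues that identical absolute scores within a group give no learning signal, while a relative ranking still does. The sketch below illustrates that contrast with a simple evenly spaced rank-to-reward mapping; the mapping itself and the function names are assumptions, since the abstract does not specify how rankings are converted into rewards.

```python
def rank_rewards(ranking):
    """Map a best-to-worst ranking of response indices to rewards in [0, 1]."""
    n = len(ranking)
    rewards = [0.0] * n
    for position, response_index in enumerate(ranking):
        rewards[response_index] = 1.0 - position / (n - 1) if n > 1 else 0.0
    return rewards


def centered_advantages(rewards):
    """Group-mean baseline, as used by group-based policy optimisation."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Absolute scoring: every response in the group is judged correct, so no signal.
print(centered_advantages([1.0, 1.0, 1.0]))

# Relative ranking of the same group: response 2 is best, then 0, then 1.
print(centered_advantages(rank_rewards([2, 0, 1])))
```

A ranking also removes the dependence on the reward model's score range, because only the ordering within the group enters the advantage.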

RN-D: Discretized Categorical Actors with Regularized Networks for On-Policy Reinforcement Learning

RN-D：带有正则化网络的离散化类别演员用于策略内强化学习

Authors: Yuexin Bian, Jie Feng, Tao Wang, Yijiang Li, Sicun Gao, Yuanyuan Shi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.23075
Pdf link: https://arxiv.org/pdf/2601.23075
Abstract On-policy deep reinforcement learning remains a dominant paradigm for continuous control, yet standard implementations rely on Gaussian actors and relatively shallow MLP policies, often leading to brittle optimization when gradients are noisy and policy updates must be conservative. In this paper, we revisit policy representation as a first-class design choice for on-policy optimization. We study discretized categorical actors that represent each action dimension with a distribution over bins, yielding a policy objective that resembles a cross-entropy loss. Building on architectural advances from supervised learning, we further propose regularized actor networks, while keeping critic design fixed. Our results show that simply replacing the standard actor network with our discretized regularized actor yields consistent gains and achieves state-of-the-art performance across diverse continuous-control benchmarks.
中文摘要 同策略（on-policy）深度强化学习仍然是连续控制的主导范式，但标准实现依赖高斯演员和相对较浅的MLP策略，当梯度噪声大且策略更新必须保守时，常常导致优化脆弱。本文将策略表示重新审视为同策略优化中的一等设计选择。我们研究了离散化类别演员，它用各个分箱（bin）上的分布来表示每个动作维度，得到类似交叉熵损失的策略目标。基于监督学习中的架构进展，我们进一步提出了正则化演员网络，同时保持评论家（critic）设计不变。我们的结果表明，仅仅用离散化正则化演员替换标准演员网络，就能获得一致的提升，并在多种连续控制基准测试中实现最先进的性能。
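A discretized categorical actor replaces the Gaussian over each continuous action dimension with a softmax over bins. The sketch below shows sampling and the resulting log-probability for hand-written logits; the uniform bin layout, the use of bin centres as actions, and the helper names are assumptions, and the regularized network that would produce the logits is not shown.

```python
import math
import random


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def bin_centers(low, high, n_bins):
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def sample_action(logits_per_dim, low, high, rng):
    """Sample one bin per action dimension; return the action and its log-prob."""
    centers = bin_centers(low, high, len(logits_per_dim[0]))
    action, log_prob = [], 0.0
    for logits in logits_per_dim:
        log_probs = log_softmax(logits)
        probs = [math.exp(v) for v in log_probs]
        i = rng.choices(range(len(probs)), weights=probs)[0]
        action.append(centers[i])
        log_prob += log_probs[i]  # dimensions are treated as independent
    return action, log_prob


rng = random.Random(0)
logits = [[0.1, 2.0, -1.0, 0.3], [0.0, 0.0, 1.5, 0.0]]  # 2 action dims, 4 bins each
print(sample_action(logits, low=-1.0, high=1.0, rng=rng))
```

Since the log-probability is a sum of per-dimension log-softmax terms, the policy-gradient objective takes the cross-entropy-like form the abstract mentions.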

Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients

为什么GRPO需要归一化：自适应梯度的局部曲率视角

Authors: Cheng Ge, Caitlyn Heqi Yin, Hao Liang, Jiawei Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.23135
Pdf link: https://arxiv.org/pdf/2601.23135
Abstract Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and variance normalization. Yet why and when this normalization helps remains unclear. In this work, we provide an explanation through the lens of local curvature of the sequence-level policy gradient: standard deviation normalization implements an adaptive gradient. Theoretically, under mild conditions, GRPO enjoys a strictly improved convergence rate over unnormalized REINFORCE, with gains characterized by the average within-prompt reward standard deviation across prompts and iterations. Empirically, our analysis on GSM8K and MATH benchmarks reveals three distinct training phases governed by the interplay between feature orthogonality and reward variance: (I) an early acceleration phase where high variance and orthogonality favor adaptive scaling; (II) a relatively stable transition phase; and (III) a late-stage regime where the loss of orthogonality limits further gains. Together, these results provide a principled account of when std normalization helps in GRPO, and offer broader insights into the design of critic-free RL algorithms.
中文摘要 强化学习（RL）已成为语言模型推理的关键驱动力。在强化学习算法中，组相对策略优化（Group Relative Policy Optimization，GRPO）是事实上的标准，它通过使用每个提示的基线和方差归一化，避免了对评论家（critic）的需求。然而，这种归一化为何以及何时有效，仍不清楚。本研究从序列级策略梯度的局部曲率角度给出解释：标准差归一化实现了一种自适应梯度。理论上，在温和条件下，GRPO的收敛速率严格优于未归一化的REINFORCE，其增益由各提示和各迭代上提示内奖励标准差的平均值刻画。在实证方面，我们对GSM8K和MATH基准的分析揭示了三个不同的训练阶段，这些阶段由特征正交性和奖励方差之间的相互作用所支配：（I）早期加速阶段，高方差和正交性有利于自适应缩放；（II）相对稳定的过渡阶段；以及（III）后期阶段，正交性的丧失限制了进一步的收益。这些结果共同对标准差归一化在GRPO中何时有帮助给出了有原则的解释，并为无评论家强化学习算法的设计提供了更广泛的见解。
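The per-prompt baseline and standard-deviation normalization that the abstract analyses can be written in a few lines. This is a sketch of the commonly described GRPO advantage, not the paper's analysis; the `eps` value and the use of the population standard deviation are assumptions, and implementations differ on both.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def unnormalized_advantages(rewards):
    """REINFORCE-style baseline only, with no std normalization."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


high_variance_prompt = [0.0, 1.0, 0.0, 1.0]
low_variance_prompt = [0.5, 0.6, 0.5, 0.6]

for group in (high_variance_prompt, low_variance_prompt):
    print(unnormalized_advantages(group), grpo_advantages(group))
```

Dividing by the within-prompt standard deviation rescales each prompt's update individually, which is the sense in which the abstract describes normalization as an adaptive gradient.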

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

THINKSAFE：推理模型的自生成安全对齐

Authors: Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23143
Pdf link: https://arxiv.org/pdf/2601.23143
Abstract Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at this https URL.
中文摘要 大型推理模型（LRM）通过在推理任务中利用强化学习（RL）生成长思维链（CoT）推理，实现了卓越的性能。然而，这种过度优化往往优先考虑顺从性，使模型容易受到有害提示的影响。为了减轻这种安全性下降，近期方法依赖外部教师蒸馏，但这会引入分布差异，损害模型原生的推理能力。我们提出了ThinkSafe，一种自生成的对齐框架，可以在没有外部教师的情况下恢复安全对齐。我们的关键见解是，虽然顺从性抑制了安全机制，但模型往往保留了识别危害的潜在知识。ThinkSafe通过轻量级拒绝引导解锁这一点，引导模型生成分布内的安全推理轨迹。在这些自生成回答上进行微调可以有效重新对齐模型，同时最小化分布偏移。在DeepSeek-R1-Distill和Qwen3上的实验显示，ThinkSafe在保持推理能力的同时，显著提升了安全性。值得注意的是，与GRPO相比，它的安全性更优、推理能力相当，且计算成本显著降低。代码、模型和数据集均可在此 https URL 访问。

On Safer Reinforcement Learning Policies for Sedation and Analgesia in Intensive Care

面向重症监护镇静与镇痛的更安全强化学习策略

Authors: Joel Romero-Hernandez, Oscar Camara
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23154
Pdf link: https://arxiv.org/pdf/2601.23154
Abstract Pain management in intensive care usually involves complex trade-offs between therapeutic goals and patient safety, since both inadequate and excessive treatment may induce serious sequelae. Reinforcement learning can help address this challenge by learning medication dosing policies from retrospective data. However, prior work on sedation and analgesia has optimized for objectives that do not value patient survival while relying on algorithms unsuitable for imperfect information settings. We investigated the risks of these design choices by implementing a deep reinforcement learning framework to suggest hourly medication doses under partial observability. Using data from 47,144 ICU stays in the MIMIC-IV database, we trained policies to prescribe opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and mortality. We found that, although the two policies were associated with lower pain, actions from the first policy were positively correlated with mortality, while those proposed by the second policy were negatively correlated. This suggests that valuing long-term outcomes could be critical for safer treatment policies, even if a short-term goal remains the primary objective.
中文摘要 重症监护中的疼痛管理通常涉及治疗目标与患者安全之间的复杂权衡，因为治疗不足或过度都可能引发严重的后遗症。强化学习可以通过从回顾性数据中学习药物剂量策略来应对这一挑战。然而，以往关于镇静和镇痛的研究所优化的目标并未重视患者生存，同时依赖不适合不完全信息环境的算法。我们通过实现一个在部分可观测条件下建议每小时用药剂量的深度强化学习框架，探讨了这些设计选择的风险。利用MIMIC-IV数据库中47,144次ICU住院的数据，我们按照两个目标训练策略来开具阿片类药物、丙泊酚、苯二氮卓类药物和右美托咪定：减轻疼痛，或同时降低疼痛和死亡率。我们发现，尽管这两种策略都与较低的疼痛相关，第一种策略的动作与死亡率呈正相关，而第二种策略提出的动作则呈负相关。这表明，即使短期目标仍是主要目标，重视长期结果对于更安全的治疗策略也可能至关重要。

Unsupervised Hierarchical Skill Discovery

无监督层级技能发现

Authors: Damion Harvey (1), Geraud Nangue Tasse (1 and 2), Branden Ingram (1 and 2), Benjamin Rosman (1 and 2), Steven James (1 and 2) ((1) University of the Witwatersrand, Johannesburg, South Africa, (2) Machine Intelligence and Neural Discovery (MIND) Institute, University of the Witwatersrand, Johannesburg, South Africa)
Subjects: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
Arxiv link: https://arxiv.org/abs/2601.23156
Pdf link: https://arxiv.org/pdf/2601.23156
Abstract We consider the problem of unsupervised skill segmentation and hierarchical structure discovery in reinforcement learning. While recent approaches have sought to segment trajectories into reusable skills or options, most rely on action labels, rewards, or handcrafted annotations, limiting their applicability. We propose a method that segments unlabelled trajectories into skills and induces a hierarchical structure over them using a grammar-based approach. The resulting hierarchy captures both low-level behaviours and their composition into higher-level skills. We evaluate our approach in high-dimensional, pixel-based environments, including Craftax and the full, unmodified version of Minecraft. Using metrics for skill segmentation, reuse, and hierarchy quality, we find that our method consistently produces more structured and semantically meaningful hierarchies than existing baselines. Furthermore, as a proof of concept for utility, we demonstrate that these discovered hierarchies accelerate and stabilise learning on downstream reinforcement learning tasks.
中文摘要 我们考察强化学习中无监督技能分割和层级结构发现的问题。虽然近期方法试图将轨迹分割为可重复使用的技能或选项，但大多数依赖动作标签、奖励或手工标注，限制了其适用范围。我们提出了一种方法，将未标记轨迹分割为技能，并通过基于语法的方法在这些技能之上归纳出层级结构。由此产生的层级结构既捕捉了低层行为，也捕捉了它们向高层技能的组合。我们在高维、基于像素的环境中评估我们的方法，包括Craftax和完整未修改的Minecraft版本。利用技能分割、复用和层级质量的指标，我们发现我们的方法始终比现有基线产生更有结构且语义更有意义的层级结构。此外，作为效用的概念验证，我们证明这些发现的层级结构加速并稳定了下游强化学习任务中的学习。

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Med-Scout：通过几何感知强化学习后训练，治愈MLLM在医学感知中的几何盲点

Authors: Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23220
Pdf link: https://arxiv.org/pdf/2601.23220
Abstract Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.
中文摘要 尽管近期的多模态大型语言模型（MLLMs）在医学诊断中展现出很强的语言能力，我们发现即使是最先进的MLLM也存在一个关键的感知缺陷：几何盲。这种未能将输出建立在客观几何约束之上的缺陷，导致了看似合理但事实错误的幻觉，其根源在于优先考虑语言流畅度而非几何保真度的训练范式。本文介绍了Med-Scout，一种通过强化学习（RL）“治愈”这种几何盲的新框架，它利用了未标记医学图像中潜藏的内在几何逻辑。Med-Scout 不依赖昂贵的专家标注，而是通过三个代理任务推导出可验证的监督信号：层级尺度定位、拓扑拼图重建和异常一致性检测。为了严格量化这一缺陷，我们提出了Med-Scout-Bench，这是一个专门用于评估几何感知的新基准。大量评估显示，Med-Scout 显著缓解了几何盲，在我们的基准上超过领先的专有和开源 MLLM 40% 以上。此外，这种增强的几何感知能力可以泛化到更广泛的医学理解，在放射学和综合医学VQA任务中取得更优结果。

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Video-o3：面向长视频多跳推理的原生交错线索搜寻

Authors: Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, Ziang Yan, Yi Wang, Hongjie Zhang, Yali Wang, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.23224
Pdf link: https://arxiv.org/pdf/2601.23224
Abstract Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.
中文摘要 现有用于长视频理解的多模态大型语言模型主要依赖于均匀采样和单轮推理，限制了它们在大量冗余中识别稀疏但关键证据的能力。我们介绍了Video-o3，这是一个新颖框架，支持迭代发现显著的视觉线索、对关键片段的细粒度检查，以及在获得足够证据后自适应终止。从技术上讲，我们解决了交错工具调用中的两个核心挑战。首先，为了减轻推理和工具调用的异质性引起的注意力分散，我们提出了任务解耦注意力掩蔽（Task-Decoupled Attention Masking），该方法在保持共享全局上下文的同时，隔离每一步的注意力集中。其次，为了控制多轮交互中的上下文长度增长，我们引入了可验证轨迹引导奖励，平衡探索覆盖率与推理效率。为支持大规模训练，我们进一步开发了数据合成流水线，并构建了Seeker-173K，其中包含173K条高质量工具交互轨迹，用于有效的监督学习和强化学习。大量实验表明，Video-o3大幅超越最先进的方法，在MLVU上达到72.1%的准确率，在Video-Holmes上达到46.5%的准确率。这些结果展示了Video-o3强大的多跳证据搜寻和推理能力，并验证了原生工具调用在长视频场景中的有效性。

Agile Reinforcement Learning through Separable Neural Architecture

通过可分神经架构实现敏捷强化学习

Authors: Rajib Mostakim, Reza T. Batley, Sourav Saha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23225
Pdf link: https://arxiv.org/pdf/2601.23225
Abstract Deep reinforcement learning (RL) is increasingly deployed in resource-constrained environments, yet the go-to function approximators - multilayer perceptrons (MLPs) - are often parameter-inefficient due to an imperfect inductive bias for the smooth structure of many value functions. This mismatch can also hinder sample efficiency and slow policy learning in this capacity-limited regime. Although model compression techniques exist, they operate post-hoc and do not improve learning efficiency. Recent spline-based separable architectures - such as Kolmogorov-Arnold Networks (KANs) - have been shown to offer parameter efficiency but are widely reported to exhibit significant computational overhead, especially at scale. In seeking to address these limitations, this work introduces SPAN (SPline-based Adaptive Networks), a novel function approximation approach to RL. SPAN adapts the low rank KHRONOS framework by integrating a learnable preprocessing layer with a separable tensor product B-spline basis. SPAN is evaluated across discrete (PPO) and high-dimensional continuous (SAC) control tasks, as well as offline settings (Minari/D4RL). Empirical results demonstrate that SPAN achieves a 30-50% improvement in sample efficiency and 1.3-9 times higher success rates across benchmarks compared to MLP baselines. Furthermore, SPAN demonstrates superior anytime performance and robustness to hyperparameter variations, suggesting it as a viable, high performance alternative for learning intrinsically efficient policies in resource-limited settings.
中文摘要 深度强化学习（RL）越来越多地部署于资源受限的环境中，但常用的函数近似器——多层感知器（MLP）——由于对许多价值函数的平滑结构缺乏合适的归纳偏置，往往参数效率不高。这种不匹配还可能损害样本效率，并在容量受限的情形下减缓策略学习。尽管存在模型压缩技术，但它们是事后进行的，并不能提升学习效率。近期基于样条的可分离架构——如Kolmogorov-Arnold Networks（KAN）——已被证明具有参数效率，但普遍被报道存在显著的计算开销，尤其是在大规模时。为了解决这些局限性，本研究引入了SPAN（SPline-based Adaptive Networks，基于样条的自适应网络），这是一种新颖的强化学习函数近似方法。SPAN通过将可学习的预处理层与可分离的张量积B样条基相结合，对低秩KHRONOS框架进行了改造。SPAN在离散（PPO）和高维连续（SAC）控制任务以及离线设置（Minari/D4RL）中进行了评估。实证结果显示，与MLP基线相比，SPAN在各基准上的样本效率提升了30%-50%，成功率高出1.3-9倍。此外，SPAN展现出更优的任意时刻（anytime）性能和对超参数变化的鲁棒性，表明它是在资源受限环境中学习本质高效策略的可行高性能替代方案。

IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models

IRL-DAL：通过能量引导扩散模型实现自动驾驶的安全与适应性轨迹规划

Authors: Seyed Ahmad Hosseini Miangoleh, Amin Jalal Aghdasian, Farzaneh Abdollahi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.23266
Pdf link: https://arxiv.org/pdf/2601.23266
Abstract This paper proposes a novel inverse reinforcement learning framework using a diffusion-based adaptive lookahead planner (IRL-DAL) for autonomous vehicles. Training begins with imitation from an expert finite state machine (FSM) controller to provide a stable initialization. Environment terms are combined with an IRL discriminator signal to align with expert goals. Reinforcement learning (RL) is then performed with a hybrid reward that combines diffuse environmental feedback and targeted IRL rewards. A conditional diffusion model, which acts as a safety supervisor, plans safe paths. It stays in its lane, avoids obstacles, and moves smoothly. Then, a learnable adaptive mask (LAM) improves perception. It shifts visual attention based on vehicle speed and nearby hazards. After FSM-based imitation, the policy is fine-tuned with Proximal Policy Optimization (PPO). Training is run in the Webots simulator with a two-stage curriculum. A 96% success rate is reached, and collisions are reduced to 0.05 per 1k steps, marking a new benchmark for safe navigation. By applying the proposed approach, the agent not only drives in lane but also handles unsafe conditions at an expert level, increasing this http URL make our code publicly available.
中文摘要 本文提出了一种用于自动驾驶车辆的新型逆强化学习框架，该框架使用基于扩散的自适应前瞻规划器（IRL-DAL）。训练从模仿专家有限状态机（FSM）控制器开始，以提供稳定的初始化。环境项与IRL判别器信号相结合，以与专家目标保持一致。随后使用混合奖励进行强化学习（RL），该奖励结合了弥散的环境反馈和有针对性的IRL奖励。条件扩散模型作为安全监督器，规划安全路径。它保持在车道内，避开障碍物，并平稳行驶。然后，可学习的自适应掩码（LAM）提升感知能力。它会根据车辆速度和附近的危险转移视觉注意力。在基于FSM的模仿之后，策略通过近端策略优化（PPO）进行微调。训练在Webots模拟器中进行，采用两阶段课程。成功率达到96%，碰撞降至每1000步0.05次，树立了安全导航的新标杆。通过采用所提出的方法，智能体不仅能在车道内行驶，还能以专家级水平处理不安全状况，增加此 http URL，使代码公开。

Keyword: diffusion policy

Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation

面向高效稳健视觉导航的自模仿扩散策略

Authors: Runhua Zhang, Junyi Hou, Changxu Cheng, Qiyi Chen, Tao Wang, Wuyue Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.22965
Pdf link: https://arxiv.org/pdf/2601.22965
Abstract Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse multi-modal trajectory distributions. However, standard imitation learning (IL), which most DP methods rely on for training, often inherits sub-optimality and redundancy from expert demonstrations, thereby necessitating a computationally intensive "generate-then-filter" pipeline that relies on auxiliary selectors during inference. To address these challenges, we propose Self-Imitated Diffusion Policy (SIDP), a novel framework that learns improved planning by selectively imitating a set of trajectories sampled from itself. Specifically, SIDP introduces a reward-guided self-imitation mechanism that encourages the policy to consistently produce high-quality trajectories efficiently, rather than outputs of inconsistent quality, thereby reducing reliance on extensive sampling and post-filtering. During training, we employ a reward-driven curriculum learning paradigm to mitigate inefficient data utility, and goal-agnostic exploration for trajectory augmentation to improve planning robustness. Extensive evaluations on a comprehensive simulation benchmark show that SIDP significantly outperforms previous methods, with real-world experiments confirming its effectiveness across multiple robotic platforms. On Jetson Orin Nano, SIDP delivers a 2.5× faster inference than the baseline NavDP, i.e., 110 ms vs. 273 ms, enabling efficient real-time deployment.
中文摘要 扩散策略（DP）通过捕捉多样的多模态轨迹分布，已在视觉导航中展现出显著潜力。然而，大多数DP方法训练所依赖的标准模仿学习（IL）常常继承了专家演示的次优性和冗余性，因此需要一个计算密集型的“先生成后过滤”流水线，并在推理过程中依赖辅助选择器。为应对这些挑战，我们提出了自模仿扩散策略（SIDP），这是一种新颖框架，通过选择性地模仿从自身采样的一组轨迹来学习改进的规划。具体来说，SIDP引入了一种奖励引导的自模仿机制，鼓励策略持续高效地产生高质量轨迹，而非质量不一致的输出，从而减少对大量采样和后期过滤的依赖。在训练过程中，我们采用奖励驱动的课程学习范式来缓解数据利用效率低下的问题，并采用与目标无关的探索进行轨迹增强，以提升规划的稳健性。在综合模拟基准上的广泛评估显示，SIDP的表现显著优于以往方法，真实世界实验也证实了其在多个机器人平台上的有效性。在 Jetson Orin Nano 上，SIDP 的推理速度比基线 NavDP 快 2.5 倍，即 110 毫秒对比 273 毫秒，实现高效的实时部署。