Arxiv Papers of Today

Generated: 2025-11-11 16:31:34 (UTC+8); Arxiv release time: 2025-11-11 20:00 EST (2025-11-12 09:00 UTC+8)

67 relevant papers today

Keyword: reinforcement learning

Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models

Authors: Sanghyun Lee, Seungryong Kim, Jongho Park, Dongmin Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05563
Pdf link: https://arxiv.org/pdf/2511.05563
Abstract Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference-time order of unmasking. Prevailing heuristics, such as confidence-based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence-level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself, showing that uncertainty-based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.
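
The verifier step described in this abstract can be pictured with a small sketch. It is not the authors' implementation: the exponential weighting `exp(-u / temperature)`, the function names, and the uncertainty scores are all assumptions made for illustration. It only shows how uncertainty scores over candidate unmasking paths could be turned into sampling weights.

```python
import math
import random

def selection_weights(uncertainties, temperature=1.0):
    """Turn per-path uncertainty scores into normalized selection weights.

    Assumption: lower uncertainty should get higher weight, modeled here
    as exp(-u / temperature). The paper's actual weighting may differ.
    """
    lowest = min(uncertainties)
    # Subtract the minimum before exponentiating for numerical stability.
    raw = [math.exp(-(u - lowest) / temperature) for u in uncertainties]
    total = sum(raw)
    return [r / total for r in raw]

def select_path(paths, uncertainties, rng, temperature=1.0):
    """Sample one candidate path in proportion to its weight."""
    weights = selection_weights(uncertainties, temperature)
    return rng.choices(paths, weights=weights, k=1)[0]

# Hypothetical candidates: each path is an order in which positions are unmasked.
candidate_paths = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]
candidate_uncertainty = [0.9, 0.3, 1.5]  # made-up scores for illustration
weights = selection_weights(candidate_uncertainty)
chosen = select_path(candidate_paths, candidate_uncertainty, random.Random(0))
```

With these made-up scores the second path carries the largest weight, so it is the most likely (though not guaranteed) choice.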

CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling

Authors: Zekai Qu, Yinxu Pan, Ao Sun, Chaojun Xiao, Xu Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05589
Pdf link: https://arxiv.org/pdf/2511.05589
Abstract Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities from the previous policy with those recomputed under the current policy for importance sampling correction. Experiments on challenging mathematical reasoning benchmarks show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems. The code of CoPRIS is available at this https URL.
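
One possible reading of the cross-stage correction is sketched below. It assumes that a trajectory's behavior log-probabilities are formed by joining the values buffered when an earlier policy generated the prefix with the values logged when the current policy finished the rollout, and that per-token importance ratios are then taken against the learner's log-probabilities. The names and numbers are hypothetical, and the paper's exact formulation may differ.

```python
import math

def token_ratios(behavior_logps, learner_logps):
    """Per-token importance ratio: pi_learner(token) / pi_behavior(token)."""
    return [math.exp(l - b) for b, l in zip(behavior_logps, learner_logps)]

# Made-up log-probabilities for a five-token trajectory.
buffered_prefix = [-1.2, -0.7]        # logged when an earlier policy generated the prefix
fresh_suffix = [-0.5, -0.9, -0.3]     # logged when the current policy finished the rollout
behavior = buffered_prefix + fresh_suffix  # the cross-stage concatenation

learner = [-1.0, -0.8, -0.5, -0.9, -0.3]   # learner log-probs for all five tokens
ratios = token_ratios(behavior, learner)
```

In this toy case the suffix tokens are on-policy, so their ratios equal 1, while the reused prefix tokens get ratios that differ from 1.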

Distributionally Robust Self Paced Curriculum Reinforcement Learning

Authors: Anirudh Satheesh, Keenan Powell, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.05694
Pdf link: https://arxiv.org/pdf/2511.05694
Abstract A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment into real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $\epsilon$. However, fixing $\epsilon$ results in a tradeoff between performance and robustness: small values yield high nominal performance but weak robustness, while large values can result in instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $\epsilon$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent's progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average 11.8\% increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately 1.9$\times$ the performance of the corresponding nominal RL algorithms.
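
The idea of scheduling $\epsilon$ according to the agent's progress can be illustrated with a deliberately simple rule. This is a hypothetical threshold schedule, not the DR-SPCRL update: `target_return`, `step`, and `epsilon_max` are invented parameters.

```python
def next_budget(epsilon, recent_return, target_return, step=0.05, epsilon_max=1.0):
    """Grow the robustness budget only once the agent performs well enough.

    Assumption: progress is measured by comparing a recent episodic return
    with a fixed target. The paper's self-paced rule may be quite different.
    """
    if recent_return >= target_return:
        return min(epsilon + step, epsilon_max)
    return epsilon

epsilon = 0.0
returns = [10.0, 55.0, 60.0, 40.0, 70.0]  # made-up episodic returns
schedule = []
for episode_return in returns:
    epsilon = next_budget(epsilon, episode_return, target_return=50.0)
    schedule.append(round(epsilon, 2))
```

The budget stays flat while the agent struggles and rises in small steps once returns clear the target, which is the qualitative behavior a curriculum over $\epsilon$ aims for.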

STAIR: Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making

Authors: Roee M. Francos, Daniel Garces, Orhan Eren Akgün, Stephanie Gil
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.05715
Pdf link: https://arxiv.org/pdf/2511.05715
Abstract A major limitation of existing routing algorithms for multi-agent systems is that they are designed without considering the potential presence of adversarial agents in the decision-making loop, which could lead to severe performance degradation in real-life applications where adversarial agents may be present. We study autonomous pickup-and-delivery routing problems in which adversarial agents launch coordinated denial-of-service attacks by spoofing their locations. This deception causes the central scheduler to assign pickup requests to adversarial agents instead of cooperative agents. Adversarial agents then choose not to service the requests with the goal of disrupting the operation of the system, leading to delays, cancellations, and potential instability in the routing policy. Policy stability in routing problems is typically defined as the cost of the policy being uniformly bounded over time, and it has been studied through two different lenses: queuing theory and reinforcement learning (RL), which are not well suited for routing with adversaries. In this paper, we propose a new stability criterion, STAIR, which is easier to analyze than queuing-theory-based stability in adversarial settings. Furthermore, STAIR does not depend on a chosen discount factor as is the case in discounted RL stability. STAIR directly links stability to desired operational metrics, like a finite number of rejected requests. This characterization is particularly useful in adversarial settings as it provides a metric for monitoring the effect of adversaries in the operation of the system. Furthermore, we demonstrate STAIR's practical relevance through simulations on real-world San Francisco mobility-on-demand data. We also identify a phenomenon of degenerate stability that arises in the adversarial routing problem, and we introduce time-window constraints in the decision-making algorithm to mitigate it.

SymLight: Exploring Interpretable and Deployable Symbolic Policies for Traffic Signal Control

Authors: Xiao-Cheng Liao, Yi Mei, Mengjie Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05790
Pdf link: https://arxiv.org/pdf/2511.05790
Abstract Deep reinforcement learning has achieved significant success in automatically devising effective traffic signal control (TSC) policies. Neural policies, however, tend to be over-parameterized and non-transparent, hindering their interpretability and deployability on resource-limited edge devices. This work presents SymLight, a priority function search framework based on Monte Carlo Tree Search (MCTS) for discovering inherently interpretable and deployable symbolic priority functions to serve as the TSC policies. The priority function, in particular, accepts traffic features as input and then outputs a priority for each traffic signal phase, which subsequently directs the phase transition. For effective search, we propose a concise yet expressive priority function representation. This helps mitigate the combinatorial explosion of the action space in MCTS. Additionally, a probabilistic structural rollout strategy is introduced to leverage structural patterns from previously discovered high-quality priority functions, guiding the rollout process. Our experiments on real-world datasets demonstrate SymLight's superior performance over a range of baselines. A key advantage is SymLight's ability to produce interpretable and deployable TSC policies while maintaining excellent performance.
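
To make the notion of a symbolic priority function concrete, here is a toy sketch. The expression `queue_length + 0.5 * waiting_time`, the feature names, and the phase names are all invented for illustration and are not a function discovered by SymLight.

```python
def priority(queue_length, waiting_time):
    """Hypothetical symbolic priority: a short closed-form expression."""
    return queue_length + 0.5 * waiting_time

def choose_phase(phase_features):
    """Return the phase whose traffic features score the highest priority."""
    return max(phase_features, key=lambda name: priority(*phase_features[name]))

# Made-up features per signal phase: (queue length, waiting time in seconds).
phases = {
    "north_south": (12, 30.0),
    "east_west": (8, 20.0),
    "left_turns": (3, 50.0),
}
next_phase = choose_phase(phases)
```

A policy of this shape is a single readable expression, which is what makes it easy to inspect and cheap to run on an edge device.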

Evader-Agnostic Team-Based Pursuit Strategies in Partially-Observable Environments

Authors: Addison Kalanther, Daniel Bostwick, Chinmay Maheshwari, Shankar Sastry
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.05812
Pdf link: https://arxiv.org/pdf/2511.05812
Abstract We consider a scenario where a team of two unmanned aerial vehicles (UAVs) pursue an evader UAV within an urban environment. Each agent has a limited view of their environment where buildings can occlude their field-of-view. Additionally, the pursuer team is agnostic about the evader in terms of its initial and final location, and the behavior of the evader. Consequently, the team needs to gather information by searching the environment and then track it to eventually intercept. To solve this multi-player, partially-observable, pursuit-evasion game, we develop a two-phase neuro-symbolic algorithm centered around the principle of bounded rationality. First, we devise an offline approach using deep reinforcement learning to progressively train adversarial policies for the pursuer team against fictitious evaders. This creates $k$-levels of rationality for each agent in preparation for the online phase. Then, we employ an online classification algorithm to determine a "best guess" of our current opponent from the set of iteratively-trained strategic agents and apply the best player response. Using this schema, we improved average performance when facing a random evader in our environment.

WAR-Re: Web API Recommendation with Semantic Reasoning

Authors: Zishuo Xu, Dezhong Yao, Yao Wan
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05820
Pdf link: https://arxiv.org/pdf/2511.05820
Abstract With the development of cloud computing, the number of Web APIs has increased dramatically, further intensifying the demand for efficient Web API recommendation. Despite the demonstrated success of previous Web API recommendation solutions, two critical challenges persist: 1) a fixed top-N recommendation that cannot accommodate the varying API cardinality requirements of different mashups, and 2) these methods output only ranked API lists without accompanying reasons, depriving users of understanding the recommendation. To address these challenges, we propose WAR-Re, an LLM-based model for Web API recommendation with semantic reasoning for justification. WAR-Re leverages special start and stop tokens to handle the first challenge and uses two-stage training: supervised fine-tuning and reinforcement learning via Group Relative Policy Optimization (GRPO) to enhance the model's ability in both tasks. Comprehensive experimental evaluations on the ProgrammableWeb dataset demonstrate that WAR-Re achieves a gain of up to 21.59\% over the state-of-the-art baseline model in recommendation accuracy, while consistently producing high-quality semantic reasons for recommendations.
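
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step, with made-up rewards. The reward design and the rest of the WAR-Re training loop are not described in the abstract and are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [0.2, 0.9, 0.5, 0.4]  # hypothetical rewards for four sampled responses
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to form a baseline.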

Policy Gradient-Based EMT-in-the-Loop Learning to Mitigate Sub-Synchronous Control Interactions

Authors: Sayak Mukherjee, Ramij R. Hossain, Kaustav Chatterjee, Sameer Nekkalapu, Marcelo Elizondo
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05822
Pdf link: https://arxiv.org/pdf/2511.05822
Abstract This paper explores the development of learning-based tunable control gains using an EMT-in-the-loop simulation framework (e.g., PSCAD interfaced with Python-based learning modules) to address critical sub-synchronous oscillations. Since sub-synchronous control interactions (SSCI) arise from the mis-tuning of control gains under specific grid configurations, effective mitigation strategies require adaptive re-tuning of these gains. Such adaptiveness can be achieved by employing a closed-loop, learning-based framework that considers the grid conditions responsible for such sub-synchronous oscillations. This paper addresses this need by adopting methodologies inspired by Markov decision process (MDP) based reinforcement learning (RL), with a particular emphasis on simpler deep policy gradient methods with additional SSCI-specific signal processing modules such as down-sampling, bandpass filtering, and oscillation-energy-dependent reward computations. Our experimentation in a real-world event setting demonstrates that a policy trained with deep policy gradients can adaptively compute gain settings in response to varying grid conditions and optimally suppress control interaction-induced oscillations.

Learning-Based Multi-Stage Strategy for a Fixed-Wing Aircraft to Evade a Missile Detected at a Short Distance

Authors: Zhiguan Niu, Xiaochao Zhou, Hao Xiong
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.05828
Pdf link: https://arxiv.org/pdf/2511.05828
Abstract Missiles pose a major threat to aircraft in modern air combat. Advances in technology make them increasingly difficult to detect until they are close to the target and highly resistant to jamming. The evasion maneuver is the last line of defense for an aircraft. However, conventional rule-based evasion strategies are limited by computational demands and aerodynamic constraints, and existing learning-based approaches remain unconvincing for manned aircraft against modern missiles. To enhance aircraft survivability, this study investigates missile evasion inspired by the pursuit-evasion game between a gazelle and a cheetah and proposes a multi-stage reinforcement learning-based evasion strategy. The strategy learns a large azimuth policy to turn to evade, a small azimuth policy to keep moving away, and a short distance policy to perform agile aggressive maneuvers to avoid. One of the three policies is activated at each stage based on distance and azimuth. To evaluate performance, a high-fidelity simulation environment modeling an F-16 aircraft and missile under various conditions is used to compare the proposed approach with baseline strategies. Experimental results show that the proposed method achieves superior performance, enabling the F-16 aircraft to successfully avoid missiles with a probability of 80.89 percent for velocities ranging from 800 m/s to 1400 m/s, maximum overloads from 40 g to 50 g, detection distances from 5000 m to 15000 m, and random azimuths. When the missile is detected beyond 8000 m, the success ratio increases to 85.06 percent.
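
The stage switching described above, where one of three policies is activated based on distance and azimuth, might look like the following dispatcher. The thresholds of 2000 m and 90 degrees, the ordering of the checks, and the policy labels are assumptions. The abstract does not give the actual switching rule.

```python
def select_policy(distance_m, azimuth_deg,
                  short_distance_m=2000.0, large_azimuth_deg=90.0):
    """Pick which of the three learned policies is active at this step.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    if distance_m <= short_distance_m:
        return "short_distance"   # agile aggressive maneuvers
    if abs(azimuth_deg) >= large_azimuth_deg:
        return "large_azimuth"    # turn to evade
    return "small_azimuth"        # keep moving away

stages = [
    select_policy(9000.0, 120.0),
    select_policy(9000.0, 15.0),
    select_policy(1500.0, 10.0),
]
```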

EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph

Authors: Nan Jiang, Ziyi Wang, Yexiang Xue
Subjects: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.05849
Pdf link: https://arxiv.org/pdf/2511.05849
Abstract Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expression renders the task computationally challenging. A promising yet underexplored direction for reducing the effective search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalence classes in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Under mild assumptions, we show that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances multiple baselines across challenging benchmarks, discovering equations with lower normalized mean squared error than state-of-the-art methods. Code implementation is available at: this https URL.
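
The three expressions quoted in the abstract can be checked numerically. The sketch below only confirms that they define the same function for positive inputs. It does not build an e-graph, and the test points are arbitrary.

```python
import math

def expr_product(x1, x2):
    return math.log(x1**2 * x2**3)

def expr_sum_of_logs(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def expr_expanded(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

# Arbitrary positive test points (the identities require x1, x2 > 0).
points = [(0.5, 2.0), (1.0, 1.0), (3.0, 0.25), (10.0, 7.5)]
all_equal = all(
    math.isclose(expr_product(a, b), expr_sum_of_logs(a, b), abs_tol=1e-12)
    and math.isclose(expr_product(a, b), expr_expanded(a, b), abs_tol=1e-12)
    for a, b in points
)
```

A search procedure that treats these as three distinct candidates evaluates the same function three times, which is the redundancy that representing equivalence classes is meant to remove.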

Gentle Manipulation Policy Learning via Demonstrations from VLM Planned Atomic Skills

Authors: Jiayu Zhou, Qiwei Wu, Jian Li, Zhe Chen, Xiaogang Xiong, Renjing Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.05855
Pdf link: https://arxiv.org/pdf/2511.05855
Abstract Autonomous execution of long-horizon, contact-rich manipulation tasks traditionally requires extensive real-world data and expert engineering, posing significant cost and scalability challenges. This paper proposes a novel framework integrating hierarchical semantic decomposition, reinforcement learning (RL), visual language models (VLMs), and knowledge distillation to overcome these limitations. Complex tasks are decomposed into atomic skills, with RL-trained policies for each primitive exclusively in simulation. Crucially, our RL formulation incorporates explicit force constraints to prevent object damage during delicate interactions. VLMs perform high-level task decomposition and skill planning, generating diverse expert demonstrations. These are distilled into a unified policy via Visual-Tactile Diffusion Policy for end-to-end execution. We conduct comprehensive ablation studies exploring different VLM-based task planners to identify optimal demonstration generation pipelines, and systematically compare imitation learning algorithms for skill distillation. Extensive simulation experiments and physical deployment validate that our approach achieves policy learning for long-horizon manipulation without costly human demonstrations, while the VLM-guided atomic skill framework enables scalable generalization to diverse tasks.

MCP-RiskCue: Can LLM infer risk information from MCP server System Logs?

Authors: Jiayi Fu, Qiyao Sun
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.05867
Pdf link: https://arxiv.org/pdf/2511.05867
Abstract Large language models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false negatives. While models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false positives, Reinforcement Learning from Verifiable Reward (RLVR) offers a better precision-recall balance. In particular, after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at: this https URL
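
The two failure modes named in the abstract, missed risky logs and over-flagged benign logs, map onto recall and precision respectively. The sketch below uses invented labels to show how each failure mode moves the two metrics. It does not use any data from the benchmark.

```python
def precision_recall(predicted, actual):
    """Precision and recall for a binary 'risky log' detector."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up ground truth: three risky logs followed by three benign logs.
actual = [True, True, True, False, False, False]
over_flagging = [True, True, True, True, True, True]        # many false positives
under_flagging = [True, False, False, False, False, False]  # many false negatives

over_scores = precision_recall(over_flagging, actual)
under_scores = precision_recall(under_flagging, actual)
```

An over-flagging detector keeps recall at 1.0 but halves precision here, while an under-flagging one keeps precision at 1.0 and loses most of its recall.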

Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs

Authors: Renfei Zhang, Manasa Kaniselvan, Niloofar Mireshghallah
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05933
Pdf link: https://arxiv.org/pdf/2511.05933
Abstract Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally, our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement "code 57.95 refers to urinary infection") maintain high cosine similarity between SFT and RL models, query representations (e.g., "what is code 57.95") diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.
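
The activation comparison relies on cosine similarity between representation vectors. The following sketch computes it for made-up three-dimensional vectors standing in for SFT and RL activations. Real activations are high-dimensional, and none of these numbers come from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical activations for the same inputs under the two models.
fact_sft, fact_rl = [1.0, 2.0, 3.0], [1.1, 2.0, 2.9]    # factual statement
query_sft, query_rl = [1.0, 0.0, 0.0], [0.2, 1.0, 0.0]  # query phrasing

fact_similarity = cosine_similarity(fact_sft, fact_rl)
query_similarity = cosine_similarity(query_sft, query_rl)
```

A high value for the fact pair next to a low value for the query pair mirrors the pattern the abstract reports: similar stored knowledge, different handling of the question.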

Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling

Authors: Qi Wang, Hongzhi Zhang, Jia Fu, Kai Fu, Yahui Liu, Tinghai Zhang, Chenxi Sun, Gangwei Jiang, Jingyi Tang, Xingguang Ji, Yang Yue, Jingyuan Zhang, Fuzheng Zhang, Kun Gai, Guorui Zhou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05951
Pdf link: https://arxiv.org/pdf/2511.05951
Abstract Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform extensive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.

Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration

Authors: Zhixin Cheng, Xiaotian Yin, Jiacheng Deng, Bohao Liao, Yujia Chen, Xu Zhou, Baoqun Yin, Tianzhu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05965
Pdf link: https://arxiv.org/pdf/2511.05965
Abstract Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
中文摘要 图像到点云配准的典型免检测方法利用基于 transformer 的架构来聚合跨模态特征并建立对应关系。然而，它们经常在具有挑战性的条件下挣扎，在这种条件下，噪声会破坏相似性计算并导致不正确的对应关系。此外，如果没有专门的设计，仍然很难有效地选择跨模态的信息丰富且相关的表示，从而限制了配准的稳健性和准确性。为了应对这些挑战，我们提出了一种由两个关键模块组成的新型跨模态注册框架：迭代代理选择（IAS）模块和可靠代理交互（RAI）模块。IAS通过相位图增强结构特征意识，并采用强化学习原理有效地选择可靠的智能体。然后，RAI 利用这些选定的代理来指导跨模态交互，有效减少错配并提高整体鲁棒性。在 RGB-D 场景 v2 和 7-Scenes 基准测试上的大量实验表明，我们的方法始终如一地实现了最先进的性能。

DWM-RO: Decentralized World Models with Reasoning Offloading for SWIPT-enabled Satellite-Terrestrial HetNets

DWM-RO：支持SWIPT的星地HetNet的具有推理卸载的去中心化世界模型

Authors: Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Dusit Niyato, Jiawen Kang, Sumei Sun, Abbas Jamalipour, Ping Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.05972
Pdf link: https://arxiv.org/pdf/2511.05972
Abstract Wireless networks are undergoing a paradigm shift toward massive connectivity with energy-efficient operation, driving the integration of satellite-terrestrial architectures with simultaneous wireless information and power transfer (SWIPT). Optimizing transmit beamforming and power splitting in such systems faces formidable challenges, e.g., time-varying channels and multi-tier interference, which create a complex decision landscape where conventional model-free multi-agent reinforcement learning (MARL) suffers from sample inefficiency due to rarely-encountered state transitions and poor coordination as decentralized agents act independently. This paper proposes the Decentralized World Model with Reasoning Offloading (DWM-RO) framework to address these fundamental limitations. Specifically, each agent employs a world model to learn compact predictive representations of environment dynamics, enabling imagination-based policy training that dramatically reduces required environment interactions. An uncertainty-aware offloading gate monitors local interference levels and model reconstruction errors to trigger selective edge coordination. When activated, a lightweight latent decorrelation mechanism at the edge refines agents' strategic representations, guiding them toward orthogonal actions that minimize resource conflicts. Extensive simulations demonstrate that DWM-RO converges 5 times faster than state-of-the-art baselines while achieving 34.7% higher spectral efficiency and reducing constraint violations by 40%. In dense network scenarios with 10 users, DWM-RO maintains violation rates below 20% while baselines exceed 70%, validating superior robustness.
中文摘要 无线网络正在经历向具有节能运行的大规模连接的范式转变，推动了卫星-地面架构与同步无线信息和能量传输（SWIPT）的集成。在此类系统中优化发射波束成形和功率分配面临着艰巨的挑战，例如时变信道和多层干扰，这创造了一个复杂的决策环境，传统的无模型多智能体强化学习（MARL）由于很少遇到的状态转换而样本效率低下，并且由于去中心化智能体各自独立行动而协调性差。本文提出了具有推理卸载的去中心化世界模型（DWM-RO）框架来解决这些基本局限性。具体来说，每个智能体都采用一个世界模型来学习环境动态的紧凑预测表示，从而实现基于想象的策略训练，显著减少所需的环境交互。不确定性感知卸载门监控局部干扰水平和模型重建误差，以触发选择性边缘协调。激活后，边缘的轻量级潜在去相关机制会细化智能体的战略表示，引导它们采取正交行动，最大限度地减少资源冲突。广泛的仿真表明，DWM-RO 的收敛速度比最先进的基线快 5 倍，同时频谱效率提高了 34.7%，并将约束违规减少了 40%。在10个用户的密集网络场景下，DWM-RO的违规率保持在20%以下，而基线超过70%，验证了卓越的鲁棒性。

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

重温大型推理模型强化学习中的熵

Authors: Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, Deyi Xiong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.05993
Pdf link: https://arxiv.org/pdf/2511.05993
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant approach for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, causing premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To address this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR. Moreover, we theoretically and empirically demonstrate that tokens with positive advantages are the primary contributors to entropy collapse, and that model entropy can be effectively regulated by adjusting the relative loss weights of tokens with positive and negative advantages during training.
中文摘要 具有可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的主要方法。然而，LLM的熵通常会在RLVR训练期间崩溃，导致过早收敛到次优的局部最小值，并阻碍了进一步的性能改进。尽管已经提出了各种方法来减轻熵崩溃，但仍然缺乏对 RLVR 熵的全面研究。为了解决这一差距，我们进行了广泛的实验来研究使用 RLVR 训练的 LLM 的熵动力学，并分析模型熵如何与各种基准的响应多样性、校准和性能相关。我们的研究结果表明，离策略更新次数、训练数据的多样性以及优化目标中的裁剪阈值是影响使用 RLVR 训练的 LLM 熵的关键因素。此外，我们从理论和实证上证明，具有正优势的token是熵崩溃的主要贡献者，并且可以通过在训练过程中调整具有正负优势的token的相对损失权重来有效调节模型熵。
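
A minimal sketch of the two quantities this abstract turns on: the entropy of a next-token distribution, and a policy-gradient loss in which tokens with positive and negative advantages receive separate weights. The function names and the weights `w_pos` and `w_neg` are illustrative assumptions; the abstract does not give the paper's actual objective.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def weighted_pg_loss(logprobs, advantages, w_pos=1.0, w_neg=1.0):
    """REINFORCE-style loss with separate weights by advantage sign.

    logprobs[i]   : log-probability of sampled token i under the policy
    advantages[i] : advantage estimate for token i
    w_pos, w_neg  : hypothetical relative loss weights (not from the paper)
    """
    total = 0.0
    for logprob, advantage in zip(logprobs, advantages):
        weight = w_pos if advantage > 0 else w_neg
        total += -weight * advantage * logprob
    return total / len(logprobs)
```

Under this sketch, choosing `w_pos` smaller than `w_neg` down-weights updates from positive-advantage tokens, which the abstract identifies as the primary contributors to entropy collapse.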

Probe-and-Release Coordination of Platoons at Highway Bottlenecks with Unknown Parameters

参数未知的公路瓶颈处排的探放协调

Authors: Yi Gao, Xi Xiong, Karl H. Johansson, Li Jin
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.06026
Pdf link: https://arxiv.org/pdf/2511.06026
Abstract This paper considers coordination of platoons of connected and autonomous vehicles (CAVs) at mixed-autonomy bottlenecks in the face of three practically important factors, viz. time-varying traffic demand, random CAV platoon sizes, and capacity breakdowns. Platoon coordination is essential to smooth the interaction between CAV platoons and non-CAV traffic. Based on a fluid queuing model, we develop a "probe-and-release" algorithm that simultaneously estimates environmental parameters and coordinates CAV platoons for traffic stabilization. We show that this algorithm ensures bounded estimation errors and bounded traffic queues. The proof builds on a Lyapunov function that jointly penalizes estimation errors and traffic queues and a drift argument for an embedded Markov process. We validate the proposed algorithm in a standard micro-simulation environment and compare against a representative deep reinforcement learning method in terms of control performance and computational efficiency.
中文摘要 本文考虑了在混合自动驾驶瓶颈下联网和自动驾驶汽车（CAV）队列的协调，同时考虑了三个实际重要因素，即随时间变化的交通需求、随机的CAV队列规模和通行能力骤降。队列协调对于平滑 CAV 队列和非 CAV 交通之间的互动至关重要。基于流体排队模型，我们开发了一种“探测和释放”算法，该算法可以同时估计环境参数并协调 CAV 队列以稳定交通。我们表明，该算法确保了有界的估计误差和有界的交通队列。该证明建立在一个 Lyapunov 函数之上，该函数共同惩罚估计误差和交通队列，并结合了针对嵌入式马尔可夫过程的漂移论证。我们在标准的微观仿真环境中验证了所提算法，并在控制性能和计算效率方面与具有代表性的深度强化学习方法进行了比较。
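
For readers unfamiliar with fluid queuing models, the sketch below simulates a generic discrete-time fluid queue at a bottleneck: the queue grows when inflow exceeds capacity and never goes negative. This is a textbook form offered as an assumption; the abstract does not specify the paper's model, and `simulate_fluid_queue` and its parameters are hypothetical names.

```python
def simulate_fluid_queue(inflows, capacity, dt=1.0, q0=0.0):
    """Return the queue length after each step of a discrete-time fluid queue.

    inflows  : arrival rate in each time step
    capacity : bottleneck discharge rate (assumed constant here)
    dt       : step length
    q0       : initial queue length
    """
    queue = q0
    history = []
    for inflow in inflows:
        # The queue accumulates excess inflow and is floored at zero.
        queue = max(0.0, queue + (inflow - capacity) * dt)
        history.append(queue)
    return history
```

A capacity breakdown could be imitated by passing a lower `capacity` for some steps, which this simplified version would need to be extended to support.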

ScRPO: From Errors to Insights

ScRPO：从错误到洞察

Authors: Lianrui Li, Dakuan Lu, Jiawei Shao, Chi Zhang, Xuelong Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.06065
Pdf link: https://arxiv.org/pdf/2511.06065
Abstract We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
中文摘要 我们提出了自我纠正相对策略优化（ScRPO），这是一种新颖的强化学习框架，旨在通过利用自我反思和纠错来增强大型语言模型在具有挑战性的数学问题上的表现。我们的方法包括两个阶段：（1）试错学习阶段：用GRPO训练模型，并在错误池中收集错误答案及其相应的问题;（2）自我纠正学习阶段：引导模型反思其先前的答案为何错误。我们使用 Deepseek-Distill-Qwen-1.5B 和 Deepseek-Distill-Qwen-7B 在多个数学推理基准测试中进行了广泛的实验，包括 AIME、AMC、Olympiad、MATH-500、GSM8k。实验结果表明，ScRPO 始终优于几种后训练方法。这些发现凸显了 ScRPO 是一种有前途的范式，使语言模型能够在有限的外部反馈下自我改进困难的任务，为更可靠、更强大的人工智能系统铺平道路。

Approximating Shapley Explanations in Reinforcement Learning

强化学习中的近似 Shapley 解释

Authors: Daniel Beechey, Özgür Şimşek
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06094
Pdf link: https://arxiv.org/pdf/2511.06094
Abstract Reinforcement learning has achieved remarkable success in complex decision-making environments, yet its lack of transparency limits its deployment in practice, especially in safety-critical settings. Shapley values from cooperative game theory provide a principled framework for explaining reinforcement learning; however, the computational cost of Shapley explanations is an obstacle to their use. We introduce FastSVERL, a scalable method for explaining reinforcement learning by approximating Shapley values. FastSVERL is designed to handle the unique challenges of reinforcement learning, including temporal dependencies across multi-step trajectories, learning from off-policy data, and adapting to evolving agent behaviours in real time. FastSVERL introduces a practical, scalable approach for principled and rigorous interpretability in reinforcement learning.
中文摘要 强化学习在复杂的决策环境中取得了显着的成功，但其缺乏透明度限制了其在实践中的部署，特别是在安全关键环境中。合作博弈论中的 Shapley 值为解释强化学习提供了一个原则框架;然而，Shapley 解释的计算成本是其使用的障碍。我们介绍了 FastSVERL，这是一种通过近似 Shapley 值来解释强化学习的可扩展方法。FastSVERL 旨在应对强化学习的独特挑战，包括跨多步骤轨迹的时间依赖关系、从策略外数据中学习以及实时适应不断变化的智能体行为。FastSVERL 引入了一种实用的、可扩展的方法，用于强化学习中的原则性和严格的可解释性。
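
As background on why Shapley explanations are costly and what "approximating" them can mean, here is a minimal sketch of the classic permutation-sampling estimator of Shapley values. It is a generic Monte Carlo method, not FastSVERL itself; `shapley_monte_carlo`, `value_fn`, and the parameter names are assumptions for illustration.

```python
import random


def shapley_monte_carlo(players, value_fn, num_samples=1000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings.

    players  : list of hashable player identifiers
    value_fn : maps a frozenset of players to that coalition's value
    """
    rng = random.Random(seed)
    totals = {player: 0.0 for player in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value_fn(frozenset(coalition))
            # Marginal contribution of this player in this ordering.
            totals[player] += current - previous
            previous = current
    return {player: totals[player] / num_samples for player in players}
```

In a reinforcement learning setting the players would typically be state features and `value_fn` some characteristic function over feature subsets. The exact Shapley value averages over all orderings, which grows factorially with the number of features; how FastSVERL defines and approximates these quantities is not stated in the abstract.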

Guardian-regularized Safe Offline Reinforcement Learning for Smart Weaning of Mechanical Circulatory Devices

用于机械循环装置智能脱机的 Guardian 正则化安全离线强化学习

Authors: Aysin Tumay, Sophia Sun, Sonia Fereidooni, Aaron Dumas, Elise Jortberg, Rose Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06111
Pdf link: https://arxiv.org/pdf/2511.06111
Abstract We study the sequential decision-making problem for automated weaning of mechanical circulatory support (MCS) devices in cardiogenic shock patients. MCS devices are percutaneous micro-axial flow pumps that provide left ventricular unloading and forward blood flow, but current weaning strategies vary significantly across care teams and lack data-driven approaches. Offline reinforcement learning (RL) has proven to be successful in sequential decision-making tasks, but our setting presents challenges for training and evaluating traditional offline RL methods: prohibition of online patient interaction, highly uncertain circulatory dynamics due to concurrent treatments, and limited data availability. We developed an end-to-end machine learning framework with two key contributions: (1) Clinically-aware OOD-regularized Model-based Policy Optimization (CORMPO), a density-regularized offline RL algorithm for out-of-distribution suppression that also incorporates clinically-informed reward shaping, and (2) a Transformer-based probabilistic digital twin that models MCS circulatory dynamics for policy evaluation with rich physiological and clinical metrics. We prove that CORMPO achieves theoretical performance guarantees under mild assumptions. CORMPO attains a higher reward than the offline RL baselines by 28% and higher scores in clinical metrics by 82.6% on real and synthetic datasets. Our approach offers a principled framework for safe offline policy learning in high-stakes medical applications where domain expertise and safety constraints are essential.
中文摘要 我们研究了心源性休克患者机械循环支持（MCS）装置自动脱机的顺序决策问题。MCS 设备是经皮微轴流泵，可提供左心室卸载和向前血流，但目前的脱机策略因护理团队而异，并且缺乏数据驱动的方法。离线强化学习（RL）已被证明在顺序决策任务中是成功的，但我们的设置给训练和评估传统的离线 RL 方法带来了挑战：禁止在线患者互动、由于同时治疗导致的循环动力学高度不确定，以及数据可用性有限。我们开发了一个端到端机器学习框架，具有两个关键贡献：（1）临床感知 OOD 正则化基于模型的策略优化（CORMPO），一种用于分布外抑制的密度正则化离线 RL 算法，还结合了临床知情的奖励塑造；（2）基于 Transformer 的概率数字孪生，用于模拟 MCS 循环动力学，以丰富的生理和临床指标进行策略评估。我们证明了 CORMPO 在温和的假设下实现了理论性能保证。在真实和合成数据集上，CORMPO 的奖励比离线 RL 基线高 28%，临床指标得分高 82.6%。我们的方法为在领域专业知识和安全约束至关重要的高风险医疗应用中的安全离线策略学习提供了一个原则框架。

A Deep Learning Model for Predicting Transformation Legality

一种用于预测转换合法性的深度学习模型

Authors: Avani Tiwari, Yacine Hakimi, Riyadh Baghdadi
Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06120
Pdf link: https://arxiv.org/pdf/2511.06120
Abstract Compilers must check the legality of code transformations to guarantee the correctness of applying a sequence of code transformations to a given code. While such a legality check needs to be precisely computed in general, we can use an approximate legality prediction model in certain cases, such as training a reinforcement learning (RL) agent for schedule prediction. In this paper, we propose an approximate method for legality checks: a novel DL model for predicting the legality of transformations. The model takes the code representation and a list of transformations as input and predicts whether applying those transformations to the code is legal. We implement and evaluate the proposed model, demonstrating its effectiveness. Our evaluation shows an F1 score of 0.91 on a test set of randomly generated programs. To further evaluate the model in a practical scenario, we used the model to replace the legality check used during the training of an RL agent designed for automatic code optimization. We demonstrate that such a replacement enables the agent to train on twice as many steps, resulting in faster training and reducing resource usage by approximately 80% for CPU and 35% for RAM. The agent trained using this approach maintains comparable performance, with only a 4% reduction on benchmarks from the Polybench suite compared to the traditional method.
中文摘要 编译器必须检查代码转换的合法性，以保证将一系列代码转换应用于给定代码的正确性。虽然这种合法性检查通常需要精确计算，但在某些情况下，我们可以使用近似的合法性预测模型，例如训练强化学习（RL）代理进行调度预测。本文提出了一种近似的合法性检查方法：一种用于预测转换合法性的新型DL模型。该模型将代码表示形式和转换列表作为输入，并预测将这些转换应用于代码是否合法。我们实施并评估所提出的模型，证明其有效性。我们的评估显示，在由随机生成的程序组成的测试集上，F1 得分为 0.91。为了在实际场景中进一步评估该模型，我们使用该模型来替换在训练专为自动代码优化而设计的 RL 代理期间使用的合法性检查。我们证明，这种替换使代理能够训练两倍的步骤，从而加快训练速度，并将 CPU 的资源使用量减少约 80%，RAM 的资源使用量减少约 35%。使用这种方法训练的代理保持了相当的性能，与传统方法相比，在 Polybench 套件的基准测试上仅降低了 4%。

Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Maestro：通过条件列表式策略优化学习多智能体 LLM 的协作

Authors: Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, Jesse Thomason
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.06134
Pdf link: https://arxiv.org/pdf/2511.06134
Abstract Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a list-wise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.
中文摘要 基于大型语言模型（LLM）构建的多智能体系统（MAS）被用于解决复杂问题，并且可以超越单一模型推理。然而，它们的成功取决于驾驭一个基本的认知张力：需要平衡对解决方案空间的广泛、发散的探索与对最优解决方案的有原则的收敛综合。现有范式往往难以管理这种二元性，导致过早达成共识、错误传播以及无法区分真正的推理和表面合理的论点的关键信用分配问题。为了解决这一核心挑战，我们提出了通过角色编排的多智能体探索-综合框架（Maestro），这是一种原则性的协作范式，在结构上解耦了这些认知模式。Maestro 使用一组并行执行代理进行多样化的探索，并使用专门的中央代理进行收敛、评估性综合。为了实施这一关键综合阶段，我们引入了条件列表式策略优化（CLPO），这是一种强化学习目标，可以理清战略决策和战术原理的信号。通过将以决策为中心的策略梯度与针对理由的列表式排序损失相结合，CLPO 实现了干净的信用分配和更强的比较监督。数学推理和一般问题解决基准的实验表明，Maestro 与 CLPO 相结合，始终优于现有的最先进的多智能体方法，平均提供 6% 的绝对准确率提升，最多可达 10%。
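
The abstract describes CLPO as a decision-focused policy-gradient term combined with a list-wise ranking loss over justifications, without giving the formula. The sketch below shows one way such a combination could look, using the Plackett-Luce negative log-likelihood as a stand-in list-wise loss. Every name here, and the choice of Plackett-Luce, is an assumption for illustration rather than the paper's objective.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def plackett_luce_nll(scores_in_rank_order):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_in_rank_order : model scores listed from the best-ranked
                           item to the worst-ranked item
    """
    nll = 0.0
    for k in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[k:]
        # Probability that the k-th ranked item is picked first
        # among the items not yet placed.
        nll -= remaining[0] - _logsumexp(remaining)
    return nll


def combined_loss(decision_logprob, advantage, scores_in_rank_order,
                  ranking_weight=1.0):
    """Policy-gradient term plus a weighted list-wise ranking term."""
    policy_term = -advantage * decision_logprob
    return policy_term + ranking_weight * plackett_luce_nll(scores_in_rank_order)
```

The ranking term is smallest when the model's scores decrease along the target ranking, so it rewards ordering justifications correctly relative to one another rather than scoring each in isolation.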

When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks

当以对象为中心的世界模型遇上策略学习：从像素到策略，以及它在何处失效

Authors: Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06136
Pdf link: https://arxiv.org/pdf/2511.06136
Abstract Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
中文摘要 以对象为中心的世界模型（OCWM）旨在将视觉场景分解为对象级表示，提供结构化抽象，从而提高强化学习中的组合泛化和数据效率。我们假设，通过定位任务相关信息，显式解缠的对象级表示可以增强跨新特征组合的策略性能。为了验证这一假设，我们引入了 DLPWM，这是一种完全无监督、解缠的以对象为中心的世界模型，它直接从像素中学习对象级潜在数据。DLPWM 实现了强大的重建和预测性能，包括对多种分布外（OOD）视觉变化的鲁棒性。然而，当用于基于下游模型的控制时，与 DreamerV3 相比，在 DLPWM 潜在上训练的策略表现不佳。通过潜在轨迹分析，我们确定多对象交互过程中的表征转变是不稳定策略学习的关键驱动因素。我们的结果表明，尽管以对象为中心的感知支持鲁棒的视觉建模，但实现稳定的控制需要减轻潜在漂移。

MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

MALinZero：掌握复杂多智能体规划的高效低维搜索

Authors: Sizhe Tang, Jiayu Chen, Tian Lan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06142
Pdf link: https://arxiv.org/pdf/2511.06142
Abstract Monte Carlo Tree Search (MCTS), which leverages Upper Confidence Bound for Trees (UCTs) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low-dimensional representational structures on joint-action returns and enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into the low-dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and $\mu$-smooth loss functions -- in order to place more importance on better joint actions and mitigate potential representational limitations -- and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose a $(1-\tfrac1e)$-approximation algorithm for the joint action selection by maximizing a sub-modular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.
中文摘要 蒙特卡洛树搜索（MCTS）利用树上置信上界（UCT）通过随机采样平衡探索与利用，有助于解决复杂的规划问题。然而，对于多智能体规划，MCTS 面临着一个巨大的组合动作空间，该空间通常随着智能体数量的增加而呈指数级增长。因此，MCTS 在树扩展过程中的分支因子也呈指数级增长，使得在树搜索过程中难以进行高效的探索和利用。为此，我们提出了 MALinZero，这是一种利用联合动作回报的低维表征结构的新方法，可在复杂的多智能体规划中实现高效的 MCTS。我们的解决方案可以被视为将联合动作回报投影到可用上下文线性赌博机问题公式表示的低维空间中。我们用凸且 $\mu$-光滑的损失函数求解上下文线性赌博机问题，以更加重视更好的联合动作并减轻潜在的表征限制，并推导出应用于树的线性置信上界（LinUCT），以在低维空间中实现新型的多智能体探索和利用。我们分析了 MALinZero 在低维奖励函数下的遗憾（regret），并提出了一种通过最大化子模目标进行联合动作选择的 $(1-\tfrac1e)$ 近似算法。MALinZero 在矩阵博弈、SMAC 和 SMACv2 等多智能体基准测试中展示了最先进的性能，以更快的学习速度和更好的性能优于基于模型和无模型的多智能体强化学习基线。
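
To make the starting point of this abstract concrete, the sketch below shows the standard UCB1 score used by UCT for child selection, together with the size of a joint action space when every agent has the same number of actions. This is the classical baseline the abstract builds on, not the LinUCT rule the paper derives; the function names and the exploration constant `c` are illustrative assumptions.

```python
import math


def uct_score(mean_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score for a child node in UCT.

    Assumes parent_visits >= 1. Unvisited children score infinity
    so that each is tried at least once.
    """
    if visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the action with the highest UCT score.

    children : list of (action, mean_value, visits) tuples
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]


def joint_action_count(actions_per_agent, num_agents):
    """Number of joint actions when each agent has the same action set."""
    return actions_per_agent ** num_agents
```

With 5 actions per agent, 3 agents already give 125 joint actions and 10 agents give 9,765,625, which is the exponential branching factor the abstract refers to.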

Elastic Data Transfer Optimization with Hybrid Reinforcement Learning

使用混合强化学习进行弹性数据传输优化

Authors: Rasman Mubtasim Swargo, Md Arifuzzaman
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.06159
Pdf link: https://arxiv.org/pdf/2511.06159
Abstract Modern scientific data acquisition generates petabytes of data that must be transferred to geographically distant computing clusters. Conventional tools either rely on preconfigured sessions, which are difficult to tune for users without domain expertise, or they adaptively optimize only concurrency while ignoring other important parameters. We present \name, an adaptive data transfer method that jointly considers multiple parameters. Our solution incorporates heuristic-based parallelism, infinite pipelining, and a deep reinforcement learning based concurrency optimizer. To make agent training practical, we introduce a lightweight network simulator that reduces training time to less than four minutes and provides a $2750\times$ speedup compared to online training. Experimental evaluation shows that \name consistently outperforms existing methods across diverse datasets, achieving up to 9.5x higher throughput compared to state-of-the-art solutions.
中文摘要 现代科学数据采集会产生 PB 级的数据，这些数据必须传输到地理上遥远的计算集群。传统工具要么依赖于预配置的会话，而没有领域专业知识的用户很难对这些会话进行调优，要么它们仅自适应地优化并发性而忽略其他重要参数。我们提出了 \name，一种共同考虑多个参数的自适应数据传输方法。我们的解决方案结合了基于启发式的并行性、无限流水线和基于深度强化学习的并发优化器。为了使代理训练变得实用，我们引入了一种轻量级网络模拟器，该模拟器将训练时间缩短到不到四分钟，并且与在线训练相比实现了 $2750\times$ 的加速。实验评估表明，\name 在不同数据集中始终优于现有方法，与最先进的解决方案相比，吞吐量最高提高到 9.5 倍。

OpenVLN: Open-world aerial Vision-Language Navigation

OpenVLN：开放世界空中视觉语言导航

Authors: Peican Lin, Gan Sun, Chenxi Liu, Fazeng Li, Weihong Ren, Yang Cong
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.06182
Pdf link: https://arxiv.org/pdf/2511.06182
Abstract Vision-language models (VLMs) have been widely applied in ground-based vision-language navigation (VLN). However, the vast complexity of outdoor aerial environments compounds data acquisition challenges and imposes long-horizon trajectory planning requirements on Unmanned Aerial Vehicles (UAVs), introducing novel complexities for aerial VLN. To address these challenges, we propose a data-efficient Open-world aerial Vision-Language Navigation (i.e., OpenVLN) framework, which could execute language-guided flight with limited data constraints and enhance long-horizon trajectory planning capabilities in complex aerial environments. Specifically, we reconfigure a reinforcement learning framework to optimize the VLM for UAV navigation tasks, which can efficiently fine-tune VLM by using rule-based policies under limited training data. Concurrently, we introduce a long-horizon planner for trajectory synthesis that dynamically generates precise UAV actions via value-based rewards. To this end, we conduct extensive navigation experiments on the TravelUAV benchmark with dataset scaling across diverse reward settings. Our method demonstrates consistent performance gains of up to 4.34% in Success Rate, 6.19% in Oracle Success Rate, and 4.07% in Success weighted by Path Length over baseline methods, validating its deployment efficacy for long-horizon UAV navigation in complex aerial environments.
中文摘要 视觉语言模型（VLM）在地面视觉语言导航（VLN）中得到了广泛的应用。然而，室外空中环境的巨大复杂性加剧了数据采集挑战，并对无人机（UAV）提出了长时域轨迹规划要求，从而给空中 VLN 带来了新的复杂性。为了应对这些挑战，我们提出了一种数据高效的开放世界空中视觉语言导航（OpenVLN）框架，该框架可以在有限的数据约束下执行语言引导飞行，并增强复杂空中环境中的长时域轨迹规划能力。具体来说，我们重新配置了一个强化学习框架，以优化无人机导航任务的VLM，在有限的训练数据下，通过使用基于规则的策略，可以有效地微调VLM。同时，我们引入了一种用于轨迹合成的长时域规划器，它通过基于价值的奖励动态生成精确的无人机动作。为此，我们在 TravelUAV 基准测试上进行了广泛的导航实验，并在不同的奖励设置中扩展了数据集。与基线方法相比，我们的方法取得了持续的性能提升：成功率最高提升 4.34%，Oracle 成功率最高提升 6.19%，按路径长度加权的成功率最高提升 4.07%，验证了其在复杂空中环境中进行长时域无人机导航的部署有效性。
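
The metrics named in this abstract can be sketched as follows, using the definition of Success weighted by Path Length (SPL) that is common in the embodied navigation literature: each successful episode is credited with the ratio of the shortest path length to the longer of the agent's path and the shortest path. Whether the TravelUAV benchmark uses exactly this form is an assumption, and the function names are illustrative.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes        : 1 for success, 0 for failure, per episode
    shortest_lengths : shortest start-to-goal path length per episode
    path_lengths     : length of the path the agent actually flew
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        # A detour shrinks the credit; failures contribute zero.
        total += success * shortest / max(taken, shortest)
    return total / len(successes)
```

For example, three episodes with outcomes `[1, 0, 1]`, shortest paths of 10 each, and flown paths `[10, 5, 20]` give a success rate of 2/3 and an SPL of (1 + 0 + 0.5) / 3 = 0.5.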

Deep Reinforcement Learning for Dynamic Origin-Destination Matrix Estimation in Microscopic Traffic Simulations Considering Credit Assignment

考虑信用分配的微观交通模拟中动态起点-目的地矩阵估计的深度强化学习

Authors: Donggyu Min, Seongjin Choi, Dong-Kyu Kim
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06229
Pdf link: https://arxiv.org/pdf/2511.06229
Abstract This paper focuses on dynamic origin-destination matrix estimation (DODE), a crucial calibration process necessary for the effective application of microscopic traffic simulations. The fundamental challenge of the DODE problem in microscopic simulations stems from the complex temporal dynamics and inherent uncertainty of individual vehicle dynamics. This makes it highly challenging to precisely determine which vehicle traverses which link at any given moment, resulting in intricate and often ambiguous relationships between origin-destination (OD) matrices and their contributions to resultant link flows. This phenomenon constitutes the credit assignment problem, a central challenge addressed in this study. We formulate the DODE problem as a Markov Decision Process (MDP) and propose a novel framework that applies model-free deep reinforcement learning (DRL). Within our proposed framework, the agent learns an optimal policy to sequentially generate OD matrices, refining its strategy through direct interaction with the simulation environment. The proposed method is validated on the Nguyen-Dupuis network using SUMO, where its performance is evaluated against ground-truth link flows aggregated at 5-minute intervals over a 30-minute horizon. Experimental results demonstrate that our approach achieves a 43.2% reduction in mean squared error (MSE) compared to the best-performing conventional baseline. By reframing DODE as a sequential decision-making problem, our approach addresses the credit assignment challenge through its learned policy, thereby overcoming the limitations of conventional methods and proposing a novel framework for calibration of microscopic traffic simulations.
中文摘要 本文重点介绍动态起点-目的地矩阵估计（DODE），这是有效应用微观交通模拟所必需的关键校准过程。微观模拟中DODE问题的根本挑战源于复杂的时间动力学和单个车辆动力学的固有不确定性。这使得在任何给定时刻精确确定哪个车辆穿越哪个链路变得极具挑战性，导致起点-目的地（OD）矩阵及其对最终链路流的贡献之间存在复杂且通常模棱两可的关系。这种现象构成了信用分配问题，这是本研究解决的一个核心挑战。我们将DODE问题表述为马尔可夫决策过程（MDP），并提出了一种应用无模型深度强化学习（DRL）的新框架。在我们提出的框架内，智能体学习了一种最佳策略来按顺序生成OD矩阵，通过与模拟环境的直接交互来完善其策略。所提出的方法使用 SUMO 在 Nguyen-Dupuis 网络上进行了验证，其中根据 30 分钟范围内以 5 分钟间隔聚合的真实链路流评估其性能。实验结果表明，与性能最佳的传统基线相比，我们的方法实现了 43.2% 的均方误差（MSE）降低。通过将DODE重新定义为一个顺序决策问题，我们的方法通过其学习策略解决了信用分配挑战，从而克服了传统方法的局限性，并提出了一种用于微观交通模拟校准的新框架。
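
The evaluation described here compares simulated link flows with ground truth aggregated at 5-minute intervals over 30 minutes, which gives 6 values per link. The sketch below computes the mean squared error over all links and intervals, plus the relative reduction used to report an improvement over a baseline. The data layout (a dict from link id to a list of interval flows) and the function names are assumptions for illustration.

```python
def link_flow_mse(simulated, observed):
    """Mean squared error over every (link, interval) pair.

    simulated, observed : dict mapping link id -> list of flows,
                          one value per aggregation interval
    """
    squared_error = 0.0
    count = 0
    for link, observed_flows in observed.items():
        for sim, obs in zip(simulated[link], observed_flows):
            squared_error += (sim - obs) ** 2
            count += 1
    return squared_error / count


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline."""
    return (baseline_error - new_error) / baseline_error
```

Read this way, a 43.2% reduction means the new MSE is 56.8% of the baseline MSE; for instance, `relative_reduction(100.0, 56.8)` evaluates to approximately 0.432.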

MrCoM: A Meta-Regularized World-Model Generalizing Across Multi-Scenarios

MrCoM：跨多场景泛化的元正则化世界模型

Authors: Xuantang Xiong, Ni Mu, Runpeng Xie, Senhao Yang, Yaqing Wang, Lexiang Wang, Yao Luan, Siyuan Li, Shuang Xu, Yiqin Yang, Bo Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06252
Pdf link: https://arxiv.org/pdf/2511.06252
Abstract Model-based reinforcement learning (MBRL) is a crucial approach to enhance the generalization capabilities and improve the sample efficiency of RL algorithms. However, current MBRL methods focus primarily on building world models for single tasks and rarely address generalization across different scenarios. Building on the insight that dynamics within the same simulation engine share inherent properties, we attempt to construct a unified world model capable of generalizing across different scenarios, named Meta-Regularized Contextual World-Model (MrCoM). This method first decomposes the latent state space into various components based on the dynamic characteristics, thereby enhancing the accuracy of world-model prediction. Further, MrCoM adopts meta-state regularization to extract unified representation of scenario-relevant information, and meta-value regularization to align world-model optimization with policy learning across diverse scenario objectives. We theoretically analyze the generalization error upper bound of MrCoM in multi-scenario settings. We systematically evaluate our algorithm's generalization ability across diverse scenarios, demonstrating significantly better performance than previous state-of-the-art methods.
中文摘要 基于模型的强化学习（MBRL）是增强RL算法泛化能力、提高采样效率的重要途径。然而，当前的 MBRL 方法主要侧重于为单个任务构建世界模型，很少解决跨不同场景的泛化问题。基于同一模拟引擎中的动力学具有固有属性的见解，我们尝试构建一个能够跨不同场景进行泛化的统一世界模型，称为元正则化上下文世界模型（MrCoM）。该方法首先根据动态特征将潜态空间分解为各种分量，从而提高世界模型预测的准确性。此外，MrCoM 采用元状态正则化来提取场景相关信息的统一表示，并采用元值正则化来使世界模型优化与跨不同场景目标的策略学习保持一致。从理论上分析了多场景设置下MrCoM的泛化误差上限。我们系统地评估了我们的算法在不同场景下的泛化能力，展示了比以前最先进的方法更好的性能。

VideoSSR: Video Self-Supervised Reinforcement Learning

VideoSSR：视频自监督强化学习

Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.06281
Pdf link: https://arxiv.org/pdf/2511.06281
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at this https URL.
中文摘要 具有可验证奖励的强化学习（RLVR）极大地提高了多模态大型语言模型（MLLM）的视频理解能力。然而，MLLM 的快速发展已经超过了现有视频数据集的复杂性，而手动注释新的高质量数据仍然昂贵得令人望而却步。这项工作调查了一个关键问题：视频中丰富的内在信息能否被用来自行生成高质量、可验证的训练数据？为了研究这一点，我们引入了三个自监督前置任务（pretext task）：异常定位、对象计数和时序拼图。我们构建了视频内在理解基准（VIUBench）来验证它们的难度，表明当前最先进的 MLLM 在这些任务上表现不佳。基于这些前置任务，我们开发了VideoSSR-30K数据集，并提出了VideoSSR，这是一种用于RLVR的新型视频自监督强化学习框架。跨越四个主要视频领域（通用视频 QA、长视频 QA、时序定位和复杂推理）的 17 个基准测试的广泛实验表明，VideoSSR 持续增强模型性能，平均改进超过 5%。这些结果将 VideoSSR 确立为在 MLLM 中开发更高级视频理解的有力基础框架。该代码可在此 https URL 中找到。

What Makes Reasoning Invalid: Echo Reflection Mitigation for Large Language Models

是什么让推理无效：大型语言模型的回声反射缓解

Authors: Chen He, Xun Jiang, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Xing Xu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06380
Pdf link: https://arxiv.org/pdf/2511.06380
Abstract Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks. Recent methods have further improved LLM performance in complex mathematical reasoning. However, when extending these methods beyond the domain of mathematical reasoning to tasks involving complex domain-specific knowledge, we observe a consistent failure of LLMs to generate novel insights during the reflection stage. Instead of conducting genuine cognitive refinement, the model tends to mechanically reiterate earlier reasoning steps without introducing new information or perspectives, a phenomenon referred to as "Echo Reflection". We attribute this behavior to two key defects: (1) Uncontrollable information flow during response generation, which allows premature intermediate thoughts to propagate unchecked and distort final decisions; (2) Insufficient exploration of internal knowledge during reflection, leading to repeating earlier findings rather than generating new cognitive insights. Building on these findings, we propose a novel reinforcement learning method termed Adaptive Entropy Policy Optimization (AEPO). Specifically, the AEPO framework consists of two major components: (1) Reflection-aware Information Filtration, which quantifies the cognitive information flow and prevents the final answer from being affected by earlier bad cognitive information; (2) Adaptive-Entropy Optimization, which dynamically balances exploration and exploitation across different reasoning stages, promoting both reflective diversity and answer correctness. Extensive experiments demonstrate that AEPO consistently achieves state-of-the-art performance over mainstream reinforcement learning baselines across diverse benchmarks.
中文摘要 大型语言模型（LLM）在广泛的推理任务中表现出了卓越的性能。最近的方法进一步提高了 LLM 在复杂数学推理中的表现。然而，当将这些方法扩展到数学推理领域之外，扩展到涉及复杂领域特定知识的任务时，我们观察到 LLM 在反思阶段始终无法产生新的见解。该模型没有进行真正的认知细化，而是倾向于机械地重申早期的推理步骤，而不引入新的信息或观点，这种现象被称为“回声反射”。我们将这种行为归因于两个关键缺陷：（1）在响应生成过程中信息流不可控，这使得过早的中间思想传播不受控制并扭曲最终决策;（2）反思过程中对内部知识的探索不足，导致重复早期的发现，而不是产生新的认知见解。基于这些发现，我们提出了一种新的强化学习方法，称为自适应熵策略优化（AEPO）。具体来说，AEPO框架由两大组成部分组成：（1）反思感知信息过滤，量化认知信息流，防止最终答案受到早期不良认知信息的影响;（2）自适应熵优化，动态平衡不同推理阶段的探索和利用，促进反思多样性和答案正确性。广泛的实验表明，AEPO 在不同基准上始终比主流强化学习基线实现最先进的性能。

Dynamic Electric Vehicle Charging Pricing for Load Balancing in Power Distribution Networks based on Collaborative DDPG Agents

基于协同DDPG代理的配电网负载均衡动态电动汽车充电定价

Authors: Leloko J. Lepolesa, Kayode E. Adetunji, Khmaies Ouahada, Zhenqing Liu, Ling Cheng
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.06398
Pdf link: https://arxiv.org/pdf/2511.06398
Abstract The transition from the Internal Combustion Engine Vehicles (ICEVs) to the Electric Vehicles (EVs) is globally recommended to combat the unfavourable environmental conditions caused by reliance on fossil fuels. However, it has been established that the charging of EVs can destabilize the grid when they penetrate the market in large numbers, especially in grids that were not initially built to handle the load from the charging of EVs. In this work, we present a dynamic EV charging pricing strategy that fulfills the following three objectives: distribution network-level load peak-shaving, valley-filling, and load balancing across distribution networks. Based on historical environmental variables such as temperature, humidity, wind speed, EV charging prices and distribution of vehicles in different areas at different times of the day, we first forecast the distribution network load demand, and then use a deep reinforcement learning approach to set the optimal dynamic EV charging price. While most research seeks to achieve load peak-shaving and valley-filling to stabilize the grid, our work goes further into exploring the load-balancing between the distribution networks in close vicinity to each other. We compare the performance of Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) algorithms for this purpose. The best algorithm is used for dynamic EV pricing. Simulation results show an improved utilization of the grid at the distribution network level, leading to the optimal usage of the grid on a larger scale.
中文摘要 全球建议从内燃机汽车（ICEV）过渡到电动汽车（EV），以应对因依赖化石燃料而造成的不利环境条件。然而，已经确定，当电动汽车大量渗透市场时，电动汽车的充电可能会破坏电网的稳定，尤其是在最初不是为了处理电动汽车充电负载而建造的电网中。在这项工作中，我们提出了一种动态的电动汽车充电定价策略，可实现以下三个目标：配电网级负荷削峰、填谷和跨配电网的负载平衡。基于温度、湿度、风速、电动汽车充电价格以及一天中不同时间段不同区域车辆分布等历史环境变量，我们首先预测配电网负荷需求，然后使用深度强化学习方法设置最优的动态电动汽车充电价格。虽然大多数研究都试图实现负荷削峰和填谷以稳定电网，但我们的工作进一步探索彼此附近配电网之间的负荷平衡。为此，我们比较了深度确定性策略梯度（DDPG）、软参与者批评（SAC）和近端策略优化（PPO）算法的性能。最好的算法用于动态电动汽车定价。仿真结果表明，配电网层面电网利用率有所提高，从而在更大范围内实现电网的优化利用。

SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

SofT-GRPO：通过 Gumbel 重参数化软思维策略优化超越离散 token 的 LLM 强化学习

Authors: Zhi Zheng, Wee Sun Lee
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06411
Pdf link: https://arxiv.org/pdf/2511.06411
Abstract The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects the Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Codes and weights are available on this https URL
中文摘要 大型语言模型（LLM）推理的软思维范式在某些情况下可以优于传统的离散代币思维链（CoT）推理，凸显了其研究和应用价值。然而，虽然离散标记CoT推理模式可以通过群体相对策略优化（GRPO）等策略优化算法得到强化，但通过强化学习（RL）扩展软思维模式仍然具有挑战性。这种困难源于向软思维代币注入随机性并相应更新软思维策略的复杂性。因此，之前将软思维与 GRPO 相结合的尝试通常表现不佳于离散代币 GRPO 的对应物。为了充分释放软思维的潜力，本文提出了一种新的策略优化算法SofT-GRPO，以强化软思维推理模式下的LLM。SofT-GRPO 将 Gumbel 噪声注入 logits，采用 Gumbel-Softmax 技术来避免预训练嵌入空间之外的软思维标记，并利用策略梯度中的重新参数化技巧。我们对基本 LLM 进行了 1.5B 到 7B 参数的实验，结果表明，SofT-GRPO 使软思维的 LLM 在 Pass@1 上略优于离散标记 GRPO（平均准确率 +0.13%），同时在Pass@32上表现出显着提升（平均准确率 +2.19%）。代码和权重在此 https URL 上可用
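
Illustrative sketch (not the authors' code): the abstract says SofT-GRPO injects Gumbel noise into logits and uses Gumbel-Softmax so that soft-thinking tokens stay within the pre-trained embedding space. The Python below shows only the generic Gumbel-Softmax step followed by a weighted mix of token embeddings. The toy vocabulary, the temperature value, and the helper names `gumbel_softmax` and `soft_token` are assumptions for illustration; the policy-gradient update is not shown.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Perturb logits with Gumbel(0, 1) noise, then apply a tempered softmax."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so the double log stays finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(weights, embeddings):
    """Convex combination of embedding rows, weighted by the soft distribution."""
    dim = len(embeddings[0])
    return [sum(w * row[d] for w, row in zip(weights, embeddings)) for d in range(dim)]


rng = random.Random(0)
weights = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token vocabulary
mixed = soft_token(weights, embeddings)
```

Since the weights are non-negative and sum to one, the mixed vector lies in the convex hull of the existing token embeddings, which is one way to read the abstract's remark about staying inside the embedding space.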

CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models

CG-TTRL：面向设备大型语言模型的上下文引导测试时强化学习

Authors: Peyman Hosseini, Ondrej Bohdal, Taha Ceritli, Ignacio Castro, Matthew Purver, Mete Ozay, Umberto Michieli
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.06430
Pdf link: https://arxiv.org/pdf/2511.06430
Abstract Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models for complex tasks at test-time, resulting in large performance improvements. TTRL leverages an elegant two-phase sampling strategy: first, multi-sampling derives a pseudo-label via majority voting, while subsequent downsampling and reward-based fine-tuning encourages the model to explore and learn diverse valid solutions, with the pseudo-label modulating the reward signal. Meanwhile, in-context learning has been widely explored at inference time and demonstrated the ability to enhance model performance without weight updates. However, TTRL's two-phase sampling strategy under-utilizes contextual guidance, which can potentially improve pseudo-label accuracy in the initial exploitation phase while regulating exploration in the second. To address this, we propose context-guided TTRL (CG-TTRL), integrating context dynamically into both sampling phases and propose a method for efficient context selection for on-device applications. Our evaluations on mathematical and scientific QA benchmarks show CG-TTRL outperforms TTRL (e.g. additional 7% relative accuracy improvement over TTRL), while boosting efficiency by obtaining strong performance after only a few steps of test-time training (e.g. 8% relative improvement rather than 1% over TTRL after 3 steps).
中文摘要 测试时强化学习（TTRL）在测试时为复杂任务调整基础模型方面显示出前景，从而大幅提高性能。TTRL 利用了一种优雅的两阶段采样策略：首先，多重采样通过多数投票推导出伪标签，而随后的下采样和基于奖励的微调鼓励模型探索和学习各种有效解决方案，伪标签调制奖励信号。同时，上下文学习在推理时得到了广泛的探索，并证明了无需权重更新即可增强模型性能的能力。然而，TTRL 的两阶段采样策略没有充分利用上下文引导，这可能会在初始开发阶段提高伪标签准确性，同时调节第二个开发阶段的探索。为了解决这个问题，我们提出了上下文引导的TTRL（CG-TTRL），将上下文动态集成到两个采样阶段，并提出了一种为设备上的应用程序进行高效上下文选择的方法。我们对数学和科学 QA 基准的评估表明，CG-TTRL 优于 TTRL（例如，比 TTRL 额外提高了 7% 的相对准确性），同时通过仅在几个步骤的测试时间训练后获得强大的性能来提高效率（例如，3 步后比 TTRL 提高 8%，而不是 1%）。
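
Illustrative sketch (not the authors' code): the abstract describes TTRL's first phase as deriving a pseudo-label by majority voting over multiple samples, with the pseudo-label then modulating the reward on a downsampled subset. The Python below assumes the simplest form of that idea, a 0/1 agreement reward. The helper names and the subset choice are hypothetical, and the context-guidance part of CG-TTRL is not modelled.

```python
from collections import Counter


def majority_pseudo_label(answers):
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def pseudo_rewards(answers, pseudo_label):
    """Assumed reward: 1.0 if a response agrees with the pseudo-label, else 0.0."""
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


samples = ["42", "41", "42", "42", "7"]   # final answers parsed from sampled responses
label = majority_pseudo_label(samples)
rewards = pseudo_rewards(samples[:3], label)  # downsampled subset used for the update
```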

Sim-to-Real Transfer in Deep Reinforcement Learning for Bipedal Locomotion

双足运动深度强化学习中的模拟到现实迁移

Authors: Lingfan Bao, Tianhu Peng, Chengxu Zhou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.06465
Pdf link: https://arxiv.org/pdf/2511.06465
Abstract This chapter addresses the critical challenge of simulation-to-reality (sim-to-real) transfer for deep reinforcement learning (DRL) in bipedal locomotion. After contextualizing the problem within various control architectures, we dissect the "curse of simulation" by analyzing the primary sources of sim-to-real gap: robot dynamics, contact modeling, state estimation, and numerical solvers. Building on this diagnosis, we structure the solutions around two complementary philosophies. The first is to shrink the gap through model-centric strategies that systematically improve the simulator's physical fidelity. The second is to harden the policy, a complementary approach that uses in-simulation robustness training and post-deployment adaptation to make the policy inherently resilient to model inaccuracies. The chapter concludes by synthesizing these philosophies into a strategic framework, providing a clear roadmap for developing and evaluating robust sim-to-real solutions.
中文摘要 本章解决了双足运动中深度强化学习（DRL）的模拟到现实（sim-to-real）传输的关键挑战。在将问题置于各种控制架构中的情境之后，我们通过分析模拟与真实差距的主要来源来剖析“模拟的诅咒”：机器人动力学、接触建模、状态估计和数值求解器。在此诊断的基础上，我们围绕两种互补的理念构建解决方案。首先是通过以模型为中心的策略来缩小差距，系统地提高模拟器的物理保真度。第二种是强化策略，这是一种补充方法，它使用模拟中的鲁棒性训练和部署后适应，使策略本质上能够适应模型不准确。本章最后将这些理念综合到一个战略框架中，为开发和评估强大的模拟到真实解决方案提供了清晰的路线图。

Brain-Inspired Planning for Better Generalization in Reinforcement Learning

在强化学习中实现更好泛化的类脑规划

Authors: Mingde "Harry" Zhao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06470
Pdf link: https://arxiv.org/pdf/2511.06470
Abstract Existing Reinforcement Learning (RL) systems encounter significant challenges when applied to real-world scenarios, primarily due to poor generalization across environments that differ from their training conditions. This thesis explores the direction of enhancing agents' zero-shot systematic generalization abilities by granting RL agents reasoning behaviors that are found to help systematic generalization in the human brain. Inspired by human conscious planning behaviors, we first introduced a top-down attention mechanism, which allows a decision-time planning agent to dynamically focus its reasoning on the most relevant aspects of the environmental state given its instantaneous intentions, a process we call "spatial abstraction". This approach significantly improves systematic generalization outside the training tasks. Subsequently, building on spatial abstraction, we developed the Skipper framework to automatically decompose complex tasks into simpler, more manageable sub-tasks. Skipper provides robustness against distributional shifts and efficacy in long-term, compositional planning by focusing on pertinent spatial and temporal elements of the environment. Finally, we identified a common failure mode and safety risk in planning agents that rely on generative models to generate state targets during planning. It is revealed that most agents blindly trust the targets they hallucinate, resulting in delusional planning behaviors. Inspired by how the human brain rejects delusional intentions, we propose learning a feasibility evaluator to enable rejecting hallucinated infeasible targets, which led to significant performance improvements in various kinds of planning agents. Finally, we suggest directions for future research, aimed at achieving general task abstraction and fully enabling abstract planning.
中文摘要 现有的强化学习（RL）系统在应用于现实场景时遇到了重大挑战，这主要是由于跨环境的泛化性差，与其训练条件不同。本论文探讨了通过赋予强化学习智能体有助于人脑系统泛化的推理行为来增强智能体零样本系统泛化能力的方向。受人类有意识的规划行为的启发，我们首先引入了一种自上而下的注意力机制，它允许决策时间规划代理根据其瞬时意图动态地将其推理集中在环境状态最相关的方面，我们称之为“空间抽象”。这种方法显着提高了训练任务之外的系统泛化。随后，在空间抽象的基础上，我们开发了 Skipper 框架，以自动将复杂的任务分解为更简单、更易于管理的子任务。Skipper 通过关注环境的相关空间和时间元素，在长期组合规划中提供针对分布变化的鲁棒性和功效。最后，我们确定了在规划过程中依赖生成模型生成状态目标的规划代理的常见故障模式和安全风险。据透露，大多数智能体盲目相信他们产生幻觉的目标，从而导致妄想计划行为。受人脑如何拒绝妄想意图的启发，我们建议学习一种可行性评估器，以拒绝幻觉的不可行目标，从而显着提高各种规划代理的性能。最后，我们提出了未来研究的方向，旨在实现一般任务抽象并充分实现抽象规划。

Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models

放大漫画：区域感知 RL 提高了视觉语言模型中细粒度的漫画理解

Authors: Yule Chen, Yufan Ren, Sabine Süsstrunk
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06490
Pdf link: https://arxiv.org/pdf/2511.06490
Abstract Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmarks and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs' capabilities in this domain, we systematically investigate post-training strategies, including supervised fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories (SFT-R), and reinforcement learning (RL). Beyond that, inspired by the emerging "Thinking with Images" paradigm, we propose Region-Aware Reinforcement Learning (RARL) for VLMs, which trains models to dynamically attend to relevant regions through zoom-in operations. We observe that when applied to the Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering, paving the way for more accurate and efficient VLM applications in the comics domain.
中文摘要 复杂的视觉叙事，如漫画，对视觉语言模型（VLM）提出了重大挑战。尽管 VLM 擅长自然图像，但经常在风格化的线条艺术、拟声词和密集的多面板布局方面遇到困难。为了弥补这一差距，我们推出了 AI4VA-FG，这是第一个基于 VLM 的漫画理解的细粒度和综合基准。它涵盖从基础识别和检测到高级角色推理和叙事构建的任务，并由角色、姿势和深度的密集注释提供支持。除此之外，我们还评估了最先进的专有模型，包括 GPT-4o 和 Gemini-2.5，以及开源模型，如 Qwen2.5-VL，揭示了我们基准测试的核心任务存在巨大的性能缺陷，并强调漫画理解仍然是一个未解决的挑战。为了增强VLM在该领域的能力，我们系统地研究了训练后策略，包括解决方案的监督微调（SFT-S）、推理轨迹的监督微调（SFT-R）和强化学习（RL）。除此之外，受新兴的“用图像思考”范式的启发，我们提出了针对 VLM 的区域感知强化学习（RARL），它训练模型通过放大操作动态关注相关区域。我们观察到，当应用于Qwen2.5-VL模型时，RL和RARL在低级实体识别和高级故事情节排序方面产生了显著的收益，为漫画领域更准确、更高效的VLM应用铺平了道路。

SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

SportR：体育多模态大语言模型推理的标杆

Authors: Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.06499
Pdf link: https://arxiv.org/pdf/2511.06499
Abstract Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
中文摘要 深入理解体育运动需要细粒度视觉感知和基于规则的推理的复杂结合——这一挑战突破了当前多模态模型的极限。为了取得成功，模型必须掌握三个关键能力：感知细微的视觉细节，应用抽象的运动规则知识，并将这些知识建立在特定的视觉证据中。当前的体育基准要么涵盖单项运动，要么缺乏在多运动背景下稳健评估这些核心能力所需的详细推理链和精确的视觉基础。为了解决这一差距，我们推出了 SportR，这是第一个多运动大规模基准测试，旨在训练和评估 MLLM 关于运动智能所需的基本推理。我们的基准测试提供了包含 5,017 张图像和 2,101 个视频的数据集。为了实现精细评估，我们围绕问答（QA）对的渐进层次结构构建了基准，旨在从简单的违规识别到复杂的处罚预测，以不断深入地探测推理。对于需要多步骤推理的最高级任务，例如确定惩罚或解释策略，我们提供了 7,118 个高质量的、人类创作的思维链（CoT）注释。此外，我们的基准测试结合了图像和视频模态，并提供手动边界框注释，以直接测试图像部分的视觉接地。大量的实验证明了我们基准的巨大难度。最先进的基线模型在我们最具挑战性的任务中表现不佳。虽然通过监督微调和强化学习对我们的数据进行训练提高了这些分数，但它们仍然相对较低，凸显了当前模型能力的巨大差距。SportR 为社区提出了新的挑战，为推动多模态体育推理的未来研究提供了关键资源。

Adaptive PID Control for Robotic Systems via Hierarchical Meta-Learning and Reinforcement Learning with Physics-Based Data Augmentation

基于物理的数据增强，通过分层元学习和强化学习对机器人系统进行自适应 PID 控制

Authors: JiaHao Wu, ShengWen Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.06500
Pdf link: https://arxiv.org/pdf/2511.06500
Abstract Proportional-Integral-Derivative (PID) controllers remain the predominant choice in industrial robotics due to their simplicity and reliability. However, manual tuning of PID parameters for diverse robotic platforms is time-consuming and requires extensive domain expertise. This paper presents a novel hierarchical control framework that combines meta-learning for PID initialization and reinforcement learning (RL) for online adaptation. To address the sample efficiency challenge, a physics-based data augmentation strategy is introduced that generates virtual robot configurations by systematically perturbing physical parameters, enabling effective meta-learning with limited real robot data. The proposed approach is evaluated on two heterogeneous platforms: a 9-DOF Franka Panda manipulator and a 12-DOF Laikago quadruped robot. Experimental results demonstrate that the proposed method achieves 16.6% average improvement on Franka Panda (6.26° MAE), with exceptional gains in high-load joints (J2: 80.4% improvement from 12.36° to 2.42°). Critically, this work discovers the optimization ceiling effect: RL achieves dramatic improvements when meta-learning exhibits localized high-error joints, but provides no benefit (0.0%) when baseline performance is uniformly strong, as observed in Laikago. The method demonstrates robust performance under disturbances (parameter uncertainty: +19.2%, no disturbance: +16.6%, average: +10.0%) with only 10 minutes of training time. Multi-seed analysis across 100 random initializations confirms stable performance (4.81 ± 1.64% average). These results establish that RL effectiveness is highly dependent on meta-learning baseline quality and error distribution, providing important design guidance for hierarchical control systems.
中文摘要 比例积分微分（PID）控制器因其简单性和可靠性而仍然是工业机器人的主要选择。然而，手动调整不同机器人平台的 PID 参数非常耗时，并且需要广泛的领域专业知识。本文提出了一种新的分层控制框架，该框架将元学习用于PID初始化和强化学习（RL）相结合，用于在线适应。为了解决样本效率挑战，引入了一种基于物理的数据增强策略，该策略通过系统地扰动物理参数来生成虚拟机器人配置，从而在有限的真实机器人数据下实现有效的元学习。所提出的方法在两个异构平台上进行了评估：一个 9 自由度的 Franka Panda 机械手和一个 12 自由度的 Laikago 四足机器人。实验结果表明，所提方法在Franka Panda（6.26° MAE）上实现了16.6%的平均提升，在高负载关节上具有显著的增益（J2：从12.36°提高到2.42°提高了80.4%）。至关重要的是，这项工作发现了优化天花板效应：当元学习表现出局部高误差关节时，RL取得了显着的改进，但当基线性能均匀强时，RL没有提供任何好处（0.0%），正如在Laikago中观察到的那样。该方法在扰动（参数不确定度：+19.2%，无扰动：+16.6%，平均：+10.0%）下表现出稳健的性能，训练时间仅为10分钟。对 100 次随机初始化的多种子分析证实了稳定的性能（平均值 4.81 ± 1.64%）。这些结果表明，RL的有效性高度依赖于元学习基线质量和误差分布，为分层控制系统提供了重要的设计指导。
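
Illustrative sketch (not from the paper): for readers unfamiliar with the controller being tuned, the Python below is a textbook discrete-time PID loop driving a toy integrator plant. The gains, time step, and plant model are arbitrary assumptions; the paper's meta-learned initialization and RL adaptation of the gains are not shown.

```python
class PID:
    """Textbook discrete PID controller with fixed gains."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                   # accumulate the I term
        derivative = (error - self.prev_error) / self.dt   # finite-difference D term
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy plant (assumption): the control output sets the velocity of a 1-D position.
controller = PID(kp=2.0, ki=0.5, kd=0.01, dt=0.01)
position = 0.0
for _ in range(2000):  # 20 simulated seconds
    command = controller.step(setpoint=1.0, measurement=position)
    position += command * controller.dt
```

In a hierarchical scheme like the one the abstract outlines, `kp`, `ki`, and `kd` would be the quantities that an outer learning loop proposes and adjusts.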

Practical Policy Distillation for Reinforcement Learning in Radio Access Networks

无线接入网强化学习的实用策略提炼

Authors: Sara Khosravi, Burak Demirel, Linghui Zhou, Javier Rasines, Pablo Soldati
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06563
Pdf link: https://arxiv.org/pdf/2511.06563
Abstract Adopting artificial intelligence (AI) in radio access networks (RANs) presents several challenges, including limited availability of link-level measurements (e.g., CQI reports), stringent real-time processing constraints (e.g., sub-1 ms per TTI), and network heterogeneity (different spectrum bands, cell types, and vendor equipment). A critical yet often overlooked barrier lies in the computational and memory limitations of RAN baseband hardware, particularly in legacy 4th Generation (4G) systems, which typically lack on-chip neural accelerators. As a result, only lightweight AI models (under 1 Mb and sub-100 µs inference time) can be effectively deployed, limiting both their performance and applicability. However, achieving strong generalization across diverse network conditions often requires large-scale models with substantial resource demands. To address this trade-off, this paper investigates policy distillation in the context of a reinforcement learning-based link adaptation task. We explore two strategies: single-policy distillation, where a scenario-agnostic teacher model is compressed into one generalized student model; and multi-policy distillation, where multiple scenario-specific teachers are consolidated into a single generalist student. Experimental evaluations in a high-fidelity, 5th Generation (5G)-compliant simulator demonstrate that both strategies produce compact student models that preserve the teachers' generalization capabilities while complying with the computational and memory limitations of existing RAN hardware.
中文摘要 在无线接入网络（RAN）中采用人工智能（AI）面临着一些挑战，包括链路级测量（例如 CQI 报告）的可用性有限、严格的实时处理限制（例如，每个 TTI 低于 1 毫秒）和网络异构性（不同的频段、小区类型和供应商设备）。一个关键但经常被忽视的障碍在于 RAN 基带硬件的计算和内存限制，特别是在传统的第四代（4G）系统中，这些系统通常缺乏片上神经加速器。因此，只有轻量级的 AI 模型（低于 1 Mb，推理时间低于 100 µs）才能有效部署，限制了其性能和适用性。然而，要在不同的网络条件下实现强大的泛化，通常需要具有大量资源需求的大规模模型。为了解决这一权衡问题，本文在基于强化学习的链路适应任务背景下研究了策略蒸馏。我们探索了两种策略：单策略蒸馏，其中与场景无关的教师模型被压缩为一个广义的学生模型;以及多策略蒸馏，将多个特定场景的教师合并为一个通才学生。在符合第五代（5G）标准的高保真模拟器中的实验评估表明，这两种策略都能生成紧凑的学生模型，既保留了教师的泛化能力，又符合现有 RAN 硬件的计算和内存限制。

Underactuated Biomimetic Autonomous Underwater Vehicle for Ecosystem Monitoring

用于生态系统监测的欠驱动仿生自主水下航行器

Authors: Kaustubh Singh, Shivam Kumar, Shashikant Pawar, Sandeep Manjanna
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.06578
Pdf link: https://arxiv.org/pdf/2511.06578
Abstract In this paper, we present an underactuated biomimetic underwater robot that is suitable for ecosystem monitoring in both marine and freshwater environments. We present an updated mechanical design for a fish-like robot and propose minimal actuation behaviors learned using reinforcement learning techniques. We present our preliminary mechanical design of the tail oscillation mechanism and illustrate the swimming behaviors on FishGym simulator, where the reinforcement learning techniques will be tested on
中文摘要 在本文中，我们提出了一种适用于海洋和淡水环境中生态系统监测的欠驱动仿生水下机器人。我们提出了一种类似鱼的机器人的更新机械设计，并提出了使用强化学习技术学习的最小驱动行为。我们展示了尾部摆动机制的初步机械设计，并说明了 FishGym 模拟器上的游泳行为，其中将测试强化学习技术

GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization

GRAPH-GRPO-LEX：具有组相对策略优化的契约图建模和强化学习

Authors: Moriya Dechtiar, Daniel Martin Katz, Mari Sundaresan, Sylvain Jaume, Hongming Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2511.06618
Pdf link: https://arxiv.org/pdf/2511.06618
Abstract Contracts are complex documents featuring detailed formal structures, explicit and implicit dependencies and rich semantic content. Given these document properties, contract drafting and manual examination of contracts have proven to be both arduous and susceptible to errors. This work aims to simplify and automate the task of contract review and analysis using a novel framework for transforming legal contracts into structured semantic graphs, enabling computational analysis and data-driven insights. We introduce a detailed ontology mapping core legal contract elements to their graph-theoretic equivalents of nodes and edges. We then present a reinforcement learning based Large Language Model (LLM) framework for segmentation and extraction of entities and relationships from contracts. Our method, GRAPH-GRPO-LEX, incorporates both LLMs and reinforcement learning with group relative policy optimization (GRPO). By applying a carefully drafted reward function of graph metrics, we demonstrate the ability to automatically identify direct relationships between clauses, and even uncover hidden dependencies. Our introduction of the gated GRPO approach shows a strong learning signal and can move contract analysis from a linear, manual reading process to an easily visualized graph. This allows for a more dynamic analysis, including building the groundwork for contract linting similar to what is now practiced in software engineering.
中文摘要 合约是复杂的文档，具有详细的形式结构、显式和隐式依赖关系以及丰富的语义内容。鉴于这些文件属性，合同起草和合同的人工审查已被证明既艰巨又容易出错。这项工作旨在使用一种新颖的框架将法律合同转换为结构化语义图来简化和自动化合同审查和分析任务，从而实现计算分析和数据驱动的见解。我们引入了一个详细的本体论，将核心法律契约元素映射到它们的图论等价节点和边。然后，我们提出了一个基于强化学习的大型语言模型（LLM）框架，用于从合同中分割和提取实体和关系。我们的方法 GRAPH-GRPO-LEX 将 LLM 和强化学习与组相对策略优化（GRPO）相结合。通过应用精心起草的图形指标奖励函数，我们展示了自动识别子句之间直接关系的能力，甚至发现隐藏的依赖关系。我们引入的门控 GRPO 方法显示出强烈的学习信号，可以将合约分析从线性的手动读取过程转变为易于可视化的图表。这允许进行更动态的分析，包括为合同 linting 奠定基础，类似于现在在软件工程中实践的做法。
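
Illustrative sketch (not the authors' code): the abstract relies on group relative policy optimization (GRPO), which scores each sampled response relative to the other responses for the same prompt. The Python below shows the commonly used group-normalized advantage. The population standard deviation and the small `eps` are implementation assumptions, and the paper's graph-metric reward and gating are not modelled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four responses sampled for the same prompt (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
flat_group = group_relative_advantages([1.0, 1.0])
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is why reward design matters for methods of this kind.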

Secure Low-altitude Maritime Communications via Intelligent Jamming

通过智能干扰保护低空海上通信

Authors: Jiawei Huang, Aimin Wang, Geng Sun, Jiahui Li, Jiacheng Wang, Weijie Yuan, Dusit Niyato, Xianbin Wang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2511.06659
Pdf link: https://arxiv.org/pdf/2511.06659
Abstract Low-altitude wireless networks (LAWNs) have emerged as a viable solution for maritime communications. In these maritime LAWNs, unmanned aerial vehicles (UAVs) serve as practical low-altitude platforms for wireless communications due to their flexibility and ease of deployment. However, the open and clear UAV communication channels make maritime LAWNs vulnerable to eavesdropping attacks. Existing security approaches often assume eavesdroppers follow predefined trajectories, which fails to capture the dynamic movement patterns of eavesdroppers in realistic maritime environments. To address this challenge, we consider a low-altitude maritime communication system that employs intelligent jamming to counter dynamic eavesdroppers with uncertain positioning to enhance the physical layer security. Since such a system requires balancing the conflicting performance metrics of the secrecy rate and energy consumption of UAVs, we formulate a secure and energy-efficient maritime communication multi-objective optimization problem (SEMCMOP). To solve this dynamic and long-term optimization problem, we first reformulate it as a partially observable Markov decision process (POMDP). We then propose a novel soft actor-critic with conditional variational autoencoder (SAC-CVAE) algorithm, which is a deep reinforcement learning algorithm improved by generative artificial intelligence. Specifically, the SAC-CVAE algorithm employs advantage-conditioned latent representations to disentangle and optimize policies, while enhancing computational efficiency by reducing the state space dimension. Simulation results demonstrate that our proposed intelligent jamming approach achieves secure and energy-efficient maritime communications.
中文摘要 低空无线网络（LAWN）已成为海上通信的可行解决方案。在这些海上 LAWN 中，无人机（UAV）因其灵活性和易于部署而成为无线通信的实用低空平台。然而，开放和清晰的无人机通信通道使得海上 LAWN 容易受到窃听攻击。现有的安全方法通常假设窃听者遵循预定义的轨迹，这无法捕捉窃听者在现实海上环境中的动态运动模式。为了应对这一挑战，我们考虑了一种低空海上通信系统，该系统利用智能干扰来应对定位不确定的动态窃听者，以增强物理层的安全性。由于这样的系统需要平衡无人机的保密率和能耗等相互冲突的性能指标，因此我们制定了一个安全、节能的海上通信多目标优化问题（SEMCMOP）。为了解决这个动态和长期的优化问题，我们首先将其重新表述为部分可观察的马尔可夫决策过程（POMDP）。然后，我们提出了一种新型的带条件变分自动编码器的软演员-评论家（SAC-CVAE）算法，这是一种通过生成式人工智能改进的深度强化学习算法。具体来说，SAC-CVAE算法采用优势条件的潜在表示来解开和优化策略，同时通过减少状态空间维度来提高计算效率。仿真结果表明，我们提出的智能干扰方法实现了安全、节能的海上通信。

Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

从难度区分视角重新审视多模态后训练中的数据采样

Authors: Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.06722
Pdf link: https://arxiv.org/pdf/2511.06722
Abstract Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at this https URL.
中文摘要 多模态大型语言模型（MLLM）的最新进展刺激了思维链（CoT）推理的重大进展。在 Deepseek-R1 成功的基础上，研究人员将多模态推理扩展到基于强化学习（RL）的训练后范式，主要关注数学数据集。然而，现有的训练后范式往往忽略了两个关键方面：（1）缺乏能够战略性地筛选样本以进行训练后优化的可量化难度指标。（2）次优训练后范式，无法共同优化感知和推理能力。为了解决这一差距，我们提出了两种新颖的难度感知采样策略：渐进式图像语义掩蔽（PISM）通过系统图像降级来量化样本硬度，而跨模态注意力平衡（CMAB）通过注意力分布分析评估跨模态交互复杂性。利用这些指标，我们设计了一个分层训练框架，该框架结合了仅 GRPO 和 SFT+GRPO 混合训练范式，并在六个基准数据集中对其进行评估。实验表明，与传统的SFT+GRPO管道相比，GRPO应用于难度分层样品具有一致的优越性，表明战略性数据采样可以消除监督微调的需要，同时提高模型精度。我们的代码将在此 https URL 上发布。

Physically-Grounded Goal Imagination: Physics-Informed Variational Autoencoder for Self-Supervised Reinforcement Learning

物理基础目标想象：用于自监督强化学习的物理知情变分自动编码器

Authors: Lan Thi Ha Nguyen, Kien Ton Manh, Anh Do Duc, Nam Pham Hai
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06745
Pdf link: https://arxiv.org/pdf/2511.06745
Abstract Self-supervised goal-conditioned reinforcement learning enables robots to autonomously acquire diverse skills without human supervision. However, a central challenge is the goal setting problem: robots must propose feasible and diverse goals that are achievable in their current environment. Existing methods like RIG (Visual Reinforcement Learning with Imagined Goals) use a variational autoencoder (VAE) to generate goals in a learned latent space but tend to produce physically implausible goals that hinder learning efficiency. We propose Physics-Informed RIG (PI-RIG), which integrates physical constraints directly into the VAE training process through a novel Enhanced Physics-Informed Variational Autoencoder (Enhanced p3-VAE). Our key innovation is the explicit separation of the latent space into physics variables governing object dynamics and environmental factors capturing visual appearance, while enforcing physical consistency through differential equation constraints and conservation laws. This enables the generation of physically consistent and achievable goals that respect fundamental physical principles such as object permanence, collision constraints, and dynamic feasibility. Through extensive experiments, we demonstrate that this physics-informed goal generation significantly improves the quality of proposed goals, leading to more effective exploration and better skill acquisition in visual robotic manipulation tasks including reaching, pushing, and pick-and-place scenarios.
中文摘要 自监督目标条件强化学习使机器人能够在没有人类监督的情况下自主获得各种技能。然而，一个核心挑战是目标设定问题：机器人必须提出在当前环境中可实现的可行且多样化的目标。现有方法如 RIG（具有想象目标的视觉强化学习）使用变分自动编码器（VAE）在学习的潜在空间中生成目标，但存在产生物理上不合理的目标的局限性，从而阻碍了学习效率。我们提出了物理知情 RIG（PI-RIG），它通过一种新型的增强物理知情变分自动编码器（增强型 p3-VAE）将物理约束直接集成到 VAE 训练过程中，从而能够生成物理上一致且可实现的目标。我们的关键创新是将潜在空间明确分离为控制物体动力学的物理变量和捕捉视觉外观的环境因素，同时通过微分方程约束和守恒定律强制执行物理一致性。这可以生成物理上一致且可实现的目标，这些目标尊重基本物理原理，例如对象持久性、碰撞约束和动态可行性。通过广泛的实验，我们证明，这种基于物理的目标生成显著提高了所提出目标的质量，从而在视觉机器人操纵任务（包括到达、推动以及拾取和放置场景）中实现更有效的探索和更好的技能获取。

OntoTune: Ontology-Driven Learning for Query Optimization with Convolutional Models

OntoTune：本体驱动学习，使用卷积模型进行查询优化

Authors: Songhui Yue, Yang Shao, Sean Hayes
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.06780
Pdf link: https://arxiv.org/pdf/2511.06780
Abstract Query optimization has been studied using machine learning, reinforcement learning, and, more recently, graph-based convolutional networks. Ontology, as a structured, information-rich knowledge representation, can provide context, particularly in learning problems. This paper presents OntoTune, an ontology-based platform for enhancing learning for query optimization. By connecting SQL queries, database metadata, and statistics, the ontology developed in this research shows promise in capturing relationships and important determinants of query performance. This research also develops a method to embed ontologies while preserving as many of the relationships and as much of the key information as possible, before feeding the embeddings into learning algorithms such as tree-based and graph-based convolutional networks. A case study shows how OntoTune's ontology-driven learning delivers performance gains compared with the database system's default query execution.
中文摘要 已经使用机器学习、强化学习以及最近的基于图的卷积网络来研究查询优化。本体作为一种结构化的、信息丰富的知识表示，可以提供上下文，特别是在学习问题中。本文介绍了OntoTune，这是一个基于本体的平台，用于增强查询优化的学习。通过连接 SQL 查询、数据库元数据和统计信息，本研究中开发的本体在捕获查询性能的关系和重要决定因素方面很有希望。这项研究还开发了一种嵌入本体的方法，同时尽可能多地保留关系和关键信息，然后将其输入到基于树和基于图的卷积网络等学习算法中。一个案例研究展示了与数据库系统默认查询执行相比，OntoTune 的本体驱动学习如何带来性能提升。

Controllable Flow Matching for Online Reinforcement Learning

在线强化学习的可控流匹配

Authors: Bin Wang, Boxiang Tao, Haifeng Jing, Hongbo Dou, Zijian Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06816
Pdf link: https://arxiv.org/pdf/2511.06816
Abstract Model-based reinforcement learning (MBRL) typically relies on modeling environment dynamics for data efficiency. However, due to the accumulation of model errors over long-horizon rollouts, such methods often face challenges in maintaining modeling stability. To address this, we propose CtrlFlow, a trajectory-level synthetic method using conditional flow matching (CFM), which directly models the distribution of trajectories from initial states to high-return terminal states without explicitly modeling the environment transition function. Our method ensures optimal trajectory sampling by minimizing the control energy governed by the non-linear Controllability Gramian Matrix, while the generated diverse trajectory data significantly enhances the robustness and cross-task generalization of policy learning. In online settings, CtrlFlow demonstrates better performance on common MuJoCo benchmark tasks than dynamics models and achieves superior sample efficiency compared to standard MBRL methods.
中文摘要 基于模型的强化学习（MBRL）通常依赖于对环境动态进行建模来提高数据效率。然而，由于模型误差在长期部署过程中的积累，此类方法在保持建模稳定性方面往往面临挑战。为了解决这个问题，我们提出了 CtrlFlow，这是一种使用条件流匹配（CFM）的轨迹级合成方法，它直接对从初始状态到高回报终端状态的轨迹分布进行建模，而无需显式对环境转换函数进行建模。该方法通过最小化非线性可控性格米安矩阵控制的控制能量来保证最佳轨迹采样，同时生成的多样化轨迹数据显著增强了策略学习的鲁棒性和跨任务泛化性。在在线设置中，与标准 MBRL 方法相比，CtrlFlow 在常见的 MuJoCo 基准测试任务上表现出比动力学模型更好的性能，并实现了更高的样本效率。

On The Presence of Double-Descent in Deep Reinforcement Learning

关于深度强化学习中双重下降的存在

Authors: Viktor Veselý, Aleksandar Todorov, Matthia Sabatelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.06895
Pdf link: https://arxiv.org/pdf/2511.06895
Abstract The double descent (DD) paradox, where over-parameterized models see generalization improve past the interpolation point, remains largely unexplored in the non-stationary domain of Deep Reinforcement Learning (DRL). We present preliminary evidence that DD exists in model-free DRL, investigating it systematically across varying model capacity using the Actor-Critic framework. We rely on an information-theoretic metric, Policy Entropy, to measure policy uncertainty throughout training. Preliminary results show a clear epoch-wise DD curve; the policy's entrance into the second descent region correlates with a sustained, significant reduction in Policy Entropy. This entropic decay suggests that over-parameterization acts as an implicit regularizer, guiding the policy towards robust, flatter minima in the loss landscape. These findings establish DD as a factor in DRL and provide an information-based mechanism for designing agents that are more general, transferable, and robust.
中文摘要 双重下降（DD）悖论，即过度参数化模型的泛化能力在插值点之后继续改善，在深度强化学习（DRL）的非平稳领域中在很大程度上仍未得到探索。我们提供了初步证据，证明 DD 存在于无模型 DRL 中，并使用 Actor-Critic 框架在不同的模型容量下对其进行了系统研究。我们依靠信息论指标“策略熵”来衡量整个训练过程中的策略不确定性。初步结果显示出清晰的逐轮次（epoch-wise）DD 曲线；策略进入第二下降区域与策略熵的持续、显著降低相关。这种熵衰减表明，过度参数化充当隐式正则化器，引导策略趋向损失景观中稳健、更平坦的最小值。这些发现将 DD 确立为 DRL 中的一个因素，并为设计更通用、可迁移和更稳健的智能体提供了一种基于信息的机制。
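
The abstract above tracks Policy Entropy as its measure of policy uncertainty. As a rough illustration only, the sketch below computes the Shannon entropy of a discrete action distribution; the function name `policy_entropy` and the example distributions are assumptions made here, and the paper's actual measurement (for instance, how entropy is averaged over states) is not described in the abstract.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    `probs` is assumed to be a list of action probabilities for one state.
    """
    total = sum(probs)
    if not math.isclose(total, 1.0, rel_tol=1e-6, abs_tol=1e-6):
        raise ValueError("action probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions: a uniform policy and a nearly deterministic one.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
```

A uniform policy attains the maximum entropy `log(n)`, while a nearly deterministic policy scores close to zero, which is the direction of change the abstract associates with the second descent.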

Fine-Tuning Diffusion-Based Recommender Systems via Reinforcement Learning with Reward Function Optimization

通过奖励函数优化强化学习微调基于扩散的推荐系统

Authors: Yu Hou, Hua Li, Ha Young Kim, Won-Yong Shin
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2511.06937
Pdf link: https://arxiv.org/pdf/2511.06937
Abstract Diffusion models recently emerged as a powerful paradigm for recommender systems, offering state-of-the-art performance by modeling the generative process of user-item interactions. However, training such models from scratch is computationally expensive and yields diminishing returns once convergence is reached. To remedy these challenges, we propose ReFiT, a new framework that integrates Reinforcement learning (RL)-based Fine-Tuning into diffusion-based recommender systems. In contrast to prior RL approaches for diffusion models that depend on external reward models, ReFiT adopts a task-aligned design: it formulates the denoising trajectory as a Markov decision process (MDP) and incorporates a collaborative signal-aware reward function that directly reflects recommendation quality. By tightly coupling the MDP structure with this reward signal, ReFiT empowers the RL agent to exploit high-order connectivity for fine-grained optimization, while avoiding the noisy or uninformative feedback common in naive reward designs. Leveraging policy gradient optimization, ReFiT maximizes the exact log-likelihood of observed interactions, thereby enabling effective post hoc fine-tuning of diffusion recommenders. Comprehensive experiments on wide-ranging real-world datasets demonstrate that the proposed ReFiT framework (a) exhibits substantial performance gains over strong competitors (up to 36.3% on sequential recommendation), (b) demonstrates strong efficiency with linear complexity in the number of users or items, and (c) generalizes well across multiple diffusion-based recommendation scenarios. The source code and datasets are publicly available at the URL given on the arXiv abstract page.
中文摘要 扩散模型最近成为推荐系统的强大范式，通过对用户-项目交互的生成过程进行建模来提供最先进的性能。然而，从头开始训练此类模型不仅计算成本高昂，而且一旦达到收敛，收益就会递减。为了解决这些挑战，我们提出了 ReFiT，这是一个新框架，它将基于强化学习（RL）的微调集成到基于扩散的推荐系统中。与之前依赖外部奖励模型的扩散模型的 RL 方法相比，ReFiT 采用了任务对齐设计：它将去噪轨迹表述为马尔可夫决策过程（MDP），并结合了直接反映推荐质量的协作信号感知奖励函数。通过将 MDP 结构与该奖励信号紧密耦合，ReFiT 使 RL 代理能够利用高阶连接进行细粒度优化，同时避免朴素奖励设计中常见的嘈杂或无信息反馈。利用策略梯度优化，ReFiT 最大化观察到的交互的精确对数似然，从而实现了扩散推荐器的有效事后微调。对广泛的真实世界数据集的综合实验表明，所提出的 ReFiT 框架（a）与强大的竞争对手相比表现出显著的性能提升（在顺序推荐上高达 36.3%），（b）在用户或项目数量上表现出强大的效率和线性复杂性，以及（c）在多个基于扩散的推荐场景中具有良好的泛化效果。源代码和数据集可通过 arXiv 摘要页面给出的链接公开获取。

Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning

学习专注：在部分可观察强化学习中优先考虑具有结构化注意力机制的信息历史

Authors: Daniel De Dios Allegue, Jinke He, Frans A. Oliehoek
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.06946
Pdf link: https://arxiv.org/pdf/2511.06946
Abstract Transformers have shown strong ability to model long-term dependencies and are increasingly adopted as world models in model-based reinforcement learning (RL) under partial observability. However, unlike natural language corpora, RL trajectories are sparse and reward-driven, making standard self-attention inefficient because it distributes weight uniformly across all past tokens rather than emphasizing the few transitions critical for control. To address this, we introduce structured inductive priors into the self-attention mechanism of the dynamics head: (i) per-head memory-length priors that constrain attention to task-specific windows, and (ii) distributional priors that learn smooth Gaussian weightings over past state-action pairs. We integrate these mechanisms into UniZero, a model-based RL agent with a Transformer-based world model that supports planning under partial observability. Experiments on the Atari 100k benchmark show that most efficiency gains arise from the Gaussian prior, which smoothly allocates attention to informative transitions, while memory-length priors often truncate useful signals with overly restrictive cut-offs. In particular, Gaussian Attention achieves a 77% relative improvement in mean human-normalized scores over UniZero. These findings suggest that in partially observable RL domains with non-stationary temporal dependencies, discrete memory windows are difficult to learn reliably, whereas smooth distributional priors flexibly adapt across horizons and yield more robust data efficiency. Overall, our results demonstrate that encoding structured temporal priors directly into self-attention improves the prioritization of informative histories for dynamics modeling under partial observability.
中文摘要 Transformer 表现出了强大的长期依赖建模能力，并且越来越多地被采用为部分可观测性下基于模型的强化学习（RL）中的世界模型。然而，与自然语言语料库不同的是，RL 轨迹是稀疏的且由奖励驱动的，这使得标准的自注意力效率低下，因为它将权重均匀分布在所有过去的词元（token）上，而不是强调对控制至关重要的少数转移。为了解决这个问题，我们将结构化的归纳先验引入到动态头的自注意力机制中：（i）将注意力限制在特定于任务的窗口内的每个头的记忆长度先验，以及（ii）学习过去状态-动作对上的平滑高斯权重的分布先验。我们将这些机制集成到 UniZero 中，UniZero 是一个基于模型的 RL 代理，具有基于 Transformer 的世界模型，支持部分可观测性下的规划。Atari 100k 基准测试上的实验表明，大多数效率提升来自高斯先验，它平滑地将注意力分配给信息性转移，而记忆长度先验通常会因过于严格的截断点而截断有用的信号。特别是，与 UniZero 相比，Gaussian Attention 在平均人类归一化分数方面实现了 77% 的相对提高。这些发现表明，在具有非平稳时间依赖性的部分可观测 RL 域中，离散记忆窗口难以可靠地学习，而平滑分布先验可以灵活地跨时间范围适应，并产生更稳健的数据效率。总体而言，我们的结果表明，将结构化的时间先验直接编码到自注意力中可以改善部分可观测性下动力学建模的信息历史的优先级。
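
To make the idea of a smooth Gaussian weighting over past steps more concrete, here is a minimal sketch that adds a Gaussian log-prior over temporal distance to raw attention logits before the softmax. This is an illustration under assumptions, not the paper's or UniZero's implementation: the function name `gaussian_prior_attention`, the fixed `mu` and `sigma` (which the paper describes as learned), and the convention that index `d` means "d steps in the past" are all hypothetical.

```python
import math


def gaussian_prior_attention(scores, mu, sigma):
    """Softmax attention over past positions with a Gaussian log-prior.

    scores[d] is the raw attention logit for the token d steps in the past
    (an assumed indexing convention). The prior adds -(d - mu)^2 / (2 sigma^2)
    to each logit, so positions near distance `mu` are favoured smoothly.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    biased = [
        s - ((d - mu) ** 2) / (2.0 * sigma ** 2)
        for d, s in enumerate(scores)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    norm = sum(exps)
    return [e / norm for e in exps]


# With uninformative logits, the attention weights follow the prior alone.
weights = gaussian_prior_attention([0.0] * 8, mu=2.0, sigma=1.0)
```

Unlike a hard memory window, the Gaussian bias never assigns exactly zero weight to a distant step, which matches the abstract's contrast between smooth priors and restrictive cut-offs.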

Learning Quantized Continuous Controllers for Integer Hardware

学习整数硬件的量化连续控制器

Authors: Fabian Kresse, Christoph H. Lampert
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07046
Pdf link: https://arxiv.org/pdf/2511.07046
Abstract Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, comparing favorably to a quantized reference. Finally, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
中文摘要 在嵌入式硬件上部署连续控制强化学习策略需要满足紧张的延迟和功耗预算。小型 FPGA 可以提供这些，但前提是避免了昂贵的浮点管道。我们研究了整数推理策略的量化感知训练（QAT），并提出了一个学习到硬件的管道，该管道可以自动选择低位策略并将它们合成到 Artix-7 FPGA。在五个 MuJoCo 任务中，我们获得了与全精度（FP32）策略具有竞争力的策略网络，但只要仔细选择输入精度，每个权重和每个内部激活值只需要 3 位甚至 2 位。在目标硬件上，所选策略实现了微秒量级的推理延迟，并且每个动作消耗微焦耳量级的能量，与量化参考相比具有优势。最后，我们观察到，与浮点基线相比，量化策略表现出更高的输入噪声鲁棒性。
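
For readers unfamiliar with low-bit quantization, the sketch below shows generic symmetric uniform "fake quantization" of a single value to a signed `bits`-bit integer grid. It is an assumption-laden illustration rather than the authors' pipeline: the function name, the fixed `scale`, and the signed range are choices made here, and the gradient handling used during quantization-aware training (commonly a straight-through estimator) is omitted.

```python
def fake_quantize(x, bits, scale):
    """Quantize x to a signed `bits`-bit integer and map it back to a float.

    Returns (q, x_hat): the integer code and its dequantized value.
    `scale` is the step size of the integer grid (assumed fixed here).
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed range")
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    qmin = -(2 ** (bits - 1))      # e.g. -4 for 3 bits
    qmax = 2 ** (bits - 1) - 1     # e.g. +3 for 3 bits
    q = round(x / scale)           # nearest integer code
    q = max(qmin, min(qmax, q))    # saturate to the representable range
    return q, q * scale


# Hypothetical values: a 3-bit weight grid with step 0.1.
code, approx = fake_quantize(0.26, bits=3, scale=0.1)
```

With 3 bits only eight codes exist, so values outside the grid saturate, which is why the abstract stresses that input precision must be chosen carefully.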

Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation

通过基于强化学习的自适应数据增强改进深度伪造检测

Authors: Yuxuan Zhou, Tao Yu, Wen Huang, Yuheng Zhang, Tao Dai, Shu-Tao Xia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2511.07051
Pdf link: https://arxiv.org/pdf/2511.07051
Abstract The generalization capability of deepfake detectors is critical for real-world use. Data augmentation via synthetic fake face generation effectively enhances generalization, yet current SOTA methods rely on fixed strategies, raising a key question: Is a single static augmentation sufficient, or does the diversity of forgery features demand dynamic approaches? We argue existing methods overlook the evolving complexity of real-world forgeries (e.g., facial warping, expression manipulation), which fixed policies cannot fully simulate. To address this, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework guiding detectors to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples via a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector's current learning state. Central to our approach is integrating reinforcement learning (RL) and causal inference. An RL agent dynamically selects augmentation actions based on detector performance to efficiently explore the vast augmentation space, adapting to increasingly challenging forgeries. Simultaneously, the agent introduces action space variations to generate heterogeneous forgery patterns, guided by causal inference to mitigate spurious correlations, suppressing task-irrelevant biases and focusing on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model's learned representations. Extensive experiments show our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
中文摘要 深度伪造检测器的泛化能力对于现实世界的使用至关重要。通过合成假人脸生成进行数据增强有效地增强了泛化性，但当前的 SOTA 方法依赖于固定策略，这提出了一个关键问题：单一的静态增强是否足够，或者伪造特征的多样性是否需要动态方法？我们认为，现有方法忽视了现实世界伪造（例如面部扭曲、表情操纵）不断变化的复杂性，而固定策略无法完全模拟这些伪造。为了解决这个问题，我们提出了 CRDA（课程强化学习数据增强），这是一种指导检测器逐步掌握从简单到复杂的多域伪造特征的新颖框架。CRDA 通过可配置的伪造操作池合成增强样本，并动态生成适合检测器当前学习状态的对抗样本。我们方法的核心是整合强化学习（RL）和因果推理。RL 代理根据检测器性能动态选择增强动作，以有效地探索广阔的增强空间，适应越来越具有挑战性的伪造。同时，智能体引入动作空间变化以生成异构伪造模式，并以因果推理为指导来减轻虚假相关性，抑制任务无关偏差，并关注因果不变特征。这种集成通过将合成增强模式与模型的学习表示解耦来确保稳健的泛化。大量实验表明，我们的方法显著提高了检测器的泛化性，在多个跨域数据集中优于 SOTA 方法。

Multi-Agent Reinforcement Learning for Deadlock Handling among Autonomous Mobile Robots

用于自主移动机器人死锁处理的多智能体强化学习

Authors: Marcel Müller
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.07071
Pdf link: https://arxiv.org/pdf/2511.07071
Abstract This dissertation explores the application of multi-agent reinforcement learning (MARL) for handling deadlocks in intralogistics systems that rely on autonomous mobile robots (AMRs). AMRs enhance operational flexibility but also increase the risk of deadlocks, which degrade system throughput and reliability. Existing approaches often neglect deadlock handling in the planning phase and rely on rigid control rules that cannot adapt to dynamic operational conditions. To address these shortcomings, this work develops a structured methodology for integrating MARL into logistics planning and operational control. It introduces reference models that explicitly consider deadlock-capable multi-agent pathfinding (MAPF) problems, enabling systematic evaluation of MARL strategies. Using grid-based environments and external simulation software, the study compares traditional deadlock handling strategies with MARL-based solutions, focusing on PPO and IMPALA algorithms under different training and execution modes. Findings reveal that MARL-based strategies, particularly when combined with centralized training and decentralized execution (CTDE), outperform rule-based methods in complex, congested environments. In simpler environments or those with ample spatial freedom, rule-based methods remain competitive due to their lower computational demands. These results highlight that MARL provides a flexible and scalable solution for deadlock handling in dynamic intralogistics scenarios, but requires careful tailoring to the operational context.
中文摘要 本论文探讨了多智能体强化学习（MARL）在处理依赖自主移动机器人（AMR）的内部物流系统中死锁的应用。AMR 增强了操作灵活性，但也增加了死锁的风险，从而降低了系统吞吐量和可靠性。现有方法往往忽略了规划阶段的死锁处理，并依赖于无法适应动态操作条件的严格控制规则。为了解决这些缺点，这项工作开发了一种结构化方法，将 MARL 整合到物流规划和运营控制中。它引入了明确考虑可能发生死锁的多智能体寻路（MAPF）问题的参考模型，从而能够对 MARL 策略进行系统评估。利用基于网格的环境和外部仿真软件，将传统的死锁处理策略与基于 MARL 的解决方案进行了比较，重点关注不同训练和执行模式下的 PPO 和 IMPALA 算法。研究结果表明，在复杂、拥挤的环境中，基于 MARL 的策略，特别是与集中式训练和分散式执行（CTDE）相结合时，优于基于规则的方法。在更简单的环境或具有足够空间自由度的环境中，基于规则的方法由于其较低的计算需求而保持竞争力。这些结果凸显了 MARL 为动态内部物流场景中的死锁处理提供了灵活且可扩展的解决方案，但需要根据运营环境进行仔细定制。

Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture

两个头比一个好：将大型语言模型特征提炼成具有特征分解和混合的小模型

Authors: Tianhao Fu, Xinxin Xu, Weichen Xu, Jue Chen, Ruilong Ren, Bowen Deng, Xinyu Zhao, Jian Cao, Xixin Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07110
Pdf link: https://arxiv.org/pdf/2511.07110
Abstract Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of an LLM as an agent shows strong performance. However, such methods are hindered by their slow inference speed, and most current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM's features. Based on the observations from this investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces a Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over the current distillation method and RL-based market-making strategies.
中文摘要 通过强化学习（RL）做市（MM）在金融交易中引起了广泛关注。随着大型语言模型（LLMs）的发展，越来越多的尝试将LLM应用于金融领域。LLM 作为代理的简单、直接应用显示出显着的性能。此类方法因其推理速度慢而受到阻碍，而目前的大多数研究尚未针对这一特定任务研究 LLM 蒸馏。为了解决这个问题，我们首先提出了归一化荧光探针来研究LLM特征的机制。根据我们的调查发现的观察结果，我们提出了合作做市（CMM），这是一种新颖的框架，它跨三个正交维度（层、任务和数据）解耦LLM特征。各种学生模型协作学习简单的 LLM 特征以及不同的维度，每个模型负责一个独特的特征，以实现知识提炼。此外，CMM引入了Hájek-MoE，通过研究不同模型在核函数生成的公共特征空间中的贡献来整合学生模型的输出。在四个真实世界市场数据集上的广泛实验结果证明了 CMM 相对于当前的蒸馏方法和基于 RL 的做市策略的优越性。

Dynamics-Decoupled Trajectory Alignment for Sim-to-Real Transfer in Reinforcement Learning for Autonomous Driving

自动驾驶强化学习中用于仿真到现实迁移的动力学解耦轨迹对齐

Authors: Thomas Steinecker, Alexander Bienemann, Denis Trescher, Thorsten Luettel, Mirko Maehlisch
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07155
Pdf link: https://arxiv.org/pdf/2511.07155
Abstract Reinforcement learning (RL) has shown promise in robotics, but deploying RL on real vehicles remains challenging due to the complexity of vehicle dynamics and the mismatch between simulation and reality. Factors such as tire characteristics, road surface conditions, aerodynamic disturbances, and vehicle load make it infeasible to model real-world dynamics accurately, which hinders direct transfer of RL agents trained in simulation. In this paper, we present a framework that decouples motion planning from vehicle control through a spatial and temporal alignment strategy between a virtual vehicle and the real system. An RL agent is first trained in simulation using a kinematic bicycle model to output continuous control actions. Its behavior is then distilled into a trajectory-predicting agent that generates finite-horizon ego-vehicle trajectories, enabling synchronization between virtual and real vehicles. At deployment, a Stanley controller governs lateral dynamics, while longitudinal alignment is maintained through adaptive update mechanisms that compensate for deviations between virtual and real trajectories. We validate our approach on a real vehicle and demonstrate that the proposed alignment strategy enables robust zero-shot transfer of RL-based motion planning from simulation to reality, successfully decoupling high-level trajectory generation from low-level vehicle control.
中文摘要 强化学习（RL）在机器人技术中显示出前景，但由于车辆动力学的复杂性以及模拟与现实之间的不匹配，在真实车辆上部署强化学习仍然具有挑战性。轮胎特性、路面状况、空气动力学扰动和车辆负载等因素使得准确模拟真实世界的动力学变得不可行，这阻碍了在仿真中训练的 RL 代理的直接转移。在本文中，我们提出了一个框架，通过虚拟车辆和真实系统之间的空间和时间对齐策略，将运动规划与车辆控制解耦。RL 代理首先使用运动学自行车模型进行模拟训练，以输出连续控制动作。然后，它的行为被提炼成轨迹预测代理，生成有限视野的自我车辆轨迹，从而实现虚拟和真实车辆之间的同步。在部署时，Stanley 控制器控制横向动力学，而纵向对齐则通过自适应更新机制来维持，该机制补偿虚拟轨迹和真实轨迹之间的偏差。我们在真实车辆上验证了我们的方法，并证明所提出的对准策略能够将基于 RL 的运动规划从仿真到现实进行稳健的零样本传输，成功地将高级轨迹生成与低级车辆控制解耦。
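
The abstract mentions a Stanley controller for lateral dynamics. The sketch below implements the commonly cited Stanley steering law, steering = heading error + arctan(k * cross-track error / speed), as a point of reference. It is not taken from the paper: the gain, the softening term, the steering limit, and the sign conventions (positive errors steer in the positive direction) are all assumptions for illustration.

```python
import math


def stanley_steering(heading_error, cross_track_error, speed,
                     gain=1.0, softening=1e-3,
                     max_steer=math.radians(30.0)):
    """Steering angle (rad) from the textbook Stanley control law.

    heading_error: path heading minus vehicle heading, in radians.
    cross_track_error: signed lateral offset from the path, in metres.
    speed: forward speed in m/s (assumed non-negative).
    `gain`, `softening`, and `max_steer` are hypothetical tuning values.
    """
    if speed < 0.0:
        raise ValueError("speed must be non-negative")
    # The softening term keeps the correction bounded near standstill.
    correction = math.atan2(gain * cross_track_error, speed + softening)
    steer = heading_error + correction
    # Saturate to the assumed mechanical steering limit.
    return max(-max_steer, min(max_steer, steer))


# Hypothetical state: aligned with the path but 1 m off to one side at 5 m/s.
command = stanley_steering(0.0, 1.0, 5.0)
```

Because the cross-track term is divided by speed, the same lateral offset produces a gentler correction at higher speeds, which is one reason this law is popular for path tracking.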

Guiding Generative Models to Uncover Diverse and Novel Crystals via Reinforcement Learning

引导生成模型通过强化学习发现多样化和新颖的晶体

Authors: Hyunsoo Park, Aron Walsh
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2511.07158
Pdf link: https://arxiv.org/pdf/2511.07158
Abstract Discovering functional crystalline materials entails navigating an immense combinatorial design space. While recent advances in generative artificial intelligence have enabled the sampling of chemically plausible compositions and structures, a fundamental challenge remains: the objective misalignment between likelihood-based sampling in generative modelling and targeted focus on underexplored regions where novel compounds reside. Here, we introduce a reinforcement learning framework that guides latent denoising diffusion models toward diverse and novel, yet thermodynamically viable crystalline compounds. Our approach integrates group relative policy optimisation with verifiable, multi-objective rewards that jointly balance creativity, stability, and diversity. Beyond de novo generation, we demonstrate enhanced property-guided design that preserves chemical validity, while targeting desired functional properties. This approach establishes a modular foundation for controllable AI-driven inverse design that addresses the novelty-validity trade-off across scientific discovery applications of generative models.
中文摘要 发现功能性晶体材料需要探索一个巨大的组合设计空间。虽然生成式人工智能的最新进展使得化学上合理的成分和结构的采样成为可能，但仍然存在一个根本挑战：生成建模中基于似然的采样与对新型化合物所在的未充分探索区域的有针对性的关注之间存在目标错位。在这里，我们引入了一个强化学习框架，该框架将潜在去噪扩散模型引导到多样化、新颖但热力学上可行的结晶化合物。我们的方法将组相对策略优化与可验证的多目标奖励相结合，共同平衡创造力、稳定性和多样性。除了从头生成之外，我们还展示了增强的性质引导设计，可以保持化学有效性，同时针对所需的功能特性。这种方法为可控的人工智能驱动的逆向设计建立了模块化基础，解决了生成模型科学发现应用中新颖性与有效性的权衡问题。
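
Group relative policy optimisation scores each sample relative to the other samples drawn for the same prompt or condition. The sketch below shows the usual group-normalised advantage, (reward minus group mean) divided by group standard deviation; it is a generic illustration, and the reward values, the `eps` constant, and the function name are assumptions rather than details from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group.

    Each advantage is (r - mean) / (std + eps), so samples are ranked
    against their own group rather than against a learned value baseline.
    `eps` guards against division by zero when all rewards are equal.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical multi-objective rewards for four generated crystals.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is why diverse rewards within a group matter.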

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度 Actor-Critic 稳定化实现离策略模仿学习

Authors: Sayambhu Sen, Shalabh Bhatnagar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07288
Pdf link: https://arxiv.org/pdf/2511.07288
Abstract Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL; Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network-based stabilization and value learning without reward function inference, we demonstrate a reduction in the samples required to robustly match expert behavior.
中文摘要 使用强化学习（RL）学习复杂策略往往受到不稳定性和收敛缓慢的阻碍，奖励工程的难度加剧了这个问题。来自专家演示的模仿学习（IL）绕过了这种对奖励的依赖。然而，以生成对抗模仿学习（GAIL；Ho 等人）为代表的最先进的 IL 方法存在严重的样本效率低下问题。这是其基础的同策略（on-policy）算法（例如 TRPO，Schulman 等人）的直接结果。在这项工作中，我们引入了一种对抗性模仿学习算法，该算法结合了离策略学习以提高样本效率。通过将离策略框架与辅助技术相结合，特别是基于双 Q 网络的稳定化和无需奖励函数推理的价值学习，我们证明了稳健匹配专家行为所需的样本数量有所减少。

Superhuman AI for Stratego Using Self-Play Reinforcement Learning and Test-Time Search

使用自我博弈强化学习和测试时搜索的 Stratego 超人 AI

Authors: Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J. Zico Kolter, Gabriele Farina
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07312
Pdf link: https://arxiv.org/pdf/2511.07312
Abstract Few classical games have been regarded as such significant benchmarks of artificial intelligence as to have justified training costs in the millions of dollars. Among these, Stratego -- a board wargame exemplifying the challenge of strategic decision making under massive amounts of hidden information -- stands apart as a case where such efforts failed to produce performance at the level of top humans. This work establishes a step change in both performance and cost for Stratego, showing that it is now possible not only to reach the level of top humans, but to achieve vastly superhuman level -- and that doing so requires not an industrial budget, but merely a few thousand dollars. We achieved this result by developing general approaches for self-play reinforcement learning and test-time search under imperfect information.
中文摘要 很少有经典游戏被视为如此重要的人工智能基准，足以证明数百万美元的训练成本是合理的。其中，Stratego——一款棋盘战争游戏，体现了在大量隐藏信息下战略决策的挑战——作为此类努力未能产生顶级人类水平表现的一个案例而脱颖而出。这项工作为Stratego在性能和成本上都发生了重大变化，表明现在不仅有可能达到顶尖人类的水平，而且可以达到超人的水平——而且这样做不需要工业预算，而只需要几千美元。我们通过开发不完善信息下自我游戏强化学习和测试时间搜索的通用方法实现了这一结果。

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

RLVE：使用自适应可验证环境扩展语言模型的强化学习

Authors: Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07317
Pdf link: https://arxiv.org/pdf/2511.07317
Abstract We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
中文摘要 我们引入了带有自适应可验证环境（RLVE）的强化学习（RL），这是一种使用可验证环境的方法，该环境在程序上生成问题并提供算法可验证的奖励，以扩展语言模型（LM）的 RL。RLVE使每个可验证环境能够随着训练的进行动态调整其问题难度分布以适应策略模型的能力。相比之下，静态数据分布通常会导致学习信号消失，当问题对策略来说太容易或太难时。为了实现 RLVE，我们创建了 RLVE-Gym，这是一个包含 400 个可验证环境的大型套件，通过手动环境工程精心开发。使用 RLVE-Gym，我们表明环境扩展，即扩展训练环境的集合，可以持续提高可推广的推理能力。RLVE在RLVE-Gym中对所有400个环境进行联合训练，从最强的1.5B推理LM之一开始，在六个推理基准中产生了3.37%的绝对平均改进。相比之下，尽管使用了 3 倍多的计算量，但继续该 LM 的原始 RL 训练仅产生 0.49% 的平均绝对增益。我们公开发布我们的代码。
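
The abstract states that each environment adapts its problem difficulty to the policy's current capability, but it does not say how. Purely as a hypothetical illustration of that idea, the sketch below raises or lowers an integer difficulty level based on recent accuracy; the thresholds, the level range, and the function name are invented here and should not be read as RLVE's actual rule.

```python
def adapt_difficulty(level, accuracy, promote_at=0.8, demote_at=0.2,
                     min_level=0, max_level=10):
    """Return the next difficulty level given recent accuracy in [0, 1].

    Hypothetical rule: step up when problems are mostly solved, step down
    when they are mostly failed, otherwise stay. All thresholds are
    illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if accuracy >= promote_at:
        level += 1
    elif accuracy <= demote_at:
        level -= 1
    # Keep the level inside the environment's supported range.
    return max(min_level, min(max_level, level))


# A policy solving 90% of level-3 problems would move on to level 4.
next_level = adapt_difficulty(3, 0.9)
```

Keeping accuracy away from 0 and 1 is what avoids the vanishing learning signal that the abstract attributes to static data distributions.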

FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation

FinRpt：用于股票研究报告生成的数据集、评估系统和基于 LLM 的多智能体框架

Authors: Song Jin, Shuqi Li, Shukun Zhang, Rui Yan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07322
Pdf link: https://arxiv.org/pdf/2511.07322
Abstract While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address data scarcity and the absence of evaluation metrics, we present FinRpt, an open-source evaluation benchmark for ERR generation. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high-quality ERR dataset automatically, which can be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to address this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metrics effectiveness of the benchmark FinRpt and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.
中文摘要 虽然大语言模型（LLM）在股票预测和问答等金融任务中取得了巨大成功，但其在全自动生成股票研究报告方面的应用仍是未知领域。在本文中，我们首次提出了股票研究报告（ERR）生成任务。为了解决数据稀缺和评估指标缺失的问题，我们提出了一个用于 ERR 生成的开源评估基准 FinRpt。我们设计了一个数据集构建流水线，该流水线整合了 7 种金融数据类型并自动生成高质量的 ERR 数据集，可用于模型训练和评估。我们还引入了一个包含 11 项指标的综合评估系统来评估生成的 ERR。此外，我们提出了一个专为该任务设计的多智能体框架 FinRpt-Gen，并使用监督微调和强化学习在所提出的数据集上训练了多个基于 LLM 的智能体。实验结果表明了基准 FinRpt 的数据质量和指标有效性，以及 FinRpt-Gen 的强劲性能，展示了它们推动 ERR 生成领域创新的潜力。所有代码和数据集均已公开。

IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

IterResearch：通过马尔可夫状态重建重新思考长时程智能体

Authors: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.07327
Pdf link: https://arxiv.org/pdf/2511.07327
Abstract Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
中文摘要 深度研究智能体的最新进展显示出通过对外部来源进行动态推理来自主构建知识的前景。然而，现有方法依赖于单上下文范式，将所有信息累积在单个不断扩张的上下文窗口中，导致上下文窒息和噪声污染，从而限制了它们在长时程任务上的有效性。我们提出了 IterResearch，这是一种新颖的迭代式深度研究范式，它将长时程研究重新表述为带有策略性工作空间重建的马尔可夫决策过程。通过维护一份不断演进的报告作为记忆并定期综合见解，我们的方法在任意探索深度下都能保持一致的推理能力。我们进一步开发了效率感知策略优化（EAPO），这是一种强化学习框架，通过几何奖励折扣激励高效探索，并通过自适应下采样实现稳定的分布式训练。大量实验表明，IterResearch 相比现有开源智能体取得了显著改进，在六个基准上平均提升 14.5pp，并缩小了与前沿专有系统的差距。值得注意的是，我们的范式表现出前所未有的交互扩展能力，可扩展到 2048 次交互并带来显著的性能提升（从 3.5% 到 42.5%）；它还可作为一种有效的提示策略，在长时程任务上使前沿模型相比 ReAct 最多提升 19.2pp。这些发现表明 IterResearch 是长时程推理的通用解决方案，既可作为经过训练的智能体，也可作为前沿模型的提示范式。
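
Two ideas from this abstract lend themselves to a short sketch: rebuilding the workspace each round so that only the evolving report carries memory, and geometric discounting of a final reward across rounds. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `Workspace`, `run_rounds`, the `step` callable, the exact discount form `gamma ** (T - t)`, and the default `gamma` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workspace:
    """Markovian state: no raw interaction history is kept."""
    question: str
    report: str            # evolving summary that serves as memory
    last_observation: str  # result of the most recent interaction only


def run_rounds(question, step, max_rounds):
    """Run up to `max_rounds` rounds.

    `step(workspace) -> (new_report, observation, done)` stands in for a
    policy call; it is a hypothetical interface used only in this sketch.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    ws = Workspace(question, report="", last_observation="")
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        report, observation, done = step(ws)
        # Rebuild the workspace from scratch; only the report persists.
        ws = Workspace(question, report, observation)
        if done:
            break
    return ws, rounds_used


def discounted_round_rewards(final_reward, num_rounds, gamma=0.995):
    """Assign round t (1-based) the reward final_reward * gamma**(T - t)."""
    return [final_reward * gamma ** (num_rounds - t)
            for t in range(1, num_rounds + 1)]
```

Under this assumed discount form, an early round of a short successful trajectory receives a larger reward than the same round of a long one, which is one way to reward efficient exploration.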

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Q-RAG：通过基于值的嵌入器训练进行长上下文多步骤检索

Authors: Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.07328
Pdf link: https://arxiv.org/pdf/2511.07328
Abstract Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.
中文摘要 检索增强生成（RAG）方法通过有效过滤 LLM 的相关上下文、减少幻觉和推理成本来增强 LLM 性能。然而，大多数现有的RAG方法都侧重于单步检索，这通常不足以回答需要多步骤搜索的复杂问题。最近，出现了多步检索方法，通常涉及对小型 LLM 进行微调以执行多步检索。这种类型的微调是高度资源密集型的，并且无法使用更大的 LLM。在这项工作中，我们提出了 Q-RAG，这是一种使用强化学习（RL）微调 Embedder 模型以进行多步检索的新方法。Q-RAG 为现有的开放域问答多步骤检索方法提供了一种具有竞争力的、资源高效的替代方案，并在流行的长上下文基准 Babilong 和 RULER 上取得了最先进的结果，适用于高达 10M 令牌的上下文。
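
To make the contrast between single-step and multi-step retrieval concrete, here is a toy sketch in which the score of a candidate chunk depends on what has already been retrieved. It is not the Q-RAG method: the mean-pooled `state_embedding`, the dot-product score standing in for a learned value, and the greedy selection loop are all assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def state_embedding(query_vec, selected_vecs):
    """Toy state: mean of the query and already-retrieved chunk vectors."""
    vecs = [query_vec] + selected_vecs
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def greedy_multistep_retrieve(query_vec, chunk_vecs, steps):
    """Pick one chunk per step, re-scoring candidates against the new state."""
    selected = []
    for _ in range(min(steps, len(chunk_vecs))):
        state = state_embedding(query_vec, [chunk_vecs[i] for i in selected])
        candidates = [i for i in range(len(chunk_vecs)) if i not in selected]
        # The score plays the role of a value estimate for (state, chunk).
        best = max(candidates, key=lambda i: dot(state, chunk_vecs[i]))
        selected.append(best)
    return selected


query = [1.0, 0.0, 0.0]
chunks = [
    [1.0, 0.0, 1.0],  # matches the query and points to a second topic
    [0.0, 0.0, 2.0],  # only relevant once the first chunk is known
    [0.5, 0.0, 0.0],  # weakly matches the query
]
print(greedy_multistep_retrieve(query, chunks, steps=2))  # [0, 1]
```

Ranking by the query alone would place chunk 2 above chunk 1; re-scoring after the first retrieval reverses that order, which is the behaviour multi-step retrieval is meant to capture.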

Grounding Computer Use Agents on Human Demonstrations

基于人类演示的计算机使用智能体定位（grounding）

Authors: Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07332
Pdf link: https://arxiv.org/pdf/2511.07332
Abstract Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
中文摘要 构建可靠的计算机使用智能体需要定位（grounding）：将自然语言指令准确地关联到正确的屏幕元素。虽然已有面向 Web 和移动端交互的大型数据集，但面向桌面环境的高质量资源十分有限。为了弥补这一空白，我们推出了 GroundCUA，这是一个基于专家人类演示构建的大规模桌面定位数据集。它涵盖 12 个类别的 87 个应用程序，包含 56K 张屏幕截图，每个屏幕元素都经过仔细标注，总计超过 3.56M 条经人工验证的标注。基于这些演示，我们生成了涵盖各类现实任务的多样化指令，为模型训练提供高质量数据。利用 GroundCUA，我们开发了 GroundNext 系列模型，将指令映射到其目标 UI 元素。在 3B 和 7B 两种规模下，GroundNext 通过监督微调在五个基准上取得了最先进的结果，而所需训练数据不到先前工作的十分之一。强化学习后训练进一步提升了性能；在 OSWorld 基准上以 o3 作为规划器的智能体设置中进行评估时，GroundNext 取得了与使用多得多的数据训练的模型相当或更优的结果。这些结果证明了高质量、专家驱动的数据集在推进通用计算机使用智能体方面的关键作用。

Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Transformer 树推理后训练中课程学习的可证明益处

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07372
Pdf link: https://arxiv.org/pdf/2511.07372
Abstract Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
中文摘要 人们普遍观察到，LLM 后训练阶段的最新课程学习技术在提升推理性能方面优于非课程方法，但对其为何有效以及在多大程度上有效，仍缺乏有原则的理解。为了弥补这一空白，我们建立了一个理论框架，其直觉是：只要每个阶段都处于模型的有效能力范围内，通过可管理的步骤逐步学习就比直接处理困难的推理任务更高效。在连接相邻课程阶段的温和复杂度条件下，我们证明课程式后训练可以避免指数级复杂度瓶颈。为了证实这一结果，我们借鉴求解 Countdown 和奇偶校验（parity）等数学问题的思维链（CoT），将 CoT 生成建模为状态条件的自回归推理树，定义均匀分支的基础模型来刻画预训练行为，并将课程阶段形式化为深度递增（更长的推理链）或提示递减（更短的前缀）的子任务。我们的分析表明，在仅有结果奖励信号的情况下，强化学习微调能以多项式样本复杂度达到高准确率，而直接学习则受制于指数级瓶颈。我们进一步为测试时扩展建立了类似的保证：课程感知的查询可将奖励预言机调用次数和采样成本从指数阶降低到多项式阶。
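
The exponential-versus-polynomial gap described here can be seen in a back-of-the-envelope calculation. This is a toy model, not the paper's theorem: it assumes a reasoning tree with uniform branching factor `branching`, a single correct path of length `depth`, a uniformly random base policy, and a depth-increasing curriculum in which each stage has already mastered the previous levels and only has to find one new correct branch.

```python
def direct_expected_rollouts(branching, depth):
    """Expected uniform rollouts until the first outcome reward.

    A uniform policy hits the single correct leaf with probability
    branching**(-depth), so the expected number of tries is its inverse.
    """
    return branching ** depth


def curriculum_expected_rollouts(branching, depth):
    """Expected rollouts summed over `depth` depth-increasing stages.

    Each stage only searches one new level with `branching` options,
    assuming earlier levels are already solved (a strong simplification).
    """
    return branching * depth


for depth in (5, 10, 20):
    print(depth,
          direct_expected_rollouts(2, depth),
          curriculum_expected_rollouts(2, depth))
```

With branching factor 2 and depth 10, the direct count is 1024 rollouts against 20 for the staged version; the gap widens exponentially with depth, which is the intuition behind the formal result.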

Unified Humanoid Fall-Safety Policy from a Few Demonstrations

基于少量演示的统一人形机器人跌倒安全策略

Authors: Zhengjie Xu, Ye Li, Kwan-yee Lin, Stella X. Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.07407
Pdf link: https://arxiv.org/pdf/2511.07407
Abstract Falling is an inherent risk of humanoid mobility. Maintaining stability is thus a primary safety focus in robot control and learning, yet no existing approach fully averts loss of balance. When instability does occur, prior work addresses only isolated aspects of falling: avoiding falls, choreographing a controlled descent, or standing up afterward. Consequently, humanoid robots lack integrated strategies for impact mitigation and prompt recovery when real falls defy these scripts. We aim to go beyond keeping balance to make the entire fall-and-recovery process safe and autonomous: prevent falls when possible, reduce impact when unavoidable, and stand up when fallen. By fusing sparse human demonstrations with reinforcement learning and an adaptive diffusion-based memory of safe reactions, we learn adaptive whole-body behaviors that unify fall prevention, impact mitigation, and rapid recovery in one policy. Experiments in simulation and on a Unitree G1 demonstrate robust sim-to-real transfer, lower impact forces, and consistently fast recovery across diverse disturbances, pointing towards safer, more resilient humanoids in real environments. Videos are available at this https URL.
中文摘要 跌倒是人形机器人移动的固有风险。因此，保持稳定性是机器人控制与学习中的首要安全关注点，但现有方法都无法完全避免失去平衡。当不稳定确实发生时，先前的工作只处理跌倒的孤立环节：避免跌倒、编排受控下降，或跌倒后站起。因此，当真实的跌倒不符合这些预设脚本时，人形机器人缺乏用于减轻冲击和迅速恢复的综合策略。我们的目标不止于保持平衡，而是让整个跌倒与恢复过程安全且自主：能避免时防止跌倒，无法避免时减小冲击，跌倒后重新站起。通过将稀疏的人类演示与强化学习以及基于扩散的自适应安全反应记忆相融合，我们学习到自适应的全身行为，在一个策略中统一了跌倒预防、冲击缓解和快速恢复。仿真和 Unitree G1 上的实验展示了稳健的仿真到真实迁移、更低的冲击力，以及在各种扰动下始终快速的恢复，为在真实环境中实现更安全、更具韧性的人形机器人指明了方向。视频可在此 https URL 获取。

Robot Learning from a Physical World Model

基于物理世界模型的机器人学习

Authors: Jiageng Mao, Sicheng He, Hao-Ning Wu, Yang You, Shuyang Sun, Zhicheng Wang, Yanan Bao, Huizhong Chen, Leonidas Guibas, Vitor Guizilini, Howard Zhou, Yue Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.07416
Pdf link: https://arxiv.org/pdf/2511.07416
Abstract We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage (this https URL) for details.
中文摘要 我们提出了 PhysWorld，这是一个通过物理世界建模使机器人能够从视频生成中学习的框架。最近的视频生成模型可以根据语言命令和图像合成逼真的视觉演示，为机器人技术提供了一个强大但尚未被充分探索的训练信号来源。然而，直接将生成视频中的像素运动重定向到机器人会忽略物理规律，往往导致操作不准确。PhysWorld 通过将视频生成与物理世界重建相结合来解决这一局限。给定单张图像和一条任务命令，我们的方法生成以任务为条件的视频，并从视频中重建底层物理世界；随后借助物理世界模型，通过以物体为中心的残差强化学习，将生成的视频运动落实为物理上准确的动作。这种协同将隐式的视觉引导转化为物理上可执行的机器人轨迹，无需采集真实机器人数据，并实现了零样本、可泛化的机器人操作。在多种真实世界任务上的实验表明，与以往方法相比，PhysWorld 显著提高了操作精度。详情请访问项目网页（this https URL）。

Keyword: diffusion policy

Gentle Manipulation Policy Learning via Demonstrations from VLM Planned Atomic Skills

通过 VLM 规划的原子技能演示进行温和操作策略学习

Authors: Jiayu Zhou, Qiwei Wu, Jian Li, Zhe Chen, Xiaogang Xiong, Renjing Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.05855
Pdf link: https://arxiv.org/pdf/2511.05855
Abstract Autonomous execution of long-horizon, contact-rich manipulation tasks traditionally requires extensive real-world data and expert engineering, posing significant cost and scalability challenges. This paper proposes a novel framework integrating hierarchical semantic decomposition, reinforcement learning (RL), visual language models (VLMs), and knowledge distillation to overcome these limitations. Complex tasks are decomposed into atomic skills, with RL-trained policies for each primitive exclusively in simulation. Crucially, our RL formulation incorporates explicit force constraints to prevent object damage during delicate interactions. VLMs perform high-level task decomposition and skill planning, generating diverse expert demonstrations. These are distilled into a unified policy via Visual-Tactile Diffusion Policy for end-to-end execution. We conduct comprehensive ablation studies exploring different VLM-based task planners to identify optimal demonstration generation pipelines, and systematically compare imitation learning algorithms for skill distillation. Extensive simulation experiments and physical deployment validate that our approach achieves policy learning for long-horizon manipulation without costly human demonstrations, while the VLM-guided atomic skill framework enables scalable generalization to diverse tasks.
中文摘要 传统上，自主执行长时程、接触丰富的操作任务需要大量真实世界数据和专家工程，带来了显著的成本和可扩展性挑战。本文提出了一种融合分层语义分解、强化学习（RL）、视觉语言模型（VLM）和知识蒸馏的新框架来克服这些限制。复杂任务被分解为原子技能，每个基元的策略完全在仿真中通过 RL 训练。至关重要的是，我们的 RL 建模引入了显式的力约束，以防止在精细交互过程中损坏物体。VLM 负责高层任务分解和技能规划，生成多样化的专家演示。这些演示通过视觉-触觉扩散策略（Visual-Tactile Diffusion Policy）被蒸馏为统一的策略，以实现端到端执行。我们开展了全面的消融研究，探索不同的基于 VLM 的任务规划器以确定最优的演示生成流程，并系统地比较了用于技能蒸馏的模仿学习算法。大量仿真实验和实物部署验证了我们的方法无需昂贵的人类演示即可实现长时程操作的策略学习，而 VLM 引导的原子技能框架能够可扩展地泛化到多样化任务。