Arxiv Papers of Today

Generated: 2026-01-02 16:32:43 (UTC+8); Arxiv release time: 2026-01-01 20:00 EST (2026-01-02 09:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

A Survey of AI Methods for Geometry Preparation and Mesh Generation in Engineering Simulation

工程仿真中几何准备和网格生成的人工智能方法综述

Authors: Steven Owen, Nathan Brown, Nikos Chrisochoides, Rao Garimella, Xianfeng Gu, Franck Ledoux, Na Lei, Roshan Quadros, Navamita Ray, Nicolas Winovich, Yongjie Jessica Zhang
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23719
Pdf link: https://arxiv.org/pdf/2512.23719
Abstract Artificial intelligence is beginning to ease long-standing bottlenecks in the CAD-to-mesh pipeline. This survey reviews recent advances where machine learning aids part classification, mesh quality prediction, and defeaturing. We explore methods that improve unstructured and block-structured meshing, support volumetric parameterizations, and accelerate parallel mesh generation. We also examine emerging tools for scripting automation, including reinforcement learning and large language models. Across these efforts, AI acts as an assistive technology, extending the capabilities of traditional geometry and meshing tools. The survey highlights representative methods, practical deployments, and key research challenges that will shape the next generation of data-driven meshing workflows.
中文摘要 人工智能正开始缓解CAD到网格流程中长期存在的瓶颈。本次调查回顾了机器学习在辅助零件分类、网格质量预测和改进方面的最新进展。我们探索改进非结构化和块结构网格、支持体积参数化以及加速并行网格生成的方法。我们还研究了新兴的脚本自动化工具，包括强化学习和大型语言模型。在这些努力中，人工智能作为辅助技术，扩展了传统几何和网格工具的能力。调查重点介绍了具有代表性的方法、实用部署以及将塑造下一代数据驱动网格工作流程的关键研究挑战。

Audited Skill-Graph Self-Improvement for Agentic LLMs via Verifiable Rewards, Experience Synthesis, and Continual Memory

经审计的技能图谱自我提升，适用于代理型大型语言模型，通过可验证的奖励、经验综合和持续记忆

Authors: Ken Huang, Jerry Huang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23760
Pdf link: https://arxiv.org/pdf/2512.23760
Abstract Reinforcement learning is increasingly used to transform large language models into agentic systems that act over long horizons, invoke tools, and manage memory under partial observability. While recent work has demonstrated performance gains through tool learning, verifiable rewards, and continual training, deployed self-improving agents raise unresolved security and governance challenges: optimization pressure can incentivize reward hacking, behavioral drift is difficult to audit or reproduce, and improvements are often entangled in opaque parameter updates rather than reusable, verifiable artifacts. This paper proposes Audited Skill-Graph Self-Improvement (ASG-SI), a framework that treats self-improvement as iterative compilation of an agent into a growing, auditable skill graph. Each candidate improvement is extracted from successful trajectories, normalized into a skill with an explicit interface, and promoted only after passing verifier-backed replay and contract checks. Rewards are decomposed into reconstructible components derived from replayable evidence, enabling independent audit of promotion decisions and learning signals. ASG-SI further integrates experience synthesis for scalable stress testing and continual memory control to preserve long-horizon performance under bounded context. We present a complete system architecture, threat model, and security analysis, and provide a fully runnable reference implementation that demonstrates verifier-backed reward construction, skill compilation, audit logging, and measurable improvement under continual task streams. ASG-SI reframes agentic self-improvement as accumulation of verifiable, reusable capabilities, offering a practical path toward reproducible evaluation and operational governance of self-improving AI agents.
中文摘要 强化学习越来越多地被用来将大型语言模型转化为能够跨越长视野、调用工具并在部分可观察性下管理记忆的智能体系统。尽管近期研究通过工具学习、可验证奖励和持续培训显示性能提升，部署的自我改进代理仍带来了未解决的安全和治理挑战：优化压力可能激励奖励黑客行为，行为漂移难以审计或重现，改进往往纠缠在不透明的参数更新中，而非可重用、可验证的工件。本文提出了审计技能图自我提升（ASG-SI）框架，将自我提升视为将代理逐步编译为不断增长、可审计的技能图谱。每个候选改进都从成功的轨迹中提取，归一化为带有显式界面的技能，只有通过验证者支持的回放和合同检查后才能晋升。奖励被分解为可重构的组件，基于可重玩的证据，从而实现对晋升决策和学习信号的独立审计。ASG-SI进一步整合经验综合，实现可扩展的压力测试和持续的记忆控制，以在有限上下文下保持长期性能。我们呈现完整的系统架构、威胁模型和安全分析，并提供了一个完全可运行的参考实现，展示了验证者支持的奖励构建、技能编制、审计日志以及在持续任务流下的可衡量改进。ASG-SI将智能体自我改进重新定义为可验证、可复用能力的积累，提供了一条可重复的自我提升AI智能体的可重复评估和运营治理的实用路径。

Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions

安全偏向策略优化：通过信任区域实现硬约束强化学习

Authors: Ankit Kanwar, Dominik Wagner, Luke Ong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23770
Pdf link: https://arxiv.org/pdf/2512.23770
Abstract Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints. Existing approaches, such as Lagrangian and projection-based methods, often either fail to ensure near-zero safety violations or sacrifice reward performance in the face of hard constraints. We propose Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL. SB-TRPO adaptively biases policy updates towards constraint satisfaction while still seeking reward improvement. Concretely, it performs trust-region updates using a convex combination of the natural policy gradients of cost and reward, ensuring a fixed fraction of optimal cost reduction at each step. We provide a theoretical guarantee of local progress towards safety, with reward improvement when gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
中文摘要 在安全关键领域，强化学习（RL）要求智能体在严格遵守安全约束的同时最大化奖励。现有方法，如拉格朗日和基于投影的方法，往往要么无法确保近乎零的安全违规，要么在面对硬约束时牺牲了奖励性能。我们提出了安全偏置信任区域策略优化（SB-TRPO），这是一种用于硬约束强化学习的新型信任区域算法。SB-TRPO在寻求奖励改进的同时，自适应地偏向约束满足的政策更新。具体来说，它通过成本和回报的自然策略梯度的凸组合来进行信任区域更新，确保每个步骤都获得固定比例的最优成本降低。我们提供理论上的保证，保证本地安全进展，当梯度适当对齐时，奖励会有所提升。标准且具有挑战性的Safety Gymnasium任务实验表明，SB-TRPO在安全性与有意义任务完成度之间，始终优于最先进方法。
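The convex-combination idea in the SB-TRPO abstract can be sketched as follows. This is not the authors' implementation. It assumes an identity Fisher matrix (so natural gradients reduce to plain gradients), a Euclidean trust region of hypothetical radius `radius`, non-zero gradients, and a hypothetical rule that picks the smallest cost weight `mu` for which the linearised cost decrease reaches a fraction `eps` of the best achievable decrease. All names and numbers are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale_to(v, radius):
    """Rescale a non-zero vector to the given Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [radius * a / norm for a in v]

def safety_biased_step(g_reward, g_cost, eps=0.5, radius=0.1):
    """Blend a reward-ascent step and a cost-descent step (assumed rule)."""
    d_reward = scale_to(g_reward, radius)             # pure reward step
    d_cost = scale_to([-a for a in g_cost], radius)   # pure cost-reduction step
    c_reward = dot(g_cost, d_reward)  # linearised cost change of the reward step
    c_best = dot(g_cost, d_cost)      # most negative achievable cost change
    target = eps * c_best             # required cost change (negative)
    if c_reward <= target:
        mu = 0.0                      # reward step is already safe enough
    else:
        mu = (target - c_reward) / (c_best - c_reward)
    step = [(1 - mu) * r + mu * c for r, c in zip(d_reward, d_cost)]
    return step, mu

# Hypothetical gradients: reward and cost both increase along the first axis.
g_reward = [1.0, 0.0]
g_cost = [1.0, 1.0]
step, mu = safety_biased_step(g_reward, g_cost, eps=0.5)
```

Because both candidate steps lie inside the same ball, their convex combination does too, so the blended step stays within the assumed trust region.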

FineFT: Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading

FineFT：期货交易中的高效且风险意识强化学习

Authors: Molei Qin, Xinyu Cai, Yewen Li, Haochong Xia, Chuqiao Zong, Shuo Sun, Xinrun Wang, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23773
Pdf link: https://arxiv.org/pdf/2512.23773
Abstract Futures are contracts obligating the exchange of an asset at a predetermined date and price. They are notable for their high leverage and liquidity and therefore thrive in the crypto market. RL has been widely applied in various quantitative tasks. However, most methods focus on the spot market and cannot be directly applied to the highly leveraged futures market because of two challenges. First, high leverage amplifies reward fluctuations, making training stochastic and difficult to converge. Second, prior works lack self-awareness of their capability boundaries, exposing them to the risk of significant loss when encountering new market states (e.g., a black swan event like COVID-19). To tackle these challenges, we propose the Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading (FineFT), a novel three-stage ensemble RL framework with stable training and proper risk management. In stage I, ensemble Q-learners are selectively updated by ensemble TD errors to improve convergence. In stage II, we filter the Q-learners based on their profitability and train VAEs on market states to identify the capability boundaries of the learners. In stage III, we choose from the filtered ensemble and a conservative policy, guided by trained VAEs, to maintain profitability and mitigate risk with new market states. Through extensive experiments on crypto futures in a high-frequency trading environment with high fidelity and 5x leverage, we demonstrate that FineFT outperforms 12 SOTA baselines in 6 financial metrics, reducing risk by more than 40% while achieving superior profitability compared to the runner-up. Visualization of the selective update mechanism shows that different agents specialize in distinct market dynamics, and ablation studies confirm that routing with VAEs effectively reduces maximum drawdown and that selective update improves convergence and performance.
中文摘要 期货是合同，要求在预定日期和价格下交易资产，以其高杠杆性和流动性著称，因此在加密市场中非常受欢迎。强化学习已被广泛应用于各种定量任务。然而，大多数方法集中在现货，无法直接应用于高杠杆期货市场，存在两个挑战。首先，高杠杆放大了奖励波动，使训练变得随机且难以收敛。其次，以往的工作缺乏对能力边界的自我意识，使其在遇到新市场状态时面临重大损失风险（例如，像COVID-19这样的黑天鹅事件）。为应对这些挑战，我们提出了高效且风险意识的集合强化学习期货交易（FineFT），这是一种新型三阶段集合强化学习框架，具备稳定的训练和适当的风险管理。在第一阶段，集合Q学习者通过集合TD错误选择性更新，以改善收敛性。在第二阶段，我们根据Q学习者的盈利能力进行筛选，并针对市场状态培训VAE以识别学习者的能力边界。在第三阶段，我们从筛选组合和由受过培训的增值工程师指导的保守政策中选择，以保持盈利并降低新市场状态的风险。通过在高保真度和5倍杠杆高频交易环境中对加密期货进行的广泛实验，我们证明FineFT在6个金融指标中优于12个SOTA基线，风险降低超过40%，同时实现了优于亚军的盈利能力。选择性更新机制的可视化显示，不同代理专注于不同的市场动态，消融研究证明使用VAE的路由有效减少最大拉出，选择性更新提升收敛性和性能。
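The stage III routing described in the FineFT abstract might look roughly like the sketch below. It is an illustration under strong assumptions, not the paper's code: the VAE is replaced by a stand-in novelty score (squared distance to a per-learner centroid), the `threshold` is hypothetical, and the conservative policy is assumed to be "hold a flat position" (action `0`).

```python
def novelty(state, centroid):
    """Stand-in for a VAE reconstruction error: squared distance to a centroid."""
    return sum((s - c) ** 2 for s, c in zip(state, centroid))

def route(state, learners, threshold):
    """Pick the learner most familiar with `state`, or fall back to a safe action.

    `learners` is a list of (name, centroid, act) tuples, all hypothetical.
    """
    scored = [(novelty(state, centroid), name, act) for name, centroid, act in learners]
    score, name, act = min(scored, key=lambda item: item[0])
    if score > threshold:
        # No learner recognises this market state: use the conservative policy.
        return "conservative", 0
    return name, act(state)

# Two toy learners that always go long (+1) or short (-1).
learners = [
    ("trend", [1.0, 0.0], lambda state: 1),
    ("revert", [-1.0, 0.0], lambda state: -1),
]
familiar = route([0.9, 0.1], learners, threshold=0.5)
unfamiliar = route([5.0, 5.0], learners, threshold=0.5)
```

The design choice illustrated here is that the fallback is triggered by the state being unfamiliar to every learner, rather than by the predicted return.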

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

提示诱发的过载生成作为拒绝服务：黑箱攻击端基准测试

Authors: Manu, Yi Guo, Jo Plested, Tim Lynar, Kanchana Thilakarathna, Nirhoshan Sivaroopan, Jack Yang, Wangli Yang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23779
Pdf link: https://arxiv.org/pdf/2512.23779
Abstract Large language models (LLMs) can be driven into over-generation, emitting thousands of tokens before producing an end-of-sequence (EOS) token. This degrades answer quality, inflates latency and cost, and can be weaponized as a denial-of-service (DoS) attack. Recent work has begun to study DoS-style prompt attacks, but typically focuses on a single attack algorithm or assumes white-box access, without an attack-side benchmark that compares prompt-based attackers in a black-box, query-only regime with a known tokenizer. We introduce such a benchmark and study two prompt-only attackers. The first is Evolutionary Over-Generation Prompt Search (EOGen), which searches the token space for prefixes that suppress EOS and induce long continuations. The second is a goal-conditioned reinforcement learning attacker (RL-GOAL) that trains a network to generate prefixes conditioned on a target length. To characterize behavior, we introduce Over-Generation Factor (OGF), the ratio of produced tokens to a model's context window, along with stall and latency summaries. Our evolutionary attacker achieves mean OGF = 1.38 +/- 1.15 and Success@OGF >= 2 of 24.5 percent on Phi-3. RL-GOAL is stronger: across victims it achieves higher mean OGF (up to 2.81 +/- 1.38).
中文摘要 大型语言模型（LLM）可以被驱动到溢生成阶段，先发出数千个令牌，然后生成序列结束（EOS）令牌。这会降低答复质量，增加延迟和成本，并可能被用作拒绝服务（DoS）攻击。近期研究开始研究类似DoS的提示攻击，但通常聚焦于单一攻击算法或假设白盒访问，缺乏攻击端基准测试来比较黑箱、仅查询模式下基于提示的攻击者与已知分词器。我们提出了这样的基准测试，并研究了两种仅限提示的攻击者。第一种是进化超生成提示搜索（EOGen），它在令牌空间中搜索抑制EOS并诱导长延续的前缀。第二种是目标条件强化学习攻击者（RL-GOAL），它训练网络生成基于目标长度的前缀。为了描述行为，我们引入了超生成因子（OGF），即产生的代币与模型上下文窗口的比率，以及停滞和延迟总结。我们的进化攻击者在Phi-3上的平均OGF=1.38 +/- 1.15，Success@OGF >=2，占24.5%。RL-GOAL更强：在受害者中实现了更高的平均OGF（最高2.81 +/- 1.38）。
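The Over-Generation Factor defined in the abstract (produced tokens divided by the model's context window) is simple enough to compute directly. The sketch below also reads "Success@OGF >= 2" as the fraction of attack attempts whose OGF reaches 2, which is an assumption about the metric's exact definition; the token counts and the window size are made up.

```python
def over_generation_factor(produced_tokens, context_window):
    """OGF = produced tokens / context window, as defined in the abstract."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return produced_tokens / context_window

def success_at(ogfs, threshold=2.0):
    """Fraction of attempts with OGF >= threshold (assumed reading of Success@OGF)."""
    return sum(ogf >= threshold for ogf in ogfs) / len(ogfs)

# Hypothetical output lengths for four attack prompts against a 4096-token window.
lengths = [1024, 4096, 9000, 2048]
window = 4096
ogfs = [over_generation_factor(n, window) for n in lengths]
rate = success_at(ogfs, threshold=2.0)
```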

Max-Entropy Reinforcement Learning with Flow Matching and A Case Study on LQR

带流量匹配的最大熵强化学习及LQR案例研究

Authors: Yuyang Zhang, Yang Hu, Bo Dai, Na Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23870
Pdf link: https://arxiv.org/pdf/2512.23870
Abstract Soft actor-critic (SAC) is a popular algorithm for max-entropy reinforcement learning. In practice, the energy-based policies in SAC are often approximated using simple policy classes for efficiency, sacrificing the expressiveness and robustness. In this paper, we propose a variant of the SAC algorithm that parameterizes the policy with flow-based models, leveraging their rich expressiveness. In the algorithm, we evaluate the flow-based policy utilizing the instantaneous change-of-variable technique and update the policy with an online variant of flow matching developed in this paper. This online variant, termed importance sampling flow matching (ISFM), enables policy update with only samples from a user-specified sampling distribution rather than the unknown target distribution. We develop a theoretical analysis of ISFM, characterizing how different choices of sampling distributions affect the learning efficiency. Finally, we conduct a case study of our algorithm on the max-entropy linear quadratic regulator problems, demonstrating that the proposed algorithm learns the optimal action distribution.
中文摘要 软演员-批判者（SAC）是一种流行的最大熵强化学习算法。在实际作中，SAC中的基于能源的策略通常通过简单的策略类来近似以提高效率，牺牲了表达性和鲁棒性。本文提出了SAC算法的变体，利用基于流的模型参数化策略，充分利用其丰富的表达力。在该算法中，我们利用瞬时变量变更技术评估基于流量的策略，并用本文开发的在线流量匹配变体更新策略。这种在线变体称为重要性抽样流匹配（ISFM），允许仅使用用户指定的抽样分布样本更新策略，而非未知目标分布。我们对ISFM进行了理论分析，描述不同抽样分布选择如何影响学习效率。最后，我们对算法在最大熵线性二次调节子问题上进行了案例研究，证明该算法能够学习最优作用分布。
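The importance-sampling ingredient behind ISFM can be illustrated in isolation. The sketch below is not ISFM itself and does not train a flow. It only shows how samples from a user-chosen proposal can be reweighted towards an energy-based target of the form exp(Q(a)/alpha), using self-normalised importance weights. The quadratic `q_value`, the temperature `alpha`, and the standard-normal proposal are all assumptions chosen so that the target is a Gaussian with mean 1 and variance 0.25.

```python
import math
import random

def snis_moments(q_value, alpha, sample_proposal, proposal_pdf, n, seed=0):
    """Estimate mean and variance under pi(a) ~ exp(q_value(a) / alpha)."""
    rng = random.Random(seed)
    actions = [sample_proposal(rng) for _ in range(n)]
    # Unnormalised target density divided by the proposal density.
    weights = [math.exp(q_value(a) / alpha) / proposal_pdf(a) for a in actions]
    total = sum(weights)
    mean = sum(w * a for w, a in zip(weights, actions)) / total
    var = sum(w * (a - mean) ** 2 for w, a in zip(weights, actions)) / total
    return mean, var

def q_value(a):
    return -(a - 1.0) ** 2            # hypothetical quadratic (LQR-like) Q

def normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)

mean, var = snis_moments(
    q_value, alpha=0.5,
    sample_proposal=lambda rng: rng.gauss(0.0, 1.0),
    proposal_pdf=normal_pdf, n=20000,
)
```

A proposal that covers the target poorly would concentrate the weight on a few samples, which is one way to see why the choice of sampling distribution affects learning efficiency.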

Distributed Beamforming in Massive MIMO Communication for a Constellation of Airborne Platform Stations

分布式波束成形，用于大规模MIMO通信，用于空中平台站星座

Authors: Hesam Khoshkbari, Georges Kaddoum, Bassant Selim, Omid Abbasi, Halim Yanikomeroglu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.23900
Pdf link: https://arxiv.org/pdf/2512.23900
Abstract Non-terrestrial base stations (NTBSs), including high-altitude platform stations (HAPSs) and hot-air balloons (HABs), are integral to next-generation wireless networks, offering coverage in remote areas and enhancing capacity in dense regions. In this paper, we propose a distributed beamforming framework for a massive MIMO network with a constellation of aerial platform stations (APSs). Our approach leverages an entropy-based multi-agent deep reinforcement learning (DRL) model, where each APS operates as an independent agent using imperfect channel state information (CSI) in both training and testing phases. Unlike conventional methods, our model does not require CSI sharing among APSs, significantly reducing overhead. Simulation results demonstrate that our method outperforms zero forcing (ZF) and maximum ratio transmission (MRT) techniques, particularly in high-interference scenarios, while remaining robust to CSI imperfections. Additionally, our framework exhibits scalability, maintaining stable performance over an increasing number of users and various cluster configurations. Therefore, the proposed method holds promise for dynamic and interference-rich NTBS networks, advancing scalable and robust wireless solutions.
中文摘要 非地面基站（NTBS），包括高空平台站（HAPS）和热气球（HAB），是下一代无线网络的重要组成部分，能够覆盖偏远地区并增强密集地区的容量。本文提出了一个分布式波束成形框架，用于一个由多个空中平台站（APS）组成的大规模MIMO网络。我们的方法利用基于熵的多智能体深度强化学习（DRL）模型，每个APS作为独立代理使用不完全信道状态信息（CSI）在训练和测试阶段运行。与传统方法不同，我们的模型不要求APS之间共享CSI，显著降低了开销。模拟结果表明，我们的方法在高干扰场景下优于零强迫（ZF）和最大比比传输（MRT）技术，同时对CSI不完美性保持鲁棒性。此外，我们的框架具备可扩展性，能够在越来越多的用户和各种集群配置中保持稳定性能。因此，该方法有望用于动态且干扰丰富的NTBS网络，推动可扩展且稳健的无线解决方案。
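For readers unfamiliar with the two baselines named in the abstract, the sketch below computes textbook MRT and ZF precoders for a tiny downlink channel. It shows the baselines only, not the paper's DRL method. It assumes perfect CSI, two single-antenna users, three transmit antennas, made-up channel values, and no power normalisation.

```python
def hermitian(H):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[H[r][c].conjugate() for r in range(len(H))] for c in range(len(H[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inverse_2x2(G):
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [[G[1][1] / det, -G[0][1] / det],
            [-G[1][0] / det, G[0][0] / det]]

def mrt_precoder(H):
    """Maximum ratio transmission: W = H^H (matched filter)."""
    return hermitian(H)

def zf_precoder(H):
    """Zero forcing for two users: W = H^H (H H^H)^-1."""
    Hh = hermitian(H)
    return matmul(Hh, inverse_2x2(matmul(H, Hh)))

# Hypothetical channel: rows are users, columns are transmit antennas.
H = [[1 + 1j, 0.5 + 0j, 0.2j],
     [0.3 + 0j, 1 - 0.5j, 0.4 + 0j]]
effective_zf = matmul(H, zf_precoder(H))    # close to the identity matrix
effective_mrt = matmul(H, mrt_precoder(H))  # off-diagonals are inter-user interference
```

ZF removes inter-user interference at the cost of a matrix inverse that needs every user's channel, while MRT leaves the interference in place.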

Constraint Breeds Generalization: Temporal Dynamics as an Inductive Bias

约束催生推广：时间动力学作为归纳偏压

Authors: Xia Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23916
Pdf link: https://arxiv.org/pdf/2512.23916
Abstract Conventional deep learning prioritizes unconstrained optimization, yet biological systems operate under strict metabolic constraints. We propose that these physical constraints shape dynamics to function not as limitations, but as a temporal inductive bias that breeds generalization. Through a phase-space analysis of signal propagation, we reveal a fundamental asymmetry: expansive dynamics amplify noise, whereas proper dissipative dynamics compress phase space that aligns with the network's spectral bias, compelling the abstraction of invariant features. This condition can be imposed externally via input encoding, or intrinsically through the network's own temporal dynamics. Both pathways require architectures capable of temporal integration and proper constraints to decode induced invariants, whereas static architectures fail to capitalize on temporal structure. Through comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning, we demonstrate that a critical "transition" regime maximizes generalization capability. These findings establish dynamical constraints as a distinct class of inductive bias, suggesting that robust AI development requires not only scaling and removing limitations, but computationally mastering the temporal characteristics that naturally promote generalization.
中文摘要 传统深度学习优先考虑不受限制的优化，但生物系统却在严格的代谢约束下运行。我们提出，这些物理约束使动力学不再是局限，而是作为一种时间归纳偏见，从而促进泛化。通过对信号传播的相空间分析，我们揭示了一个根本性的不对称性：膨胀动力学放大噪声，而适当的耗散动力学则压缩与网络光谱偏置相符的相空间，迫使不变特征被抽象化。该条件可以通过输入编码外部施加，或通过网络自身的时间动态内在施加。这两种路径都需要具备时间整合能力和适当的约束来解码诱导的不变量，而静态架构则无法充分利用时间结构。通过对监督分类、无监督重建和零样子强化学习的全面评估，我们证明关键的“过渡”状态最大化了泛化能力。这些发现确立了动力学约束作为一类独立的归纳偏置，表明稳健的人工智能开发不仅需要缩放和消除限制，还需要计算掌握自然促进泛化的时间特性。

CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

CEC-Zero：零监督字符错误纠正，附带自我生成奖励

Authors: Zhiming Lin, Kai Zhao, Sophie Zhang, Peilai Yu, Canran Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.23971
Pdf link: https://arxiv.org/pdf/2512.23971
Abstract Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10 to 13 F$_1$ points and strong LLM fine-tunes by 5 to 8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
中文摘要 大规模的中文拼写校正（CSC）对现实文本处理依然至关重要，但现有的大型语言模型和监督方法缺乏对新颖错误的鲁棒性，且依赖昂贵的注释。我们介绍了CEC-Zero，一种零监督强化学习框架，通过帮助大型语言模型纠正自身错误来解决这一问题。CEC-Zero 综合了来自干净文本的错误输入，通过语义相似性和候选一致性计算集群共识奖励，并用 PPO 优化策略。它在9个基准测试中比监督基线高出10-13 F$_1$，强LLM微调高出5-8个百分点，理论上保证无偏的奖励和收敛。CEC-Zero 建立了一个无标签的稳健、可扩展 CSC 范式，释放了噪声文本流水线中的大型语言模型潜力。
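Two steps from the CEC-Zero abstract, synthesising errorful inputs and scoring candidates by agreement, can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline. The confusion set, the corruption `rate`, and the exact-match agreement reward are assumptions, and the semantic-similarity component and PPO update are omitted entirely.

```python
import random
from collections import Counter

def corrupt(text, confusion, rate, rng):
    """Replace characters with confusable ones at the given rate (length preserved)."""
    out = []
    for ch in text:
        if ch in confusion and rng.random() < rate:
            out.append(rng.choice(confusion[ch]))
        else:
            out.append(ch)
    return "".join(out)

def consensus_rewards(candidates):
    """Reward each candidate by the share of sampled candidates identical to it."""
    counts = Counter(candidates)
    return [counts[c] / len(candidates) for c in candidates]

rng = random.Random(0)
confusion = {"的": ["地", "得"]}           # hypothetical confusion set
noisy = corrupt("我的书", confusion, rate=1.0, rng=rng)
rewards = consensus_rewards(["我的书", "我的书", "我地书", "我的书"])
```

No gold label enters the reward: a correction is scored only by how many other sampled corrections agree with it.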

RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

RSAgent：通过多回合工具调用学习推理与行动文本引导分割

Authors: Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, Wenqiang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24023
Pdf link: https://arxiv.org/pdf/2512.24023
Abstract Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
中文摘要 文本引导对象分割既需要跨模态推理，也需要像素基础能力。最新方法将文本引导切割视为一次性接地，模型通过单次前向传递预测像素提示驱动外部分段子，这限制了初始定位错误时的验证、重新聚焦和细化。为解决这一限制，我们提出了RSAgent，一种智能多模态大型语言模型（MLLM），通过多回合工具调用交织推理与动作进行分割。RSAgent 查询分割工具箱，观察视觉反馈，并利用历史观测修正其空间假设，重新定位目标并迭代优化掩码。我们进一步构建数据流水线，综合多回合推理的分割轨迹，并用两阶段框架训练RSAgent：冷启动监督微调，随后是带有细粒度、任务特定奖励的能动强化学习。大量实验表明，RSAgent在ReasonSeg测试中实现了66.5%的零样本gIoU性能，比Seg-Zero-7B提升了9%，在RefCOCOg测试中达到81.5%的cIoU，展示了在域内外基准测试中的最先进性能。
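The abstract reports gIoU and cIoU without defining them. The sketch below follows the convention commonly used with reasoning-segmentation benchmarks, which is assumed (not confirmed by the abstract) to apply here: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union. Masks are flat lists of 0/1 values and the example data is made up.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def giou_ciou(pairs):
    """gIoU: average per-image IoU. cIoU: total intersection / total union."""
    per_image, total_inter, total_union = [], 0, 0
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        # Two empty masks are treated as a perfect match (an assumption).
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou

pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),   # IoU 1/2
    ([1, 1, 1, 1], [1, 1, 1, 1]),   # IoU 4/4
]
giou, ciou = giou_ciou(pairs)
```

The two numbers differ because cIoU weights large objects more heavily, while gIoU treats every image equally.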

Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

强化扩散：学习推动各向异性扩散的极限以实现图像去噪

Authors: Xinran Qin, Yuhui Quan, Ruotao Xu, Hui Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.24035
Pdf link: https://arxiv.org/pdf/2512.24035
Abstract Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions whose order is learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations together compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which improves on the traditional ones. The proposed denoiser is applied to removing three common types of noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
中文摘要 图像去噪是低强度视觉中一个重要问题，也是许多图像恢复任务中的关键模块。各向异性扩散是一类广泛的图像去噪方法，具有良好的性能。然而，传统的各向异性扩散方法使用显式扩散算符，这些算符并不适合复杂的图像结构。因此，与近期基于学习的方法相比，其性能有限。在本研究中，我们描述了一个基于强化学习的可训练各向异性扩散框架。通过将去噪过程建模为一系列通过深度Q学习的有序扩散作用，我们提出了一种有效的基于扩散的图像去噪器。深度Q学习在不同迭代中选择的扩散作用，确实构成了一个随机各向异性扩散过程，具有对不同图像结构的强烈适应性，比传统过程有显著改进。拟议的去噪器用于去除三种常见噪声。实验显示，它优于现有的基于扩散的方法，并与代表性的基于深度卷积神经网络的方法竞争。
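As background for the "explicit diffusion operators" that the abstract contrasts with, here is a minimal sketch of classical Perona-Malik diffusion on a 1D signal. It is the traditional hand-designed scheme, not the learned denoiser from the paper. The exponential edge-stopping function, `kappa`, the step size `lam`, and the test signal are illustrative choices.

```python
import math

def perona_malik_1d(signal, steps=10, kappa=0.5, lam=0.25):
    """Explicit Perona-Malik diffusion with zero-flux boundaries."""
    u = list(signal)
    for _ in range(steps):
        nxt = u[:]
        for i in range(len(u)):
            east = u[i + 1] - u[i] if i + 1 < len(u) else 0.0
            west = u[i - 1] - u[i] if i > 0 else 0.0
            # Edge-stopping weights: near 1 for small gradients, near 0 at edges.
            g_east = math.exp(-(east / kappa) ** 2)
            g_west = math.exp(-(west / kappa) ** 2)
            nxt[i] = u[i] + lam * (g_east * east + g_west * west)
        u = nxt
    return u

# A noisy step edge: small fluctuations around 0 and around 2.
noisy_step = [0.0, 0.1, -0.1, 0.05, 2.0, 2.1, 1.9, 2.05]
smoothed = perona_malik_1d(noisy_step)
```

The small fluctuations on each side are averaged out while the large jump between them is left almost untouched, which is the behaviour a fixed operator obtains only when `kappa` happens to suit the image.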

ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment

ROAD：通过自动调试实现零射智能体比对的反射优化

Authors: Natchaya Temyingyong, Daman Jain, Neeraj Kumarsahu, Prabhat Kumar, Rachata Phondi, Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24040
Pdf link: https://arxiv.org/pdf/2512.24040
Abstract Automatic Prompt Optimization (APO) has emerged as a critical technique for enhancing Large Language Model (LLM) performance, yet current state-of-the-art methods typically rely on large, labeled gold-standard development sets to compute fitness scores for evolutionary or Reinforcement Learning (RL) approaches. In real-world software engineering, however, such curated datasets are rarely available during the initial cold start of agent development, where engineers instead face messy production logs and evolving failure modes. We present ROAD (Reflective Optimization via Automated Debugging), a novel framework that bypasses the need for refined datasets by treating optimization as a dynamic debugging investigation rather than a stochastic search. Unlike traditional mutation strategies, ROAD utilizes a specialized multi-agent architecture, comprising an Analyzer for root-cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, to convert unstructured failure logs into robust, structured Decision Tree Protocols. We evaluated ROAD across both a standardized academic benchmark and a live production Knowledge Management engine. Experimental results demonstrate that ROAD is highly sample-efficient, achieving a 5.6 percent increase in success rate (73.6 percent to 79.2 percent) and a 3.8 percent increase in search accuracy within just three automated iterations. Furthermore, on complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19 percent relative to the baseline. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.
中文摘要 自动提示优化（APO）已成为提升大型语言模型（LLM）性能的关键技术，但当前最先进的方法通常依赖大型、带标签的金标准开发集来计算进化学习或强化学习（RL）方法的适应度评分。然而，在现实软件工程中，此类精心策划的数据集在代理开发的初期冷启动阶段很少可用，工程师们反而会面对混乱的生产日志和不断演变的失败模式。我们提出了ROAD（通过自动调试进行反思优化），这是一个新颖框架，通过将优化视为动态调试调查而非随机搜索，绕过了对精炼数据集的需求。与传统突变策略不同，ROAD 采用了专门的多智能体架构，包括用于根因分析的分析仪、用于模式聚合的优化器和用于策略集成的 Coach，将非结构化的失败日志转换为稳健的结构化决策树协议。我们通过标准化学术基准和实时生产知识管理引擎对ROAD进行了评估。实验结果显示，ROAD采样效率极高，在三次自动化迭代中成功率提升了5.6%（73.6%对79.2%），搜索准确率提升了3.8%。此外，在零售领域的复杂推理任务中，ROAD相较基准提升了约19%的代理表现。这些发现表明，模仿人类工程的故障分析和补丁循环，为部署可靠LLM代理提供了一种可行且数据高效的替代方案，替代资源密集型强化学习训练。

HY-MT1.5 Technical Report

HY-MT1.5技术报告

Authors: Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, Di Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.24092
Pdf link: https://arxiv.org/pdf/2512.24092
Abstract In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro. While it marginally trails Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.
中文摘要 在本报告中，我们介绍了最新的翻译模型HY-MT1.5-1.8B和HY-MT1.5-7B，这是一系列通过整体训练框架开发、专为高性能翻译量身定制的新机器翻译模型。我们的方法论协调了一个多阶段的流程，整合了通用和机器学习导向的预训练、监督式微调、策略提炼和强化学习。HY-MT1.5-1.8B，即1.8B参数模型，在标准中外文和英外任务中，显著优于更大的开源基线（如Tower-Plus-72B、Qwen3-32B）和主流商用API（如Microsoft翻译器、豆宝翻译器）。它实现了约90%的超大型专有型号（如Gemini-3.0-Pro）的性能，虽然在WMT25和普通话语言基准测试中略逊于Gemini-3.0-Pro，但仍保持在其他竞争型号中的显著领先优势。此外，HY-MT1.5-7B为其尺寸级别树立了全新的先进技术，在Flores-200上实现了Gemini-3.0-Pro95%的性能，并在具有挑战性的WMT25和普通话测试中超越了Pro。除了标准翻译外，HY-MT1.5系列还支持高级约束，包括术语干预、上下文感知翻译和格式保持。大量实证评估证实，这两种模型在其各自参数尺度内，为通用和专业翻译任务提供了高度竞争且稳健的解决方案。

GARDO: Reinforcing Diffusion Models without Reward Hacking

GARDO：强化扩散模型而不使用奖励黑客

Authors: Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.24138
Pdf link: https://arxiv.org/pdf/2512.24138
Abstract Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
中文摘要 通过在线强化学习（RL）微调扩散模型，显示出增强文本与图像对齐的巨大潜力。然而，由于精确指定视觉任务的真实目标仍然具有挑战性，模型通常采用只部分捕捉真实目标的代理奖励进行优化。这种不匹配常常导致奖励黑客行为，代理分数上升，而真实图像质量下降，代际多样性崩溃。虽然常见的解决方案会针对参考策略添加正则化以防止奖励黑客行为，但它们会牺牲样本效率，并阻碍探索新的高奖励区域，因为参考策略通常不够优。为了满足样本效率、有效探索和奖励黑客缓解的竞争需求，我们提出了带有多样性感知优化的门控与自适应正则化（GARDO），这是一个兼容多种强化学习算法的多功能框架。我们的关键见解是，正则化不必普遍应用;相反，选择性惩罚表现出高度不确定性的样本子集是非常有效的。为应对探索挑战，GARDO 引入了自适应正则化机制，定期更新参考模型以匹配在线策略的能力，确保正则化目标的相关性。为解决强化学习中的模式崩溃问题，GARDO 放大了高质量且具高多样性样本的奖励，鼓励模式覆盖率同时不破坏优化过程。在多样化代理奖励和未公开指标上的广泛实验一致显示，GARDO 能够减轻奖励被黑客攻击，提升生成多样性，同时不牺牲样本效率或探索，彰显其有效性和稳健性。
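The gating idea in the GARDO abstract, applying regularisation only to a high-uncertainty subset of samples, can be sketched as below. This is a toy illustration, not the paper's objective. The uncertainty scores, the `gate_fraction`, the penalty weight `beta`, and the per-sample KL values are all hypothetical, and the adaptive reference update and diversity bonus are left out.

```python
def gated_objective(rewards, kl_terms, uncertainties, beta=0.1, gate_fraction=0.25):
    """Per-sample objective: reward minus a KL penalty on the most uncertain samples only."""
    n = len(rewards)
    k = max(1, round(gate_fraction * n))
    # Indices of the k samples with the highest uncertainty.
    ranked = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    gated = set(ranked[:k])
    return [
        reward - (beta * kl if i in gated else 0.0)
        for i, (reward, kl) in enumerate(zip(rewards, kl_terms))
    ]

# Four hypothetical samples; only the second one is highly uncertain.
objective = gated_objective(
    rewards=[1.0, 1.0, 1.0, 1.0],
    kl_terms=[2.0, 2.0, 2.0, 2.0],
    uncertainties=[0.1, 0.9, 0.2, 0.3],
    beta=0.5,
    gate_fraction=0.25,
)
```

Samples outside the gate keep their full reward signal, so exploration away from the reference policy is penalised only where the proxy reward is least trustworthy.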

Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

通过方向性解耦对齐在扩散强化学习中对齐的驯服偏好模式崩溃

Authors: Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.24146
Pdf link: https://arxiv.org/pdf/2512.24146
Abstract Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.
中文摘要 近期研究显示，通过人类反馈强化学习，文本到图像扩散模型与人类偏好的对齐取得了显著进展。然而，尽管现有方法在自动奖励指标上得分很高，但它们常常导致偏好模式崩溃（PMC）——这是一种特定的奖励黑客形式，模型趋于狭窄的高分输出（例如单一风格或普遍过度曝光的图像），严重降低生成多样性。在本研究中，我们介绍并量化了这一现象，提出了DivGenBench，这是一个旨在衡量PMC范围的新基准。我们认为这种崩溃是由奖励模型固有偏差的过度优化驱动的。基于该分析，我们提出了方向性解耦对齐（D$^2$-Align）新颖框架，通过对奖励信号进行方向性修正来缓解PMC。具体来说，我们的方法首先在奖励模型的嵌入空间内学习方向修正，同时保持模型冻结。在优化过程中，这种修正会应用到奖励信号上，防止模型崩溃到特定模式，从而保持多样性。我们的综合评估结合了定性分析与质量和多样性的定量指标，显示D$^2$-Align在与人类偏好的一致性上表现优异。

Deep Reinforcement Learning for Solving the Fleet Size and Mix Vehicle Routing Problem

深度强化学习用于求解车队规模与组合车辆路径问题

Authors: Pengfu Wan, Jiawei Chen, Gangyan Xu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.24251
Pdf link: https://arxiv.org/pdf/2512.24251
Abstract The Fleet Size and Mix Vehicle Routing Problem (FSMVRP) is a prominent variant of the Vehicle Routing Problem (VRP), extensively studied in operations research and computational science. FSMVRP requires simultaneous decisions on fleet composition and routing, making it highly applicable to real-world scenarios such as short-term vehicle rental and on-demand logistics. However, these requirements also increase the complexity of FSMVRP, posing significant challenges, particularly in large-scale and time-constrained environments. In this paper, we propose a deep reinforcement learning (DRL)-based approach for solving FSMVRP, capable of generating near-optimal solutions within a few seconds. Specifically, we formulate the problem as a Markov Decision Process (MDP) and develop a novel policy network, termed FRIPN, that seamlessly integrates fleet composition and routing decisions. Our method incorporates specialized input embeddings designed for distinct decision objectives, including a remaining graph embedding to facilitate effective vehicle employment decisions. Comprehensive experiments are conducted on both randomly generated instances and benchmark datasets. The experimental results demonstrate that our method exhibits notable advantages in terms of computational efficiency and scalability, particularly in large-scale and time-constrained scenarios. These strengths highlight the potential of our approach for practical applications and provide valuable inspiration for extending DRL-based techniques to other variants of VRP.
中文摘要 车队规模与组合车辆路径问题（FSMVRP）是车辆路径问题（VRP）的一个重要变体，在运筹学和计算科学领域被广泛研究。FSMVRP要求同时决策车队组成和路线，因此高度适用于短期车辆租赁和按需物流等现实场景。然而，这些要求也增加了FSMVRP的复杂性，尤其在大规模和时间受限的环境中，带来了重大挑战。本文提出一种基于深度强化学习（DRL）的方法来解决FSMVRP，能够在几秒钟内生成近似最优解。具体来说，我们将问题建模为马尔可夫决策过程（MDP），并开发了一个名为FRIPN的新策略网络，能够无缝整合车队组成和路线决策。我们的方法包含针对不同决策目标设计的专门输入嵌入，其中包括剩余图嵌入，以促进有效的车辆启用决策。我们在随机生成的实例和基准数据集上进行了全面的实验。实验结果表明，我们的方法在计算效率和可扩展性方面表现出显著优势，尤其是在大规模和时间受限的场景中。这些优势凸显了我们方法在实际应用中的潜力，并为将基于DRL的技术推广到其他VRP变体提供了宝贵的启发。
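
To make the "simultaneous fleet composition and routing" decision concrete, the sketch below scores a candidate FSMVRP solution in the common textbook form: each vehicle that is actually used pays a fixed cost for its type plus a distance-proportional routing cost. The function name, the cost dictionaries, and the sample numbers are illustrative assumptions and are not taken from the paper; capacity checks are omitted.

```python
import math

def fsmvrp_cost(routes, coords, fixed_cost, unit_cost, depot=0):
    """Total cost of a candidate solution (hypothetical helper).

    routes     : list of (vehicle_type, [customer ids]) pairs, one per vehicle
    coords     : dict mapping node id -> (x, y)
    fixed_cost : dict mapping vehicle_type -> cost of employing one vehicle
    unit_cost  : dict mapping vehicle_type -> cost per unit distance
    """
    total = 0.0
    for vtype, customers in routes:
        if not customers:
            continue  # a vehicle that serves nobody is treated as not employed
        path = [depot] + list(customers) + [depot]
        dist = sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))
        total += fixed_cost[vtype] + unit_cost[vtype] * dist
    return total

# Toy instance: one depot, two customers, two vehicle types (made-up numbers).
coords = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 5.0)}
fixed = {"small": 20.0, "large": 35.0}
unit = {"small": 1.0, "large": 1.5}
one_large = fsmvrp_cost([("large", [1, 2])], coords, fixed, unit)
two_small = fsmvrp_cost([("small", [1]), ("small", [2])], coords, fixed, unit)
```

Comparing `one_large` with `two_small` shows why the fleet mix and the routes cannot be chosen independently: the cheaper fleet depends on how the customers are grouped into routes.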

DRL-TH: Jointly Utilizing Temporal Graph Attention and Hierarchical Fusion for UGV Navigation in Crowded Environments

DRL-TH：在拥挤环境中联合利用时间图注意力和层级融合技术进行UGV导航

Authors: Ruitong Li, Lin Zhang, Yuenan Zhao, Chengxin Liu, Ran Song, Wei Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24284
Pdf link: https://arxiv.org/pdf/2512.24284
Abstract Deep reinforcement learning (DRL) methods have demonstrated potential for autonomous navigation and obstacle avoidance of unmanned ground vehicles (UGVs) in crowded environments. Most existing approaches rely on single-frame observation and employ simple concatenation for multi-modal fusion, which limits their ability to capture temporal context and hinders dynamic adaptability. To address these challenges, we propose a DRL-based navigation framework, DRL-TH, which leverages temporal graph attention and hierarchical graph pooling to integrate historical observations and adaptively fuse multi-modal information. Specifically, we introduce a temporal-guided graph attention network (TG-GAT) that incorporates temporal weights into attention scores to capture correlations between consecutive frames, thereby enabling the implicit estimation of scene evolution. In addition, we design a graph hierarchical abstraction module (GHAM) that applies hierarchical pooling and learnable weighted fusion to dynamically integrate RGB and LiDAR features, achieving balanced representation across multiple scales. Extensive experiments demonstrate that our DRL-TH outperforms existing methods in various crowded environments. We also implemented the DRL-TH control policy on a real UGV and showed that it performed well in real-world scenarios.
中文摘要 深度强化学习（DRL）方法已展示出在拥挤环境中实现无人地面车辆（UGV）自主导航和避障的潜力。大多数现有方法依赖单帧观测，并采用简单拼接实现多模态融合，这限制了它们捕捉时间上下文的能力，并阻碍了动态适应性。为应对这些挑战，我们提出了基于DRL的导航框架DRL-TH，利用时序图注意力和分层图池化整合历史观测并自适应融合多模态信息。具体来说，我们引入了一种时间引导图注意力网络（TG-GAT），该网络将时间权重纳入注意力分数，以捕捉连续帧之间的相关性，从而实现对场景演变的隐式估计。此外，我们设计了一个图层级抽象模块（GHAM），应用分层池化和可学习的加权融合，动态集成RGB和激光雷达特征，实现多尺度的平衡表示。大量实验表明，我们的DRL-TH在各种拥挤环境中优于现有方法。我们还在真实的UGV上实现了DRL-TH控制策略，并证明其在实际场景中表现良好。
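
The abstract does not say how temporal weights enter the attention scores, so the following is only one plausible reading, not the TG-GAT definition: an exponential recency weight is added in log space to each frame's raw score before the softmax. The function name and the `decay` parameter are assumptions made for illustration.

```python
import math

def temporal_attention(scores, decay=0.8):
    """Softmax over per-frame scores with a recency bias (illustrative only).

    scores : raw attention scores, oldest frame first, newest frame last
    decay  : assumed weight multiplier per step of frame age, in (0, 1]
    """
    last = len(scores) - 1
    # Adding age * log(decay) multiplies the unnormalised weight by decay**age.
    biased = [s + (last - t) * math.log(decay) for t, s in enumerate(scores)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    norm = sum(exps)
    return [e / norm for e in exps]

weights = temporal_attention([0.0, 0.0, 0.0])
```

With equal raw scores the newest frame receives the largest weight, and each older frame is discounted by a further factor of `decay`.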

Real-world Reinforcement Learning from Suboptimal Interventions

来自次优干预的现实强化学习

Authors: Yinuo Zhao, Huiqian Jin, Lechun Jiang, Xinyi Zhang, Kun Wu, Pei Ren, Zhiyuan Xu, Zhengping Che, Lei Sun, Dapeng Wu, Chi Harold Liu, Jian Tang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.24288
Pdf link: https://arxiv.org/pdf/2512.24288
Abstract Real-world reinforcement learning (RL) offers a promising approach to training precise and dexterous robotic manipulation policies in an online manner, enabling robots to learn from their own experience while gradually reducing human labor. However, prior real-world RL methods often assume that human interventions are optimal across the entire state space, overlooking the fact that even expert operators cannot consistently provide optimal actions in all states or completely avoid mistakes. Indiscriminately mixing intervention data with robot-collected data inherits the sample inefficiency of RL, while purely imitating intervention data can ultimately degrade the final performance achievable by RL. The question of how to leverage potentially suboptimal and noisy human interventions to accelerate learning without being constrained by them thus remains open. To address this challenge, we propose SiLRI, a state-wise Lagrangian reinforcement learning algorithm for real-world robot manipulation tasks. Specifically, we formulate the online manipulation problem as a constrained RL optimization, where the constraint bound at each state is determined by the uncertainty of human interventions. We then introduce a state-wise Lagrange multiplier and solve the problem via a min-max optimization, jointly optimizing the policy and the Lagrange multiplier to reach a saddle point. Built upon a human-as-copilot teleoperation system, our algorithm is evaluated through real-world experiments on diverse manipulation tasks. Experimental results show that SiLRI effectively exploits human suboptimal interventions, reducing the time required to reach a 90% success rate by at least 50% compared with the state-of-the-art RL method HIL-SERL, and achieving a 100% success rate on long-horizon manipulation tasks where other RL methods struggle to succeed.
中文摘要 现实世界强化学习（RL）提供了一种有前景的方法，可以以在线方式训练精准且灵巧的机器人操作策略，使机器人能够从自身经验中学习，同时逐步减少人力投入。然而，以往的现实世界强化学习方法常假设人类干预在整个状态空间中都是最优的，忽视了即使是专家操作员也无法在所有状态下持续提供最优动作或完全避免错误的事实。无差别地混合干预数据与机器人收集的数据会继承强化学习的样本低效性，而纯粹模仿干预数据最终可能降低强化学习可达到的最终性能。因此，如何利用可能次优且带噪声的人类干预来加速学习，而不被它们所限制，这个问题依然悬而未决。为应对这一挑战，我们提出了SiLRI，一种针对现实世界机器人操作任务的逐状态拉格朗日强化学习算法。具体来说，我们将在线操作问题表述为一个带约束的强化学习优化问题，其中每个状态的约束界限由人类干预的不确定性决定。然后我们引入逐状态的拉格朗日乘子，并通过最小最大优化来求解该问题，联合优化策略和拉格朗日乘子，以达到鞍点。我们的算法基于以人为副驾驶的遥操作系统，通过多种操作任务的真实世界实验进行评估。实验结果显示，SiLRI有效利用了人类次优干预，相比最先进的强化学习方法HIL-SERL，达到90%成功率所需的时间至少减少了50%，并在其他强化学习方法难以成功的长时程操作任务中实现了100%的成功率。
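
The constrained formulation described in the abstract can be written generically as follows. This is a sketch of the usual shape of a state-wise constrained problem and its Lagrangian relaxation; the symbols are assumptions for illustration and are not the paper's notation: $J(\pi)$ is the expected return, $\pi_h$ the human intervention policy, $D$ some divergence between action distributions, $\epsilon(s)$ a per-state bound (presumably looser where the human's interventions are more uncertain), and $\lambda(s) \ge 0$ the state-wise multiplier.

```latex
% Illustrative notation only; the paper's exact objective may differ.
\max_{\pi}\; J(\pi)
\quad \text{s.t.} \quad
D\big(\pi(\cdot \mid s),\, \pi_h(\cdot \mid s)\big) \le \epsilon(s)
\quad \text{for all states } s

% Lagrangian relaxation, solved as a saddle-point (min-max) problem:
\min_{\lambda(\cdot) \ge 0}\; \max_{\pi}\;
J(\pi) \;-\; \mathbb{E}_{s}\Big[\lambda(s)\,\big(D(\pi(\cdot \mid s),\, \pi_h(\cdot \mid s)) - \epsilon(s)\big)\Big]
```

Under this reading, $\lambda(s)$ grows in states where the policy strays further from the human than $\epsilon(s)$ allows and shrinks towards zero elsewhere, so imitation pressure is applied only where the interventions are trusted.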

Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

弄清楚：用主动视觉思维提升推理前沿

Authors: Meiqi Chen, Fandong Meng, Jie Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.24297
Pdf link: https://arxiv.org/pdf/2512.24297
Abstract Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.
中文摘要 复杂的推理问题通常涉及隐含的空间、几何和结构关系，这些关系并未在文本中明确编码。尽管近期推理模型在多个领域取得了强劲表现，纯文本推理在复杂环境中难以表现全局结构约束。本文介绍了FIGR，它通过端到端强化学习将主动视觉思维整合进多回合推理。FIGR通过在问题解决过程中构建视觉表征，将中间结构假设外部化。通过自适应调节何时以及如何调用视觉推理，FIGR使得对难以仅凭文本捕捉的全局结构性质进行更稳定和连贯的推理。对具有挑战性的数学推理基准的实验表明，FIGR优于强文本思维链基线。特别是，FIGR在AIME 2025上提升了基础模型13.12%，在BeyondAIME上提升了11.00%，凸显了图形引导多模态推理在提升复杂推理稳定性和可靠性方面的有效性。

MaRCA: Multi-Agent Reinforcement Learning for Dynamic Computation Allocation in Large-Scale Recommender Systems

MaRCA：大规模推荐系统中动态计算分配的多智能体强化学习

Authors: Wan Jiang, Xinyi Zang, Yudong Zhao, Yusi Zou, Yunfei Lu, Junbo Tong, Yang Liu, Ming Li, Jiani Shi, Xin Yang
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.24325
Pdf link: https://arxiv.org/pdf/2512.24325
Abstract Modern recommender systems face significant computational challenges due to growing model complexity and traffic scale, making efficient computation allocation critical for maximizing business revenue. Existing approaches typically simplify multi-stage computation resource allocation, neglecting inter-stage dependencies, thus limiting global optimality. In this paper, we propose MaRCA, a multi-agent reinforcement learning framework for end-to-end computation resource allocation in large-scale recommender systems. MaRCA models the stages of a recommender system as cooperative agents, using Centralized Training with Decentralized Execution (CTDE) to optimize revenue under computation resource constraints. We introduce an AutoBucket TestBench for accurate computation cost estimation, and a Model Predictive Control (MPC)-based Revenue-Cost Balancer to proactively forecast traffic loads and adjust the revenue-cost trade-off accordingly. Since its end-to-end deployment in the advertising pipeline of a leading global e-commerce platform in November 2024, MaRCA has consistently handled hundreds of billions of ad requests per day and has delivered a 16.67% revenue uplift using existing computation resources.
中文摘要 现代推荐系统因模型复杂度和流量规模的增加而面临重大计算挑战，因此高效的计算分配对于最大化业务收入至关重要。现有方法通常简化多阶段计算资源分配，忽视阶段间依赖，从而限制了全局最优性。本文提出了MaRCA，一种用于大规模推荐系统中端到端计算资源分配的多智能体强化学习框架。MaRCA将推荐系统的各阶段建模为合作智能体，采用集中训练、分散执行（CTDE）以在计算资源约束下优化收入。我们引入了AutoBucket TestBench，用于精确估算计算成本，以及基于模型预测控制（MPC）的收入-成本平衡器，用于主动预测流量负载并相应调整收入与成本的权衡。自2024年11月在一家领先的全球电子商务平台的广告流程中端到端部署以来，MaRCA每天持续处理数千亿次广告请求，并利用现有计算资源实现了16.67%的收入提升。

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

SenseNova-MARS：通过强化学习赋能多模态智能推理与搜索

Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.24330
Pdf link: https://arxiv.org/pdf/2512.24330
Abstract While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
中文摘要 虽然视觉语言模型（VLM）可以通过智能体式推理解决复杂任务，但其能力主要受限于文本导向的思维链或孤立的工具调用。它们未能展现出将动态工具操作与持续推理无缝交织所需的类人熟练度，尤其是在需要协调使用搜索和图像裁剪等外部工具的知识密集型且视觉复杂的场景中。在本研究中，我们介绍了SenseNova-MARS，一种新型多模态智能推理与搜索框架，通过强化学习（RL）赋予VLM交织的视觉推理和工具使用能力。具体来说，SenseNova-MARS 动态集成了图像搜索、文本搜索和图像裁剪工具，以应对细粒度且知识密集型的视觉理解挑战。在强化学习阶段，我们提出了批量归一化组序列策略优化（BN-GSPO）算法，以提升训练稳定性，提升模型有效调用工具和推理的能力。为了全面评估复杂视觉任务中的智能VLM，我们引入了HR-MMSearch基准测试，这是首个以搜索为导向的基准测试，由高分辨率图像和知识密集型、搜索驱动的问题组成。实验表明，SenseNova-MARS 在开源搜索和细粒度图像理解基准测试方面达到了最先进的性能。具体来说，在搜索导向基准测试中，SenseNova-MARS-8B 在 MMSearch 得分为 67.84，在 HR-MMSearch 得分为 41.64，超过了 Gemini-3-Flash 和 GPT-5 等专有模型。SenseNova-MARS 通过提供有效且稳健的工具使用能力，代表了迈向智能型VLM的有希望的一步。为了促进该领域的进一步研究，我们将发布所有代码、模型和数据集。

Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models

逆强化学习和动态离散选择模型的高效推理

Authors: Lars van der Laan, Aurelien Bibaut, Nathan Kallus
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/2512.24407
Pdf link: https://arxiv.org/pdf/2512.24407
Abstract Inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models explain sequential decision-making by recovering reward functions that rationalize observed behavior. Flexible IRL methods typically rely on machine learning but provide no guarantees for valid inference, while classical DDC approaches impose restrictive parametric specifications and often require repeated dynamic programming. We develop a semiparametric framework for debiased inverse reinforcement learning that yields statistically efficient inference for a broad class of reward-dependent functionals in maximum entropy IRL and Gumbel-shock DDC models. We show that the log-behavior policy acts as a pseudo-reward that point-identifies policy value differences and, under a simple normalization, the reward itself. We then formalize these targets, including policy values under known and counterfactual softmax policies and functionals of the normalized reward, as smooth functionals of the behavior policy and transition kernel, establish pathwise differentiability, and derive their efficient influence functions. Building on this characterization, we construct automatic debiased machine-learning estimators that allow flexible nonparametric estimation of nuisance components while achieving $\sqrt{n}$-consistency, asymptotic normality, and semiparametric efficiency. Our framework extends classical inference for DDC models to nonparametric rewards and modern machine-learning tools, providing a unified and computationally tractable approach to statistical inference in IRL.
中文摘要 逆强化学习（IRL）和动态离散选择（DDC）模型通过恢复能够合理化所观察行为的奖励函数来解释序贯决策。灵活的IRL方法通常依赖机器学习，但无法保证推断有效，而经典的DDC方法则施加限制性的参数化设定，且常需反复进行动态规划。我们开发了一个用于去偏逆强化学习的半参数框架，能够在最大熵IRL和Gumbel冲击DDC模型中，对一大类依赖奖励的泛函实现统计高效的推断。我们证明了对数行为策略可作为一种伪奖励，点识别策略价值差异，并在简单归一化下识别奖励本身。随后，我们将这些目标（包括已知和反事实softmax策略下的策略价值以及归一化奖励的泛函）形式化为行为策略和转移核的光滑泛函，建立路径可微性，并推导其有效影响函数。基于这一刻画，我们构建了自动去偏机器学习估计器，允许对干扰成分进行灵活的非参数估计，同时实现$\sqrt{n}$一致性、渐近正态性和半参数有效性。我们的框架将DDC模型的经典推断扩展到非参数奖励和现代机器学习工具，为IRL中的统计推断提供了一种统一且计算上可行的方法。
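
The claim that the log-behavior policy acts as a pseudo-reward can be motivated by the standard soft Bellman equations of maximum entropy IRL. The derivation below is the textbook argument under the assumptions of a discount factor $\gamma < 1$ and a bounded soft value function; it is offered as background and is not necessarily the route the paper takes. Here $r$ is the reward, $\pi_b$ the behavior policy, and $\pi$ any evaluation policy started from $s_0$.

```latex
% Soft Bellman equations for the behavior policy (standard MaxEnt IRL):
Q(s,a) = r(s,a) + \gamma\, \mathbb{E}\big[V(s') \mid s,a\big], \qquad
V(s) = \log \sum_{a'} \exp Q(s,a'), \qquad
\pi_b(a \mid s) = \exp\big(Q(s,a) - V(s)\big)

% Hence the log-policy is the reward plus a potential-based shaping term:
\log \pi_b(a \mid s) = r(s,a) + \gamma\, \mathbb{E}\big[V(s') \mid s,a\big] - V(s)

% Summing along trajectories of any policy \pi, the shaping terms telescope:
\mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} \log \pi_b(a_t \mid s_t) \,\Big|\, s_0\Big]
= \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0\Big] - V(s_0)
```

Because the offset $V(s_0)$ does not depend on the evaluation policy, the difference in value between two policies is the same whether it is computed with $r$ or with $\log \pi_b$, which is why value differences are identified while the reward itself needs an extra normalization.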

Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics

以偏差-噪声-对齐诊断为指导的自适应学习

Authors: Akash Samanta, Sheldon Williamson
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.24445
Pdf link: https://arxiv.org/pdf/2512.24445
Abstract Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Building on this framework, we derive diagnostic-driven instantiations including a stabilized supervised optimizer, a diagnostic-regulated actor-critic scheme, and a diagnostic-conditioned learned optimizer. Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to temporal-difference error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
中文摘要 部署在非平稳和安全关键环境中的学习系统，随着学习动态随时间演变，常常面临不稳定、收敛缓慢或适应脆弱的问题。虽然现代优化、强化学习和元学习方法会根据梯度统计量进行自适应，但它们在很大程度上忽略了误差信号本身的时间结构。本文提出了一个诊断驱动的自适应学习框架，通过一种有原则的分解显式建模误差演化：偏差，刻画持续漂移；噪声，刻画随机波动；对齐，刻画导致超调的重复方向性激励。这些诊断量由损失或时间差分误差轨迹的轻量级统计量在线计算得到，且与模型架构或任务领域无关。我们证明，所提出的偏差-噪声-对齐分解为监督优化、actor-critic强化学习和学习型优化器提供了统一的控制骨干。基于该框架，我们推导出了若干诊断驱动的实例，包括稳定化的监督优化器、诊断调控的actor-critic方案和以诊断量为条件的学习型优化器。在标准光滑性假设下，我们为所有情形建立了有效更新的有界性和稳定性性质。actor-critic学习中的代表性诊断示例展示了所提信号如何根据时间差分误差结构调节自适应过程。总体而言，这项工作将误差演化提升为自适应学习中的一等对象，并为动态环境中的可靠学习提供了可解释且轻量级的基础。

Networked Markets, Fragmented Data: Adaptive Graph Learning for Customer Risk Analytics and Policy Design

网络市场，碎片化数据：客户风险分析与政策设计的自适应图学习

Authors: Lecheng Zheng, Jian Ni, Chris Zobel, John R Birge
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2512.24487
Pdf link: https://arxiv.org/pdf/2512.24487
Abstract Financial institutions face escalating challenges in identifying high-risk customer behaviors within massive transaction networks, where fraudulent activities exploit market fragmentation and institutional boundaries. We address three fundamental problems in customer risk analytics: data silos preventing holistic relationship assessment, extreme behavioral class imbalance, and suboptimal customer intervention strategies that fail to balance compliance costs with relationship value. We develop an integrated customer intelligence framework combining federated learning, relational network analysis, and adaptive targeting policies. Our federated graph neural network enables collaborative behavior modeling across competing institutions without compromising proprietary customer data, using privacy-preserving embeddings to capture cross-market relational patterns. We introduce cross-bank Personalized PageRank to identify coordinated behavioral clusters providing interpretable customer network segmentation for risk managers. A hierarchical reinforcement learning mechanism optimizes dynamic intervention targeting, calibrating escalation policies to maximize prevention value while minimizing customer friction and operational costs. Analyzing 1.4 million customer transactions across seven markets, our approach reduces false positive and false negative rates to 4.64% and 11.07%, substantially outperforming single-institution models. The framework prevents 79.25% of potential losses versus 49.41% under fixed-rule policies, with optimal market-specific targeting thresholds reflecting heterogeneous customer base characteristics. These findings demonstrate that federated customer analytics materially improve both risk management effectiveness and customer relationship outcomes in networked competitive markets.
中文摘要 金融机构在识别庞大交易网络中高风险客户行为时面临日益严峻的挑战，欺诈活动利用市场碎片化和机构界限。我们解决了客户风险分析中的三个根本问题：阻碍整体关系评估的数据孤岛、极端的行为类别不平衡，以及未能平衡合规成本与关系价值的次优客户干预策略。我们开发了一个集成客户情报框架，结合了联邦学习、关系网络分析和自适应定向策略。我们的联邦图神经网络实现了跨竞争机构的协作行为建模，同时不泄露专有客户数据，利用隐私保护的嵌入捕捉跨市场关系模式。我们引入跨银行个性化PageRank，以识别协调的行为集群，为风险管理者提供可解读的客户网络细分。分层强化学习机制优化动态干预目标，校准升级策略以最大化预防价值，同时最小化客户摩擦和运营成本。通过分析跨越七个市场的140万笔客户交易，我们的方法将假阳性和假阴性率分别降至4.64%和11.07%，远超单一机构模型。该框架可防止79.25%的潜在损失，而固定规则政策下的损失为49.41%，最佳市场特定目标阈值反映了异质客户群特征。这些发现表明，联邦客户分析在网络竞争市场中显著提升了风险管理的效能和客户关系结果。
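
For readers unfamiliar with Personalized PageRank, the sketch below runs the standard power iteration on a tiny directed graph: a random walk that restarts at a chosen seed set with probability `alpha`. It shows only the generic algorithm. How the paper combines graphs across banks is not described in the abstract and is not modeled here, and the function and parameter names are assumptions.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank (generic textbook version).

    adj   : dict node -> list of out-neighbours (every neighbour must be a key)
    seeds : set of nodes the random walk restarts from
    alpha : restart probability
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            walk_mass = (1.0 - alpha) * rank[n]
            if adj[n]:
                share = walk_mass / len(adj[n])
                for m in adj[n]:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the seed set.
                for m in nodes:
                    nxt[m] += walk_mass * restart[m]
        rank = nxt
    return rank

# Two disconnected account clusters; the walk is seeded at account "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
scores = personalized_pagerank(graph, seeds={"a"})
```

Nodes in the cluster that contains the seed receive positive scores, while the unreachable cluster scores zero, which is the property that makes the scores usable for seed-relative segmentation.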

From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning

从构建模块到规划：LLM中的多步空间推理与强化学习

Authors: Amir Tahmasbi, Sadegh Majidi, Kazem Taram, Aniket Bera
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.24532
Pdf link: https://arxiv.org/pdf/2512.24532
Abstract Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step planning in puzzle-based environments, in a closed-loop manner. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training compared to end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
中文摘要 大型语言模型（LLMs）中的空间推理因在导航和规划中的应用而日益受到关注。尽管具备强大的通用语言能力，LLMs在结构化环境中仍难以应对空间变换和多步规划。我们提出了一种两阶段方法，将空间推理分解为原子构建模块及其组成。首先，我们对旋转、平移和缩放等基本空间变换进行监督微调，以赋予模型基础空间物理。然后，我们将该物理感知模型冻结，并在GRPO框架内训练轻量级LoRA适配器，学习构成这些构建模块的策略，以闭环方式实现基于谜题的多步规划。为支持该流程，我们综合了一个ASCII艺术数据集，构建了相应的基于ASCII的强化学习环境。我们的方法在动态环境（带有显式状态更新）和静态环境中（模型必须依赖内部状态）中，始终优于基线模型，包括通用骨干、物理感知模型和端到端强化学习模型。此外，该方法收敛速度更快，训练稳定性也更稳定，相较于从零开始的端到端强化学习。最后，我们分析注意力模式，评估微调是否能显著提升空间理解。

From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme

从感知到笑点：用野外表情包艺术赋能VLM

Authors: Xueyan Li, Yingyi Xue, Mengjie Jiang, Qingzi Zhu, Yazhe Niu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.24555
Pdf link: https://arxiv.org/pdf/2512.24555
Abstract Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision. It requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, which traces back from ground-truth captions, enhances reasoning diversity. We further analyze that this multi-path exploration with anchoring maintains a high expected humor quality, under the practical condition that high-quality paths retain significant probability mass. Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template. Following established theory, this approach ensures a consistent and robust proxy for human preference, even with subjective and noisy labels. The reward model then enables a group-wise reinforcement learning optimization, providing a theoretical guarantee for monotonic improvement within the trust region. Extensive experiments show that HUMOR empowers various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality. Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output groups.
中文摘要 生成幽默表情包是一项具有挑战性的多模态任务，超越了直接从图片到字幕的监督。它需要对视觉内容、语境线索和主观幽默进行细致的推理。为了弥合视觉感知与幽默笑点创作之间的差距，我们提出了HUMOR，这是一个新颖框架，引导VLM进行层级推理，并使其与组级人类偏好保持一致。首先，HUMOR采用了层级、多路径思维链（CoT）：模型从确定模板级意图开始，然后探索不同上下文下的多样推理路径，最终锚定在高质量且特定情境的路径上。这种从真实字幕回溯得到的CoT监督增强了推理多样性。我们进一步分析表明，在高质量路径保留显著概率质量的实际条件下，这种带有锚定的多路径探索能保持较高的预期幽默质量。其次，为了捕捉主观幽默，我们训练了一个成对奖励模型，该模型在共享相同模板的表情包组内运行。遵循既定理论，这种方法即使在标签主观且带噪声的情况下，也能确保得到一致且稳健的人类偏好代理。奖励模型随后支持组级强化学习优化，为信任区域内的单调改进提供了理论保证。大量实验表明，HUMOR赋予多种VLM更优越的推理多样性、更可靠的偏好对齐以及更高的整体表情包质量。除了表情包之外，我们的工作还提出了一种面向开放式、与人类对齐的多模态生成的通用训练范式，其中成功由连贯输出组内的比较判断来引导。

Reinforcement Learning-Augmented LLM Agents for Collaborative Decision Making and Performance Optimization

强化学习增强型LLM代理用于协作决策和性能优化

Authors: Dong Qiu, Duo Xu, Limengxi Yue
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24609
Pdf link: https://arxiv.org/pdf/2512.24609
Abstract Large Language Models (LLMs) perform well in language tasks but often lack collaborative awareness and struggle to optimize global performance in multi-agent settings. We present a reinforcement learning-augmented LLM agent framework that formulates cooperation as a decentralized partially observable Markov decision process (Dec-POMDP) and adopts centralized training with decentralized execution (CTDE). We introduce Group Relative Policy Optimization (GRPO) to jointly optimize agent policies with access to global signals during training, together with a simplified joint reward that balances task quality, speed, and coordination cost. On collaborative writing and coding benchmarks, our framework delivers a 3x increase in task processing speed over single-agent baselines, 98.7% structural/style consistency in writing, and a 74.6% test pass rate in coding. The approach consistently outperforms strong multi-agent LLM baselines and provides a practical path toward reliable collaboration in complex workflows.
中文摘要 大型语言模型（LLMs）在语言任务中表现良好，但通常缺乏协作意识，难以在多智能体环境中优化全局性能。我们提出了一个强化学习增强型LLM智能体框架，将合作建模为去中心化部分可观察马尔可夫决策过程（Dec-POMDP），并采用集中训练、分散执行（CTDE）。我们引入了组相对策略优化（GRPO），在训练期间利用全局信号联合优化智能体策略，并结合简化的联合奖励，平衡任务质量、速度和协调成本。在协作写作和编码基准测试中，我们的框架任务处理速度比单智能体基线提升了3倍，写作结构/风格一致性达到98.7%，编码测试通过率为74.6%。该方法始终优于强大的多智能体LLM基线，并为复杂工作流程中实现可靠协作提供了实用路径。
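
The core of GRPO is that each sampled response is scored relative to the other responses in its group rather than by a learned value function. The sketch below shows that normalisation step together with a made-up joint reward that trades off quality, speed, and coordination cost. The weights, argument names, and the use of the population standard deviation are assumptions for illustration; the abstract does not specify the paper's reward.

```python
import statistics

def joint_reward(quality, seconds, messages, w_q=1.0, w_t=0.01, w_c=0.05):
    """Hypothetical joint reward: quality minus time and coordination penalties."""
    return w_q * quality - w_t * seconds - w_c * messages

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task, each given as (quality, seconds, messages).
rollouts = [(0.9, 12.0, 3), (0.6, 8.0, 1), (0.9, 30.0, 6), (0.4, 5.0, 0)]
rewards = [joint_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's mean reward get positive advantages and are reinforced, while those below the mean are pushed down, without training a separate critic.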

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Youtu-Agent：通过自动生成和混合策略优化提升智能体生产力

Authors: Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24615
Pdf link: https://arxiv.org/pdf/2512.24615
Abstract Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.
中文摘要 现有的大型语言模型（LLM）智能体框架面临两大挑战：高配置成本和静态能力。构建高质量智能体通常需要在工具集成和提示工程上投入大量人工，而部署后的智能体若不进行昂贵的微调，则难以适应动态环境。为解决这些问题，我们提出了Youtu-Agent，一个模块化框架，旨在实现LLM智能体的自动生成和持续演进。Youtu-Agent具有结构化配置系统，能够解耦执行环境、工具包和上下文管理，实现灵活的重用和自动合成。我们引入了两种生成范式：用于标准任务的Workflow模式，以及用于复杂、非标准需求的Meta-Agent模式，能够自动生成工具代码、提示和配置。此外，Youtu-Agent建立了混合策略优化系统：（1）Agent Practice模块，使智能体能够通过上下文内优化积累经验并提升性能，而无需参数更新；以及（2）Agent RL模块，与分布式训练框架集成，以端到端、大规模的方式实现对任意Youtu-Agent的可扩展且稳定的强化学习。实验表明，Youtu-Agent使用开放权重模型，在WebWalkerQA（71.47%）和GAIA（72.8%）上均达到了最先进的性能。我们的自动化生成流程实现了超过81%的工具合成成功率，而Practice模块在AIME 2024/2025上分别将性能提升了+2.7%和+5.4%。此外，我们的Agent RL训练在7B LLM上实现了40%的加速和稳定的性能提升，在数学和通用/多跳QA基准上分别将编码/推理能力和搜索能力提升至多35%和21%。

Hybrid Motion Planning with Deep Reinforcement Learning for Mobile Robot Navigation

结合深度强化学习的混合运动规划，用于移动机器人导航

Authors: Yury Kolomeytsev, Dmitry Golembiovsky
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.24651
Pdf link: https://arxiv.org/pdf/2512.24651
Abstract Autonomous mobile robots operating in complex, dynamic environments face the dual challenge of navigating large-scale, structurally diverse spaces with static obstacles while safely interacting with various moving agents. Traditional graph-based planners excel at long-range pathfinding but lack reactivity, while Deep Reinforcement Learning (DRL) methods demonstrate strong collision avoidance but often fail to reach distant goals due to a lack of global context. We propose Hybrid Motion Planning with Deep Reinforcement Learning (HMP-DRL), a hybrid framework that bridges this gap. Our approach utilizes a graph-based global planner to generate a path, which is integrated into a local DRL policy via a sequence of checkpoints encoded in both the state space and reward function. To ensure social compliance, the local planner employs an entity-aware reward structure that dynamically adjusts safety margins and penalties based on the semantic type of surrounding agents. We validate the proposed method through extensive testing in a realistic simulation environment derived from real-world map data. Comprehensive experiments demonstrate that HMP-DRL consistently outperforms other methods, including state-of-the-art approaches, in terms of key metrics of robot navigation: success rate, collision rate, and time to reach the goal. Overall, these findings confirm that integrating long-term path guidance with semantically-aware local control significantly enhances both the safety and reliability of autonomous navigation in complex human-centric settings.
中文摘要 在复杂动态环境中运行的自主移动机器人面临双重挑战：既要在含有静态障碍物的大规模、结构多样的空间中导航，又要与各类移动智能体安全交互。传统的基于图的规划器擅长长距离寻路，但缺乏反应性；而深度强化学习（DRL）方法具有较强的避碰能力，却因缺乏全局上下文而常常无法到达远距离目标。我们提出了结合深度强化学习的混合运动规划（HMP-DRL），这是一种弥合这一差距的混合框架。我们的方法利用基于图的全局规划器生成路径，并通过同时编码在状态空间和奖励函数中的检查点序列，将该路径整合进局部DRL策略。为确保符合社会规范，局部规划器采用实体感知的奖励结构，根据周围智能体的语义类型动态调整安全裕度和惩罚。我们在基于真实地图数据构建的逼真仿真环境中进行了大量测试，验证了所提方法。综合实验表明，HMP-DRL在机器人导航的关键指标（成功率、碰撞率和到达目标时间）上始终优于包括最先进方法在内的其他方法。总体而言，这些发现证实，将长期路径引导与语义感知的局部控制相结合，可显著提升复杂的以人为中心环境中自主导航的安全性和可靠性。
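
The abstract describes a reward that combines checkpoint guidance from a global path with type-dependent safety margins. A minimal sketch of what such a reward might look like follows. The entity types, margin and penalty values, gains, and the `step_reward` function itself are illustrative assumptions, not the authors' formulation.

```python
import math
from dataclasses import dataclass

# Hypothetical (safety margin in metres, penalty weight) per semantic type.
# The abstract does not give the paper's actual types or values.
SAFETY = {
    "pedestrian": (1.0, 2.0),
    "cyclist": (1.5, 3.0),
    "vehicle": (2.0, 5.0),
}


@dataclass
class Agent:
    x: float
    y: float
    kind: str  # key into SAFETY


def step_reward(robot_xy, prev_xy, checkpoints, next_idx, agents,
                reach_radius=0.5, progress_gain=1.0, checkpoint_bonus=5.0):
    """Return (reward, next checkpoint index) for one control step."""
    reward = 0.0
    if next_idx < len(checkpoints):
        target = checkpoints[next_idx]
        # Progress term: positive when the robot moved closer to the checkpoint.
        reward += progress_gain * (
            math.dist(prev_xy, target) - math.dist(robot_xy, target)
        )
        if math.dist(robot_xy, target) <= reach_radius:
            reward += checkpoint_bonus
            next_idx += 1
    # Entity-aware term: margin and penalty depend on the agent's type.
    for agent in agents:
        margin, penalty = SAFETY[agent.kind]
        distance = math.dist(robot_xy, (agent.x, agent.y))
        if distance < margin:
            reward -= penalty * (margin - distance) / margin
    return reward, next_idx
```

In this sketch the same intrusion distance costs more near a vehicle than near a pedestrian, which is one simple way to make safety margins depend on semantic type.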

RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence

RoboMIND 2.0：面向可泛化具身智能的多模态双臂移动操作数据集

Authors: Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, Chenyang Gu, Zhuoyang Liu, Nuowei Han, Xiangju Mi, Yaoxu Lv, Yankai Fu, Gaole Dai, Langzhe Gu, Tao Li, Yuheng Zhang, Yixue Zhang, Xinhua Wang, Shichao Fan, Meng Li, Zhen Zhao, Ning Liu, Zhiyuan Xu, Pei Ren, Junjie Ji, Haonan Liu, Kuan Cheng, Shanghang Zhang, Jian Tang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.24653
Pdf link: https://arxiv.org/pdf/2512.24653
Abstract While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited. To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research in contact-rich and spatially extended tasks, the dataset incorporates 12K tactile-enhanced episodes and 20K mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments, releasing an additional 20K-trajectory simulated dataset to facilitate robust sim-to-real transfer. To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 integrates a high-level semantic planner (MIND-2-VLM) to decompose abstract natural language instructions into grounded subgoals, coupled with a low-level Vision-Language-Action executor (MIND-2-VLA), which generates precise, proprioception-aware motor actions.
中文摘要 尽管数据驱动的模仿学习彻底改变了机器人操作，但当前方法仍受制于大规模、多样化真实世界演示的稀缺。因此，现有模型在长时程双臂任务以及非结构化环境中的移动操作上的泛化能力仍然有限。为弥合这一差距，我们提出RoboMIND 2.0，这是一个全面的真实世界数据集，包含超过31万条双臂操作轨迹，采集自六种不同的机器人形态，涵盖739个复杂任务。尤为重要的是，为支持接触丰富和空间延展任务的研究，数据集包含1.2万条触觉增强片段和2万条移动操作轨迹。作为对物理数据的补充，我们构建了真实世界环境的高保真数字孪生，并额外发布了包含2万条轨迹的仿真数据集，以促进稳健的仿真到现实迁移。为充分发挥RoboMIND 2.0的潜力，我们提出了MIND-2系统，这是一种通过离线强化学习优化的分层双系统框架。MIND-2集成了高层语义规划器（MIND-2-VLM），将抽象的自然语言指令分解为具体落地的子目标，并配合低层视觉-语言-动作执行器（MIND-2-VLA），生成精确且感知本体状态的运动动作。

Hierarchical Online Optimization Approach for IRS-enabled Low-altitude MEC in Vehicular Networks

车载网络中IRS支持的低空MEC的分层在线优化方法

Authors: Yixian Wang, Geng Sun, Zemin Sun, Jiacheng Wang, Changyuan Zhao, Daxin Tian, Dusit Niyato, Shiwen Mao
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.24659
Pdf link: https://arxiv.org/pdf/2512.24659
Abstract In this paper, we propose an intelligent reflecting surface (IRS)-enabled low-altitude multi-access edge computing (MEC) architecture, where an aerial MEC server cooperates with a terrestrial MEC server to provide computing services, while hybrid IRSs (i.e., building-installed and UAV-carried IRSs) are deployed to enhance the air-ground connectivity under blockage. Based on this architecture, we formulate a multi-objective optimization problem (MOOP) to minimize the task completion delay and energy consumption by jointly optimizing task offloading, UAV trajectory control, IRS phase-shift configuration, and computation resource allocation. The considered problem is NP-hard, and thus we propose a hierarchical online optimization approach (HOOA) to efficiently solve the problem. Specifically, we reformulate the MOOP as a Stackelberg game, where MEC servers collectively act as the leader to determine the system-level decisions, while the vehicles act as followers to make individual decisions. At the follower level, we present a many-to-one matching mechanism to generate feasible discrete decisions. At the leader level, we propose a generative diffusion model-enhanced twin delayed deep deterministic policy gradient (GDMTD3) algorithm integrated with a Karush-Kuhn-Tucker (KKT)-based method, which is a deep reinforcement learning (DRL)-based approach, to determine the continuous decisions. Simulation results demonstrate that the proposed HOOA achieves significant improvements, which reduces average task completion delay by 2.5% and average energy consumption by 3.1% compared with the best-performing benchmark approach and state-of-the-art DRL algorithm, respectively. Moreover, the proposed HOOA exhibits superior convergence stability while maintaining strong robustness and scalability in dynamic environments.
中文摘要 本文提出一种智能反射面（IRS）支持的低空多接入边缘计算（MEC）架构，其中空中MEC服务器与地面MEC服务器协作提供计算服务，同时部署混合IRS（即建筑物安装的IRS和无人机搭载的IRS），以增强遮挡条件下的空地连接。基于该架构，我们构建了一个多目标优化问题（MOOP），通过联合优化任务卸载、无人机轨迹控制、IRS相移配置和计算资源分配，最小化任务完成时延和能耗。该问题是NP难的，因此我们提出了一种分层在线优化方法（HOOA）来高效求解。具体来说，我们将MOOP重新表述为Stackelberg博弈，其中MEC服务器共同作为领导者确定系统级决策，车辆则作为跟随者做出各自的决策。在跟随者层面，我们提出了多对一匹配机制，以生成可行的离散决策。在领导者层面，我们提出一种生成扩散模型增强的双延迟深度确定性策略梯度（GDMTD3）算法，并结合基于Karush-Kuhn-Tucker（KKT）的方法来确定连续决策，这是一种基于深度强化学习（DRL）的方法。仿真结果表明，所提HOOA取得了显著改进：与性能最佳的基准方法和最先进的DRL算法相比，平均任务完成时延和平均能耗分别降低了2.5%和3.1%。此外，所提HOOA展现出更优的收敛稳定性，同时在动态环境中保持较强的鲁棒性和可扩展性。
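
The follower-level step is described only as a "many-to-one matching mechanism". One standard way to realize such a matching is deferred acceptance with server quotas, sketched below. This is a swapped-in textbook technique and not necessarily the authors' mechanism; the preference lists, scores, and capacities are hypothetical inputs.

```python
def many_to_one_matching(vehicle_prefs, server_scores, capacity):
    """Deferred-acceptance matching of vehicles to capacity-limited servers.

    vehicle_prefs: {vehicle: [server, ...]} in descending preference.
    server_scores: {server: {vehicle: score}}, higher score is preferred.
    capacity:      {server: max number of vehicles it can serve}.
    Returns {server: [accepted vehicles]}.
    """
    next_choice = {v: 0 for v in vehicle_prefs}
    accepted = {s: [] for s in capacity}
    free = list(vehicle_prefs)
    while free:
        v = free.pop()
        if next_choice[v] >= len(vehicle_prefs[v]):
            continue  # v exhausted its list and stays unmatched
        s = vehicle_prefs[v][next_choice[v]]
        next_choice[v] += 1
        accepted[s].append(v)
        if len(accepted[s]) > capacity[s]:
            # Over quota: the server rejects its least preferred vehicle.
            worst = min(accepted[s], key=lambda u: server_scores[s][u])
            accepted[s].remove(worst)
            free.append(worst)
    return accepted
```

A vehicle left unmatched in this sketch could, for example, fall back to local computing; the abstract does not say how the paper treats that case.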

Dynamic Policy Learning for Legged Robot with Simplified Model Pretraining and Model Homotopy Transfer

基于简化模型预训练和模型同伦迁移的足式机器人动态策略学习

Authors: Dongyun Kang, Min-Gyu Kim, Tae-Gyu Song, Hajun Kim, Sehoon Ha, Hae-Won Park
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.24698
Pdf link: https://arxiv.org/pdf/2512.24698
Abstract Generating dynamic motions for legged robots remains a challenging problem. While reinforcement learning has achieved notable success in various legged locomotion tasks, producing highly dynamic behaviors often requires extensive reward tuning or high-quality demonstrations. Leveraging reduced-order models can help mitigate these challenges. However, the model discrepancy poses a significant challenge when transferring policies to full-body dynamics environments. In this work, we introduce a continuation-based learning framework that combines simplified model pretraining and model homotopy transfer to efficiently generate and refine complex dynamic behaviors. First, we pretrain the policy using a single rigid body model to capture core motion patterns in a simplified environment. Next, we employ a continuation strategy to progressively transfer the policy to the full-body environment, minimizing performance loss. To define the continuation path, we introduce a model homotopy from the single rigid body model to the full-body model by gradually redistributing mass and inertia between the trunk and legs. The proposed method not only achieves faster convergence but also demonstrates superior stability during the transfer process compared to baseline methods. Our framework is validated on a range of dynamic tasks, including flips and wall-assisted maneuvers, and is successfully deployed on a real quadrupedal robot.
中文摘要 为足式机器人生成动态运动仍是一个具有挑战性的问题。虽然强化学习在各种足式运动任务中取得了显著成功，但要产生高动态行为，通常需要大量的奖励调优或高质量的演示。利用降阶模型有助于缓解这些挑战。然而，在将策略迁移到全身动力学环境时，模型差异会带来重大挑战。本研究引入了一个基于延拓的学习框架，结合简化模型预训练和模型同伦迁移，高效地生成并优化复杂的动态行为。首先，我们使用单刚体模型预训练策略，在简化环境中捕捉核心运动模式。接下来，我们采用延拓策略，逐步将策略迁移到全身环境，最大限度地减少性能损失。为了定义延拓路径，我们通过在躯干和腿部之间逐步重新分配质量和惯量，引入一个从单刚体模型到全身模型的模型同伦。与基线方法相比，所提方法不仅收敛更快，而且在迁移过程中表现出更好的稳定性。我们的框架已在多种动态任务上得到验证，包括翻转和借助墙面的机动动作，并成功部署在真实的四足机器人上。
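
The model homotopy is described as gradually redistributing mass and inertia between the trunk and legs. A minimal sketch of the mass part is given below, assuming a linear interpolation in a parameter `lam`. The linear form, the function names, and the evenly spaced schedule are assumptions, and inertia is omitted for brevity.

```python
def homotopy_masses(trunk_mass_full, leg_masses_full, lam):
    """Mass distribution at homotopy parameter lam in [0, 1].

    lam = 0: single-rigid-body model (all mass lumped into the trunk).
    lam = 1: full-body model with its true trunk and leg masses.
    Total mass is conserved for every lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    total = trunk_mass_full + sum(leg_masses_full)
    legs = [lam * m for m in leg_masses_full]
    trunk = total - sum(legs)
    return trunk, legs


def continuation_schedule(n_stages):
    """Evenly spaced lam values from 0 to 1, one per training stage."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    return [k / (n_stages - 1) for k in range(n_stages)]
```

A training loop would fine-tune the policy at each `lam` in the schedule, warm-starting from the previous stage, so the policy never faces the full model discrepancy at once.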

Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

进化而非训练：通过进化提示进行零样本推理分割

Authors: Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.24702
Pdf link: https://arxiv.org/pdf/2512.24702
Abstract Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at this https URL.
中文摘要 推理分割要求模型理解复杂的、依赖上下文的语言查询，以实现像素级定位。当前主流方法高度依赖监督微调（SFT）或强化学习（RL）。然而，SFT存在灾难性遗忘和领域依赖问题，而RL常受训练不稳定以及对预定义奖励函数的僵化依赖所阻碍。尽管最近的免训练方法绕过了这些训练负担，但它们从根本上受限于静态推理范式。这些方法通常依赖单次的“先生成后分割”链路，推理深度不足，且缺乏自我纠正语言幻觉或空间误解的能力。本文针对这些局限性，提出EVOL-SAM3，一种新颖的零样本框架，将推理分割重新表述为推理时的进化搜索过程。EVOL-SAM3不依赖固定提示，而是维护一个提示假设种群，并通过“生成-评估-进化”循环对其迭代优化。我们引入视觉竞技场（Visual Arena），通过无参考的成对锦标赛评估提示的适应度，并引入语义变异算子以注入多样性并纠正语义错误。此外，异构竞技场（Heterogeneous Arena）模块将几何先验与语义推理相结合，确保最终选择的稳健性。大量实验表明，在零样本设置下，EVOL-SAM3不仅大幅优于静态基线，而且在具有挑战性的ReasonSeg基准上显著超越了全监督的最先进方法。代码可在此 https URL 获取。
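
The "Generate-Evaluate-Evolve" loop with pairwise tournaments can be illustrated with a toy evolutionary search over text prompts. Everything below is a stand-in: in the paper, fitness comes from a Visual Arena that compares segmentation outputs, whereas `toy_judge` just counts overlap with a hidden word set, and `toy_mutate` appends a random word.

```python
import random


def tournament_wins(population, judge):
    """Round-robin tournament; judge(a, b) is True when a is preferred to b."""
    wins = [0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            if judge(population[i], population[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    return wins


def evolve_prompts(seed_prompts, judge, mutate, generations=5, keep=2, rng=None):
    """Keep the tournament winners and refill the population with mutants."""
    rng = rng or random.Random(0)
    population = list(seed_prompts)
    for _ in range(generations):
        wins = tournament_wins(population, judge)
        order = sorted(range(len(population)), key=lambda i: -wins[i])
        survivors = [population[i] for i in order[:keep]]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(len(population) - keep)]
        population = survivors + children
    wins = tournament_wins(population, judge)
    return population[max(range(len(population)), key=wins.__getitem__)]


# Toy stand-ins (hypothetical): fitness is overlap with a hidden target set.
TARGET = {"red", "mug", "left"}
VOCAB = ["red", "blue", "mug", "plate", "left", "right"]


def toy_score(prompt):
    return len(TARGET & set(prompt.split()))


def toy_judge(a, b):
    return toy_score(a) >= toy_score(b)


def toy_mutate(prompt, rng):
    return prompt + " " + rng.choice(VOCAB)
```

Because the judge is reference-free and pairwise, the loop never needs an absolute fitness value, only a preference between two candidates.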

Control of Microrobots with Reinforcement Learning under On-Device Compute Constraints

在设备端计算约束下，利用强化学习控制微型机器人

Authors: Yichen Liu, Kesava Viswanadha, Zhongyu Li, Nelson Lojo, Kristofer S. J. Pister
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.24740
Pdf link: https://arxiv.org/pdf/2512.24740
Abstract An important function of autonomous microrobots is the ability to perform robust movement over terrain. This paper explores an edge ML approach to microrobot locomotion, allowing for on-device, lower latency control under compute, memory, and power constraints. This paper explores the locomotion of a sub-centimeter quadrupedal microrobot via reinforcement learning (RL) and deploys the resulting controller on an ultra-small system-on-chip (SoC), SC$\mu$M-3C, featuring an ARM Cortex-M0 microcontroller running at 5 MHz. We train a compact FP32 multilayer perceptron (MLP) policy with two hidden layers ($[128, 64]$) in a massively parallel GPU simulation and enhance robustness by utilizing domain randomization over simulation parameters. We then study integer (Int8) quantization (per-tensor and per-feature) to allow for higher inference update rates on our resource-limited hardware, and we connect hardware power budgets to achievable update frequency via a cycles-per-update model for inference on our Cortex-M0. We propose a resource-aware gait scheduling viewpoint: given a device power budget, we can select the gait mode (trot/intermediate/gallop) that maximizes expected RL reward at a corresponding feasible update frequency. Finally, we deploy our MLP policy on a real-world large-scale robot on uneven terrain, qualitatively noting that domain-randomized training can improve out-of-distribution stability. We do not claim real-world large-robot empirical zero-shot transfer in this work.
中文摘要 自主微型机器人的一项重要功能是能够在地形上稳健移动。本文探讨了一种用于微型机器人运动的边缘机器学习方法，使其能够在计算、内存和功耗约束下实现设备端的低延迟控制。本文通过强化学习（RL）研究亚厘米级四足微型机器人的运动，并将所得控制器部署在超小型片上系统（SoC）SC$\mu$M-3C上，该芯片搭载运行于5 MHz的ARM Cortex-M0微控制器。我们在大规模并行GPU仿真中训练了一个紧凑的FP32多层感知器（MLP）策略，该策略具有两个隐藏层（$[128, 64]$），并通过对仿真参数进行域随机化来增强鲁棒性。随后，我们研究了整数（Int8）量化（逐张量和逐特征），以便在资源受限的硬件上实现更高的推理更新率，并通过Cortex-M0上推理的每次更新周期数模型，将硬件功耗预算与可实现的更新频率联系起来。我们提出了一种资源感知的步态调度视角：给定设备功耗预算，可以选择在相应可行更新频率下使期望RL奖励最大化的步态模式（小跑/中间/疾驰）。最后，我们将MLP策略部署在不平坦地形上的真实大型机器人上，并定性地指出域随机化训练可以提升分布外稳定性。本工作不声称实现了真实世界大型机器人的经验性零样本迁移。
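
The quantization and cycles-per-update ideas can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: symmetric Int8 quantization is one common scheme and may differ from the paper's; reading "per-feature" as one scale per weight row is a guess; the input and output layer sizes and the cycles-per-MAC figure in the usage note are hypothetical, and only the hidden sizes (128, 64) and the 5 MHz clock come from the abstract.

```python
def quantize_per_tensor(weights):
    """Symmetric Int8 quantization with a single scale for the whole matrix."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale


def quantize_per_row(weights):
    """One scale per row, so small-magnitude rows keep more resolution."""
    pairs = [quantize_per_tensor([row]) for row in weights]
    return [q[0] for q, _ in pairs], [s for _, s in pairs]


def max_update_rate_hz(layer_sizes, clock_hz, cycles_per_mac):
    """Upper bound on policy updates per second from a cycles-per-update model.

    Counts only multiply-accumulates of the dense layers; activation and
    memory overheads are ignored in this sketch.
    """
    macs = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return clock_hz / (macs * cycles_per_mac)
```

For example, `max_update_rate_hz([16, 128, 64, 8], 5_000_000, 4)` estimates the update rate of a 16-input, 8-output policy on a 5 MHz core at an assumed 4 cycles per multiply-accumulate.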

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Dream2Flow：以3D物体流连接视频生成与开放世界操作

Authors: Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.24766
Pdf link: https://arxiv.org/pdf/2512.24766
Abstract Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at this https URL.
中文摘要 生成式视频建模已成为一种引人注目的工具，能够以零样本方式推理开放世界操作中合理的物理交互。然而，将这种由人类主导的动作转化为机器人系统所需的低层动作仍是一项挑战。我们观察到，给定初始图像和任务指令，这些模型擅长合成合理的物体运动。因此，我们提出Dream2Flow，一个以3D物体流作为中间表示、连接视频生成与机器人控制的框架。我们的方法从生成的视频中重建3D物体运动，并将操作表述为物体轨迹跟踪。通过将状态变化与实现这些变化的执行器分离，Dream2Flow克服了具身差距，使预训练视频模型能够以零样本方式指导对多种类别物体的操作，包括刚性、铰接、可变形和颗粒状物体。通过轨迹优化或强化学习，Dream2Flow将重建的3D物体流转换为可执行的低层指令，无需任务特定的演示。仿真和真实世界实验表明，3D物体流是一种通用且可扩展的接口，可用于使视频生成模型适配开放世界机器人操作。视频和可视化内容可在此 https URL 查看。

Throughput Optimization in UAV-Mounted RIS under Jittering and Imperfect CSI via DRL

在抖动和不完美CSI下的无人机安装RIS中通过DRL实现吞吐量优化

Authors: Anas K. Saeed, Mahmoud M. Salim, Ali Arshad Nasir, Ali H. Muqaibel
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2512.24773
Pdf link: https://arxiv.org/pdf/2512.24773
Abstract Reconfigurable intelligent surfaces (RISs) mounted on unmanned aerial vehicles (UAVs) can reshape wireless propagation on-demand. However, their performance is sensitive to UAV jitter and cascaded channel uncertainty. This paper investigates a downlink multiple-input single-output UAV-mounted RIS system in which a ground multiple-antenna base station (BS) serves multiple single-antenna users under practical impairments. Our goal is to maximize the expected throughput under stochastic three-dimensional UAV jitter and imperfect cascaded channel state information (CSI) based only on the available channel estimates. This leads to a stochastic nonconvex optimization problem subject to a BS transmit power constraint and strict unit-modulus constraints on all RIS elements. To address this problem, we design a model-free deep reinforcement learning (DRL) framework with a contextual bandit formulation. A differentiable feasibility layer is utilized to map continuous actions to feasible solutions, while the reward is a Monte Carlo estimate of the expected throughput. We instantiate this framework with constrained variants of deep deterministic policy gradient (DDPG) and twin delayed deep deterministic policy gradient (TD3) that do not use target networks. Simulations show that the proposed algorithms yield higher throughput than conventional alternating optimization-based weighted minimum mean-square error (AO-WMMSE) baselines under severe jitter and low CSI quality. Across different scenarios, the proposed methods achieve performance that is either comparable to or slightly below the AO-WMMSE benchmark, based on sample average approximation (SAA) with a relative gap ranging from 0-12%. Moreover, the proposed DRL controllers achieve online inference times of 0.6 ms per decision versus roughly 370-550 ms for AO-WMMSE solvers.
中文摘要 安装在无人机（UAV）上的可重构智能表面（RIS）可以按需重塑无线传播。然而，其性能对无人机抖动和级联信道不确定性非常敏感。本文研究了一种下行多输入单输出的无人机搭载RIS系统，其中地面多天线基站（BS）在实际非理想因素下为多个单天线用户提供服务。我们的目标是仅基于可用的信道估计，在随机三维无人机抖动和不完美级联信道状态信息（CSI）下最大化期望吞吐量。这导出了一个随机非凸优化问题，其约束包括BS发射功率约束以及所有RIS单元上严格的单位模约束。为解决这一问题，我们设计了一个采用上下文赌博机（contextual bandit）表述的无模型深度强化学习（DRL）框架。该框架利用可微可行性层将连续动作映射为可行解，奖励则是期望吞吐量的蒙特卡洛估计。我们用不使用目标网络的深度确定性策略梯度（DDPG）和双延迟深度确定性策略梯度（TD3）的带约束变体来实例化该框架。仿真表明，在严重抖动和低CSI质量下，所提算法比传统的基于交替优化的加权最小均方误差（AO-WMMSE）基线获得更高的吞吐量。在不同场景下，所提方法的性能与基于样本平均近似（SAA）的AO-WMMSE基准相当或略低，相对差距为0-12%。此外，所提DRL控制器的在线推理时间为每次决策0.6毫秒，而AO-WMMSE求解器约为370-550毫秒。
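
A feasibility layer of the kind mentioned here maps unconstrained actor outputs to values that satisfy the unit-modulus and transmit-power constraints. The sketch below shows only a forward pass in plain Python. The specific mappings (a `tanh`-squashed phase and a norm rescaling) are assumptions, and in practice the layer would be written in an autodiff framework so gradients can flow through it.

```python
import cmath
import math


def ris_phases(raw):
    """Map unconstrained reals to unit-modulus RIS coefficients e^{j*theta}.

    theta = pi * tanh(a) lies in (-pi, pi), so |coefficient| = 1 by construction.
    """
    return [cmath.exp(1j * math.pi * math.tanh(a)) for a in raw]


def scale_to_power(beam, p_max):
    """Rescale beamforming entries so that sum(|w|^2) does not exceed p_max."""
    power = sum(abs(w) ** 2 for w in beam)
    if power <= p_max:
        return list(beam)  # already feasible, leave untouched
    gain = math.sqrt(p_max / power)
    return [gain * w for w in beam]
```

Because feasibility holds by construction, the learner never has to be penalized for constraint violations; the reward can be the throughput estimate alone.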

Iterative Deployment Improves Planning Skills in LLMs

迭代部署提升大型语言模型的规划能力

Authors: Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.24940
Pdf link: https://arxiv.org/pdf/2512.24940
Abstract We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
中文摘要 我们表明，大型语言模型（LLM）的迭代部署（每个模型都在用户从上一代模型的部署中精心筛选的数据上进行微调）能够显著改变最终模型的属性。通过在多个规划领域测试这一机制，我们观察到规划能力的显著提升，后续模型能够发现比初始模型长得多的规划，展现出涌现的泛化能力。随后，我们给出理论分析，表明迭代部署实际上在外循环中（即并非作为有意的模型训练的一部分）实现了带有隐式奖励函数的强化学习（RL）训练。与RL的这一联系有两个重要含义：首先，对AI安全领域而言，反复部署所隐含的奖励函数没有被显式定义，可能对未来模型部署的属性产生意想不到的影响。其次，这里强调的机制可以被视为显式RL之外的一种替代训练方式，它依赖数据筛选而非显式奖励。

MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control

MSACL：带李雅普诺夫证书的多步演员-评论家学习，用于指数稳定控制

Authors: Yongwei Zhang, Yuanzhe Xing, Quan Quan, Zhikun She
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.24955
Pdf link: https://arxiv.org/pdf/2512.24955
Abstract Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $\lambda$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
中文摘要 在无模型强化学习（RL）中实现可证明的稳定性仍是一大挑战，尤其是在平衡探索与严格安全性方面。本文介绍了MSACL，这是一个通过多步李雅普诺夫证书学习，将指数稳定性理论与最大熵强化学习相结合的框架。与依赖复杂奖励工程的方法不同，MSACL利用离策略（off-policy）多步数据学习满足理论稳定性条件的李雅普诺夫证书。通过引入指数稳定性标签（ESL）和$\lambda$加权聚合机制，该框架有效平衡了多步学习中的偏差-方差权衡。策略优化由稳定性感知的优势函数引导，确保所学策略促进李雅普诺夫值的快速下降。我们在六个基准上评估了MSACL，包括镇定和非线性跟踪任务，证明其优于最先进的基于李雅普诺夫的强化学习算法。MSACL在简单奖励下实现了指数稳定和快速收敛，同时对不确定性表现出显著的鲁棒性，并能泛化到未见过的轨迹。敏感性分析表明，多步视野$n=20$可作为适用于多种系统的稳健默认值。通过将李雅普诺夫理论与离策略演员-评论家框架联系起来，MSACL为可验证安全的基于学习的控制奠定了基础。源代码和基准环境将公开发布。
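
For orientation, the textbook discrete-time conditions under which a Lyapunov certificate implies exponential stability are written out below. These are standard results, not the paper's exact conditions or its ESL construction; the constants $c_1$, $c_2$, $\alpha$ are generic symbols introduced here for illustration.

```latex
% Quadratic bounds and one-step decrease (assumed form, with 0 < \alpha < 1):
c_1 \lVert x \rVert^2 \le V(x) \le c_2 \lVert x \rVert^2,
\qquad
V(x_{t+1}) \le (1-\alpha)\, V(x_t).

% Applying the decrease n times gives a multi-step condition:
V(x_{t+n}) \le (1-\alpha)^{n}\, V(x_t).

% Combining with the quadratic bounds yields exponential convergence:
c_1 \lVert x_t \rVert^2 \le V(x_t) \le (1-\alpha)^{t} V(x_0)
  \le (1-\alpha)^{t} c_2 \lVert x_0 \rVert^2
\;\Longrightarrow\;
\lVert x_t \rVert \le \sqrt{c_2 / c_1}\,(1-\alpha)^{t/2}\, \lVert x_0 \rVert.
```

The multi-step inequality is what makes $n$-step data useful: a certificate can be checked against states $n$ steps apart rather than only against consecutive ones.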

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

ResponseRank：通过偏好强度学习实现数据高效奖励建模

Authors: Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.25023
Pdf link: https://arxiv.org/pdf/2512.25023
Abstract Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
中文摘要 基于人类反馈的强化学习（RLHF）中常用的二元选择只能传达偏好的方向。一个人可能选择苹果而非橙子、选择香蕉而非葡萄，但哪种偏好更强？偏好强度对于不确定性下的决策和偏好模型的泛化至关重要，但很难可靠地测量。响应时间和标注者间一致性等元数据可以作为强度的代理，但往往带有噪声且受到混杂因素影响。我们提出ResponseRank，以应对从带噪强度信号中学习的挑战。我们的方法利用代理信号的相对差异，按照推断出的偏好强度对成对比较的响应进行排序。为了控制系统性差异，我们仅在精心构建的分层内部对信号进行局部比较。这样可以在对强度信号做出最少假设的同时，稳健地学习与强度排序一致的效用差异。我们的贡献有三方面：（1）ResponseRank，一种通过利用局部有效的相对强度信号来稳健学习偏好强度的新方法；（2）在多种任务上样本效率和鲁棒性得到提升的实证证据：合成偏好学习（使用模拟响应时间）、语言建模（使用标注者一致性）和RL控制任务（使用模拟的回合回报）；以及（3）皮尔逊距离相关（PDC），一种将基数效用学习与序数准确性区分开来的新指标。
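
The idea of comparing proxy signals only locally within strata can be sketched as a grouping-and-sorting step. The sketch covers just the ranking construction, not the learning objective. The choice of stratum (for example, one per annotator) and the assumption that a shorter response time indicates a stronger preference are illustrative, not taken from the paper.

```python
from collections import defaultdict


def rank_within_strata(comparisons, stronger_is_lower=True):
    """Order pairwise comparisons by inferred preference strength, per stratum.

    comparisons: iterable of (comparison_id, stratum, proxy_signal).
    Returns {stratum: [comparison_id, ...]} from strongest to weakest.
    Signals are never compared across strata.
    """
    strata = defaultdict(list)
    for comparison_id, stratum, signal in comparisons:
        strata[stratum].append((signal, comparison_id))
    return {
        stratum: [cid for _, cid in sorted(items, reverse=not stronger_is_lower)]
        for stratum, items in strata.items()
    }
```

Ranking inside each stratum means a slow annotator's 5-second answer is never judged against a fast annotator's 2-second answer, which is the kind of systematic variation the stratification is meant to control.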

Many Minds from One Model: Bayesian Transformers for Population Intelligence

一个模型，多个心智：用于群体智能的贝叶斯Transformer

Authors: Diji Yang, Yi Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.25063
Pdf link: https://arxiv.org/pdf/2512.25063
Abstract Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.
中文摘要 尽管现代Transformer规模庞大且十分成功，但它们几乎都被训练为“单一心智”的系统：优化得到一组确定性参数，代表关于数据的单一函数假设。受“智能源自众多心智”这一理念的启发，我们提出了群体贝叶斯Transformer（B-Trans），它将标准的大型语言模型转换为贝叶斯Transformer模型，支持从同一组预训练权重中采样多样且连贯的模型实例。B-Trans将归一化层中类似偏置的偏移量视为带有高斯变分近似的随机变量，从而引入一个受贝叶斯启发的后验代理，在无需训练完整贝叶斯神经网络的情况下诱导出模型行为上的分布。从该代理中采样可得到一组行为多样且保持一般能力的模型实例。为了保持每次生成内部的连贯性，我们在序列层面冻结采样得到的噪声，强制各词元之间的时间一致性。B-Trans支持群体层面的决策，汇总所采样个体的预测可显著增强探索。零样本生成、可验证奖励强化学习（RLVR）以及无显式标签的强化学习实验表明，B-Trans有效利用了群体智慧，与确定性基线相比，在获得更优语义多样性的同时取得了更好的任务表现。
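
The sequence-level noise freezing can be illustrated with a toy layer normalization whose bias is sampled once per sequence and reused for every token. This is a pure-Python sketch under assumptions: the names, the isotropic noise scale `sigma`, and the use of a seed to identify an "individual" are illustrative and not the paper's parameterization.

```python
import math
import random


def layer_norm(x, gain, bias, eps=1e-5):
    """Standard layer normalization of one token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def sample_bias(mean_bias, sigma, rng):
    """Draw one model 'individual': bias = mean + Gaussian noise."""
    return [m + rng.gauss(0.0, sigma) for m in mean_bias]


def run_sequence(tokens, gain, mean_bias, sigma, seed):
    """Sample the bias once, then reuse it for every token in the sequence."""
    rng = random.Random(seed)
    bias = sample_bias(mean_bias, sigma, rng)  # frozen for the whole sequence
    return [layer_norm(t, gain, bias) for t in tokens]
```

Calling `run_sequence` with different seeds yields different individuals from the same weights, while each individual behaves consistently across the tokens of one sequence.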

Scaling Open-Ended Reasoning to Predict the Future

扩展开放式推理以预测未来

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.25070
Pdf link: https://arxiv.org/pdf/2512.25070
Abstract High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find that calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
中文摘要 高风险决策需要在未来不确定的情况下进行推理。在本研究中，我们训练语言模型对开放式预测问题做出预测。为了扩大训练数据规模，我们采用全自动且精心设计的筛选流程，从每日新闻报道的全球事件中合成新的预测问题。我们在自建数据集OpenForesight上训练Qwen3思考模型。为了防止训练和评估期间未来信息泄露，我们在预测系统的数据生成和检索中都使用离线新闻语料库。在一个小型验证集的指导下，我们展示了检索的益处，以及一种用于强化学习（RL）的改进奖励函数。在得到最终的预测系统后，我们在2025年5月至8月期间进行了留出测试。我们的专用模型OpenForecaster 8B可与规模大得多的专有模型相媲美，我们的训练提升了预测的准确性、校准度和一致性。我们发现，预测训练带来的校准改进可以泛化到多个流行基准上。我们开源了所有模型、代码和数据，使语言模型预测研究能够被广泛获取。
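
Leakage prevention with an offline corpus comes down to a date cutoff at retrieval time. A minimal sketch follows, assuming each article is a dictionary with a `published` date; the field name and the strict-inequality cutoff are assumptions, not details from the paper.

```python
from datetime import date


def retrievable(articles, question_date):
    """Keep only articles published strictly before the question date.

    This prevents the forecaster from retrieving reports written after
    (or on) the day the question was posed.
    """
    return [a for a in articles if a["published"] < question_date]
```

The same filter would apply both when generating questions and when retrieving context, so training and evaluation see a consistent information horizon.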

Keyword: diffusion policy

There is no result