Arxiv Papers of Today

Generated: 2026-03-27 16:56:12 (UTC+8); arXiv release time: 2026-03-27 20:00 EDT (2026-03-28 08:00 UTC+8)

31 relevant papers today

Keyword: reinforcement learning

Dual-Graph Multi-Agent Reinforcement Learning for Handover Optimization

Authors: Matteo Salvatori, Filippo Vannella, Sebastian Macaluso, Stylianos E. Trevlakis, Carlos Segura Perales, José Suarez-Varela, Alexandros-Apostolos A. Boulogeorgos, Ioannis Arapakis
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.24634
Pdf link: https://arxiv.org/pdf/2603.24634
Abstract HandOver (HO) control in cellular networks is governed by a set of HO control parameters that are traditionally configured through rule-based heuristics. A key parameter for HO optimization is the Cell Individual Offset (CIO), defined for each pair of neighboring cells and used to bias HO triggering decisions. At network scale, tuning CIOs becomes a tightly coupled problem: small changes can redirect mobility flows across multiple neighbors, and static rules often degrade under non-stationary traffic and mobility. We exploit the pairwise structure of CIOs by formulating HO optimization as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) on the network's dual graph. In this representation, each agent controls a neighbor-pair CIO and observes Key Performance Indicators (KPIs) aggregated over its local dual-graph neighborhood, enabling scalable decentralized decisions while preserving graph locality. Building on this formulation, we propose TD3-D-MA, a discrete Multi-Agent Reinforcement Learning (MARL) variant of the TD3 algorithm with a shared-parameter Graph Neural Network (GNN) actor operating on the dual graph and region-wise double critics for training, improving credit assignment in dense deployments. We evaluate TD3-D-MA in an ns-3 system-level simulator configured with real-world network operator parameters across heterogeneous traffic regimes and network topologies. Results show that TD3-D-MA improves network throughput over standard HO heuristics and centralized RL baselines, and generalizes robustly under topology and traffic shifts.
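The dual-graph idea in this abstract can be pictured with a small sketch. It assumes, as a simplification the abstract does not confirm, that the dual graph is the line graph of the cell-adjacency graph: each neighbor pair becomes one node (one agent controlling one CIO), and two pairs are linked when they share a cell. The function name `line_graph` and the cell labels are hypothetical.

```python
from itertools import combinations


def line_graph(edges):
    """Build a line graph: one node per edge, linked when two edges share an endpoint."""
    nodes = [tuple(sorted(e)) for e in edges]
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if set(a) & set(b):  # the two neighbor pairs share a cell
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical chain of four cells: A-B, B-C, C-D are neighbor pairs.
neighbor_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
dual = line_graph(neighbor_pairs)
```

In this toy layout the agent for pair (B, C) would aggregate KPIs from the agents for (A, B) and (C, D), which is one way to read "local dual-graph neighborhood".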

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Authors: Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.24709
Pdf link: https://arxiv.org/pdf/2603.24709
Abstract Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
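A graduated reward of the kind this abstract describes might look roughly like the sketch below. Everything specific here is an assumption for illustration, not the paper's design: the three granularity levels (name, argument keys, argument values), the use of a longest common subsequence over tool names as the orchestration term, the equal weighting, and the tool names `search_hotels` and `book_hotel`.

```python
def atomic_score(pred, gold):
    """Score one call at increasing granularity: name, argument keys, argument values."""
    if pred["name"] != gold["name"]:
        return 0.0
    score = 1 / 3  # correct function name
    if set(pred["args"]) == set(gold["args"]):
        score += 1 / 3  # correct argument keys
        if pred["args"] == gold["args"]:
            score += 1 / 3  # correct argument values
    return score


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def graduated_reward(pred_calls, gold_calls, w_atomic=0.5):
    """Blend per-call validity with an ordering score (hypothetical weighting)."""
    if not gold_calls:
        return 0.0
    atomic = sum(atomic_score(p, g) for p, g in zip(pred_calls, gold_calls)) / len(gold_calls)
    order = lcs_length([p["name"] for p in pred_calls],
                       [g["name"] for g in gold_calls]) / len(gold_calls)
    return w_atomic * atomic + (1 - w_atomic) * order


gold = [{"name": "search_hotels", "args": {"city": "Paris"}},
        {"name": "book_hotel", "args": {"hotel_id": 42}}]
pred = [{"name": "search_hotels", "args": {"city": "Paris"}},
        {"name": "book_hotel", "args": {"hotel_id": 7}}]
reward = graduated_reward(pred, gold)
```

A binary reward would give this trace zero because one parameter value is wrong, whereas the sketch still credits the correct tool sequence and the partially correct second call. The ordering term here ignores whether intermediate outputs were propagated correctly, which a fuller version would need to check.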

Decentralized Task Scheduling in Distributed Systems: A Deep Reinforcement Learning Approach

Authors: Daniel Benniah John
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.24738
Pdf link: https://arxiv.org/pdf/2603.24738
Abstract Efficient task scheduling in large-scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality-of-service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi-agent deep reinforcement learning (DRL-MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and develop a lightweight actor-critic architecture implemented using only NumPy, enabling deployment on resource-constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100-node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate a 15.6% improvement in average task completion time (30.8s vs 36.5s for the random baseline), a 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility (link on the arXiv abstract page).
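The percentages quoted in this abstract can be checked from the raw figures it gives. The sketch below assumes the gains are relative reductions against the baseline value; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


time_gain = relative_reduction(36.5, 30.8)      # task completion time, seconds
energy_gain = relative_reduction(878.3, 745.2)  # energy, kWh
sla_gap = 82.3 - 75.5                           # SLA satisfaction, percentage points
```

Under that assumption the two relative gains round to the reported 15.6% and 15.2%, and the SLA difference is 6.8 percentage points.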

Trust as Monitoring: Evolutionary Dynamics of User Trust and AI Developer Behaviour

Authors: Adeela Bashir, Zhao Song, Ndidi Bianca Ogbo, Nataliya Balabanova, Martin Smit, Chin-wing Leung, Paolo Bova, Manuel Chica Serrano, Dhanushka Dissanayake, Manh Hong Duong, Elias Fernandez Domingos, Nikita Huber-Kralj, Marcus Krellner, Andrew Powell, Stefan Sarkadi, Fernando P. Santos, Zia Ush Shamszaman, Chaimaa Tarzi, Paolo Turrini, Grace Ibukunoluwa Ufeoshi, Victor A. Vargas-Perez, Alessandro Di Stefano, Simon T. Powers, The Anh Han
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
Arxiv link: https://arxiv.org/abs/2603.24742
Pdf link: https://arxiv.org/pdf/2603.24742
Abstract AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users' trust as a one-shot adoption choice rather than as a dynamic, evolving process shaped by repeated interactions. We instead model trust as reduced monitoring in a repeated, asymmetric interaction between users and AI developers, where checking AI behaviour is costly. Using evolutionary game theory, we study how user trust strategies and developer choices between safe (compliant) and unsafe (non-compliant) AI co-evolve under different levels of monitoring cost and institutional regimes. We complement the infinite-population replicator analysis with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations. Across these approaches, we find three robust long-run regimes: no adoption with unsafe development, unsafe but widely adopted systems, and safe systems that are widely adopted. Only the last is desirable, and it arises when penalties for unsafe behaviour exceed the extra cost of safety and users can still afford to monitor at least occasionally. Our results formally support governance proposals that emphasise transparency, low-cost monitoring, and meaningful sanctions, and they show that neither regulation alone nor blind user trust is sufficient to prevent evolutionary drift towards unsafe or low-adoption outcomes.
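The condition for the desirable regime can be illustrated with a toy replicator equation. This is a deliberately reduced sketch, not the paper's model: it tracks only the share x of developers building safe AI, holds the users' monitoring probability m fixed instead of letting it co-evolve, and assumes payoffs b - c for safe and b - m*p for unsafe development (benefit b, safety cost c, penalty p), which gives dx/dt = x(1 - x)(m*p - c). All names and numbers are hypothetical.

```python
def safe_share(monitor_prob, penalty, safety_cost, x0=0.5, dt=0.1, steps=2000):
    """Euler-integrate dx/dt = x(1-x)(m*p - c) for the share of safe developers."""
    x = x0
    for _ in range(steps):
        advantage = monitor_prob * penalty - safety_cost  # payoff(safe) - payoff(unsafe)
        x += dt * x * (1.0 - x) * advantage
    return x


# Expected penalty 0.3 * 2.0 = 0.6 exceeds the safety cost 0.5: safe development spreads.
occasional_monitoring = safe_share(monitor_prob=0.3, penalty=2.0, safety_cost=0.5)
# Expected penalty 0.1 * 2.0 = 0.2 falls below the safety cost: unsafe development takes over.
rare_monitoring = safe_share(monitor_prob=0.1, penalty=2.0, safety_cost=0.5)
```

In this toy version, sanctions only matter in proportion to how often users check, which mirrors the abstract's point that meaningful penalties and affordable monitoring are both needed.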

Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Authors: Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.24840
Pdf link: https://arxiv.org/pdf/2603.24840
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available via the link on the arXiv abstract page.
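The "weak learning signal" point follows directly from how group-relative advantages are normalized. The sketch below uses the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name and the small `eps` stabilizer are assumptions for illustration, and the code does not reproduce arrol's pruning or quality head.

```python
def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])  # no within-group variance
balanced = group_advantages([1.0, 0.0, 1.0, 0.0])     # half correct, half incorrect
```

An all-correct (or all-incorrect) group yields zero advantage for every rollout and therefore no gradient, while a balanced group yields advantages close to +1 and -1. That is the motivation for keeping the surviving rollouts correctness-balanced.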

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Authors: Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.24844
Pdf link: https://arxiv.org/pdf/2603.24844
Abstract Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found via the link on the arXiv abstract page.

Gaze patterns predict preference and confidence in pairwise AI image evaluation

Authors: Nikolas Papadopoulos, Shreenithi Navaneethan, Sheng Bai, Ankur Samanta, Paul Sajda
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2603.24849
Pdf link: https://arxiv.org/pdf/2603.24849
Abstract Preference learning methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on pairwise human judgments, yet little is known about the cognitive processes underlying these judgments. We investigate whether eye-tracking can reveal preference formation during pairwise AI-generated image evaluation. Thirty participants completed 1,800 trials while their gaze was recorded. We replicated the gaze cascade effect, with gaze shifting toward chosen images approximately one second before the decision. Cascade dynamics were consistent across confidence levels. Gaze features predicted binary choice (68% accuracy), with chosen images receiving more dwell time, fixations, and revisits. Gaze transitions distinguished high-confidence from uncertain decisions (66% accuracy), with low-confidence trials showing more image switches per second. These results show that gaze patterns predict both choice and confidence in pairwise image evaluations, suggesting that eye-tracking provides implicit signals relevant to the quality of preference annotations.
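The gaze features named in this abstract (dwell time, fixation counts, image switches per second) can be computed from a fixation sequence along the following lines. The input format, a time-ordered list of `(side, duration_seconds)` tuples, and all names are assumptions for illustration. The study's actual feature definitions and classifier are not given in the abstract.

```python
def gaze_features(fixations, trial_seconds):
    """Summarize a trial's fixations on the left and right image."""
    dwell = {"left": 0.0, "right": 0.0}
    count = {"left": 0, "right": 0}
    switches = 0
    previous = None
    for side, duration in fixations:
        dwell[side] += duration
        count[side] += 1
        if previous is not None and side != previous:
            switches += 1  # gaze moved from one image to the other
        previous = side
    return {
        "dwell_diff": dwell["left"] - dwell["right"],
        "fixation_diff": count["left"] - count["right"],
        "switch_rate": switches / trial_seconds,
    }


# Hypothetical 2-second trial with three fixations.
features = gaze_features([("left", 0.4), ("right", 0.3), ("left", 0.8)], trial_seconds=2.0)
```

Per the abstract's findings, positive dwell and fixation differences would point toward the left image being chosen, and a higher switch rate would point toward a low-confidence decision.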

Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization

Authors: Kalle Kujanpää, Yuying Zhu, Kristina Klinkner, Shervin Malmasi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.24883
Pdf link: https://arxiv.org/pdf/2603.24883
Abstract We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions. These are a natural fit for decisions that warehouse managers make using high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences achieves performance that matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making. Offline RL excels with task-specific architectures. LLMs support human-readable inputs and can be combined with an iterative feedback loop that can incorporate manager preferences.

COIN: Collaborative Interaction-Aware Multi-Agent Reinforcement Learning for Self-Driving Systems

Authors: Yifeng Zhang, Jieming Chen, Tingguang Zhou, Tanishq Duhan, Jianghong Dong, Yuhong Cao, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24931
Pdf link: https://arxiv.org/pdf/2603.24931
Abstract Multi-Agent Self-Driving (MASD) systems provide an effective solution for coordinating autonomous vehicles to reduce congestion and enhance both safety and operational efficiency in future intelligent transportation systems. Multi-Agent Reinforcement Learning (MARL) has emerged as a promising approach for developing advanced end-to-end MASD systems. However, achieving efficient and safe collaboration in dynamic MASD systems remains a significant challenge in dense scenarios with complex agent interactions. To address this challenge, we propose a novel collaborative (CO-) interaction-aware (-IN) MARL framework, named COIN. Specifically, we develop a new counterfactual individual-global twin delayed deep deterministic policy gradient (CIG-TD3) algorithm, crafted in a "centralized training, decentralized execution" (CTDE) manner, which aims to jointly optimize the individual objectives (navigation) and the global objectives (collaboration) of agents. We further introduce a dual-level interaction-aware centralized critic architecture that captures both local pairwise interactions and global system-level dependencies, enabling more accurate global value estimation and improved credit assignment for collaborative policy learning. We conduct extensive simulation experiments in dense urban traffic environments, which demonstrate that COIN consistently outperforms other advanced baseline methods in both safety and efficiency across various system sizes. These results highlight its superiority in complex and dynamic MASD scenarios, as further validated through real-world robot demonstrations. Supplementary videos are available via the link on the arXiv abstract page.

Unbiased Multimodal Reranking for Long-Tail Short-Video Search

Authors: Wenyi Xu, Feiran Zhu, Songyang Li, Renzhe Zhou, Chao Zhang, Chenglei Dai, Yuren Mao, Yunjun Gao, Yi Zhang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.24975
Pdf link: https://arxiv.org/pdf/2603.24975
Abstract With Kuaishou serving hundreds of millions of searches daily, the quality of short-video search is paramount. However, it suffers from a severe Matthew effect on long-tail queries: sparse user behavior data causes models to amplify low-quality content such as clickbait and shallow content. The recent advancements in Large Language Models (LLMs) offer a new paradigm, as their inherent world knowledge provides a powerful mechanism to assess content quality, agnostic to sparse user interactions. To this end, we propose an LLM-driven multimodal reranking framework, which estimates user experience without real user behavior. The approach involves a two-stage training process: the first stage uses multimodal evidence to construct high-quality annotations for supervised fine-tuning, while the second stage incorporates pairwise preference optimization to help the model learn partial orderings among candidates. At inference time, the resulting experience scores are used to promote high-quality but underexposed videos in reranking, and further guide page-level optimization through reinforcement learning. Experiments show that the proposed method achieves consistent improvements over strong baselines in offline metrics including AUC, NDCG@K, and human preference judgement. An online A/B test covering 15% of traffic further demonstrates gains in both user experience and consumption metrics, confirming the practical value of the approach in long-tail video search scenarios.

MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

Authors: Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.24984
Pdf link: https://arxiv.org/pdf/2603.24984
Abstract Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
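The contrast between deterministic top-K routing and exploratory routing can be sketched as follows. This is an illustrative assumption about what "exploration" over experts could look like (sampling K distinct experts in proportion to softmax router probabilities). It is not MoE-GRPO's actual procedure, and it omits the GRPO update and the modality-aware guidance. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Deterministic routing: always pick the k highest-scoring experts."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def sampled_route(logits, k, rng):
    """Stochastic routing: draw k distinct experts proportionally to their probabilities."""
    probs = softmax(logits)
    remaining = list(range(len(logits)))
    chosen = []
    for _ in range(k):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


router_logits = [2.0, 0.1, 1.5, -1.0]  # hypothetical scores for four experts
fixed = top_k_route(router_logits, k=2)
explored = sampled_route(router_logits, k=2, rng=random.Random(0))
```

The deterministic route returns the same two experts for these logits every time, whereas the sampled route can occasionally select lower-ranked experts, giving a policy-gradient method alternative expert combinations to compare.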

Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

Authors: Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24989
Pdf link: https://arxiv.org/pdf/2603.24989
Abstract Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
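The notion of "motion token uncertainty" can be made concrete with Shannon entropy over the next-token distribution. The sketch below only shows how high-entropy steps might be identified; the fixed threshold and the function names are assumptions, and R1Sim's adaptive sampling mechanism is not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs, threshold):
    """Indices of steps whose next-token distribution is more uncertain than `threshold`."""
    return [t for t, probs in enumerate(step_probs) if token_entropy(probs) > threshold]


# Hypothetical distributions over four motion tokens at three time steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],  # moderately uncertain
]
picked = high_entropy_positions(steps, threshold=1.0)
```

With this threshold only the uniform step is flagged. An entropy-guided sampler would concentrate exploration on such steps rather than on steps where the model is already confident.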

Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

Authors: WenXi Wang, JunQi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25000
Pdf link: https://arxiv.org/pdf/2603.25000
Abstract Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtain optimal solutions but are only practical for small-scale scenarios. The latter learns implicitly through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome the above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We proved that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles' safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making and less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Authors: Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25021
Pdf link: https://arxiv.org/pdf/2603.25021
Abstract Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel method that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.
中文摘要 现有的多模态大型语言模型（MLLM）常常在长视频理解（LVU）中出现幻觉，主要原因是文本符号和视觉符号之间的不平衡。观察到MLLM对短视觉输入的处理能力良好，近期LVU研究通过自动将庞大的视觉数据解析为可管理的片段，从而有效处理幻觉。基于SFT的工具调用方法可以达到这一目的，但通常需要大量细粒度、高质量的数据，且工具调用路径受限。我们提出了一种新型视频TIR，利用强化学习（RL）鼓励正确使用全面的多层次工具包，实现高效的长视频理解。VideoTIR探索了零实时光和SFT冷启动技术，使多层次远程模型能够检索并聚焦有意义的视频片段/图像/区域，从而准确高效地提升长视频理解。为减少重复调用工具，我们提出了工具包行动分组策略优化（TAGPO），通过逐步分配奖励和重用失败的推广，提升调用流程的效率。此外，我们还开发了一个基于沙盒的轨迹综合框架，以生成高质量的轨迹数据。在三个长视频质量保证基准测试上的广泛实验展示了我们方法的有效性和效率。

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Intern-S1-Pro：万亿尺度的科学多模态基础模型

Authors: Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25040
Pdf link: https://arxiv.org/pdf/2603.25040
Abstract We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
中文摘要 我们介绍Intern-S1-Pro，这是首个一万亿参数的科学多模态基础模型。该模型规模达到前所未有的规模，在通用和科学领域均实现全面增强。除了更强的推理和图像-文本理解能力外，其智能还辅以高级代理能力。与此同时，其科学专长大幅扩展，掌握了包括化学、材料、生命科学和地球科学在内的100多个关键科学领域的专业任务。实现如此大规模的实现得益于XTuner和LMDeploy的强大基础设施支持，这些基础设施在1万亿参数层面实现了高效的强化学习（RL）训练，同时确保训练与推理之间的严格精度一致性。通过无缝整合这些进步，Intern-S1-Pro 进一步强化了通用与专业智能的融合，作为一个专业通用专家，展示了其在通用能力开源模型中顶尖的地位，同时在专业科学任务深度上优于专有模型。

Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

桥接感知与推理：多模态大语言模型中面向RLVR的词元重加权

Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, Xiangnan He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25077
Pdf link: https://arxiv.org/pdf/2603.25077
Abstract Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities -- visual grounding and symbolic reasoning -- making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
中文摘要 将可验证奖励强化学习（RLVR）扩展到多模态大语言模型（MLLM）面临一个根本挑战：模型的响应本质上交错包含两类词元，即用于锚定视觉内容的感知相关词元和用于构建推理链的推理相关词元。这两类词元体现了不同但相互依赖的能力（视觉定位与符号推理），因此孤立地优化其中一类是不够的。通过词元级的实证分析，我们证明仅优化感知词元或仅优化推理词元的效果始终不及对全部词元进行优化，凸显了二者固有的耦合性。为此，我们提出了一种即插即用的词元重加权（ToR）策略，通过识别两类关键词元并在RLVR训练过程中对其动态重加权，显式建模这种相互依赖关系。将ToR应用于现有方法（如GRPO和DAPO）之上，可在多个多模态推理基准上带来一致的性能提升，在兼具准确视觉定位和连贯推理的同时达到最先进的性能。
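The ToR abstract above describes reweighting critical tokens during RLVR training but does not give the weighting rule. The following is only a minimal sketch of the general idea, assuming the critical tokens have already been identified and that each critical token receives a fixed multiplier; `reweighted_loss`, `critical_mask`, and `boost` are hypothetical names, not the authors' implementation.

```python
def reweighted_loss(token_losses, critical_mask, boost=2.0):
    """Weighted mean of per-token losses.

    token_losses: per-token loss values for one response
    critical_mask: True where a token is treated as critical (assumed given)
    boost: hypothetical weight applied to critical tokens; others get 1.0
    """
    if len(token_losses) != len(critical_mask):
        raise ValueError("token_losses and critical_mask must have equal length")
    weights = [boost if is_critical else 1.0 for is_critical in critical_mask]
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / sum(weights)
```

With no critical tokens the function reduces to a plain mean, so the reweighting only changes the objective where the mask is set.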

MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

MSRL：通过多阶段强化学习扩展生成多模态奖励建模

Authors: Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25108
Pdf link: https://arxiv.org/pdf/2603.25108
Abstract Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: this https URL.
中文摘要 多模态奖励建模的最新进展主要由从判别式方法转向生成式方法的范式转变推动。基于这一进展，近期研究进一步利用可验证奖励强化学习（RLVR）来增强多模奖励模型（MRMs）。尽管取得了成功，基于RLVR的训练通常依赖于标记的多模态偏好数据，获取成本高且劳动密集，难以扩展MRM训练。为克服这一限制，我们提出了一种多阶段强化学习（MSRL）方法，能够实现对多模态数据有限的MRMs的可扩展强化学习。MSRL取代了传统的基于RLVR的训练范式，首先从大规模文本偏好数据中学习可推广的奖励推理能力，然后通过基于字幕和全多模态的强化学习阶段逐步将该能力转移到多模态任务中。此外，我们引入了一种跨模态知识提炼方法，以提升MSRL中的偏好泛化能力。大量实验表明，MSRL有效扩展基于RLVR的生成MRM训练，并在视觉理解和视觉生成任务中显著提升其性能（例如VL-RewardBench上的66.6%提升至75.9%，GenAI-Bench的70.2%提升至75.7%），无需额外的多模态偏好注释。我们的代码可在以下 https URL 获取。

AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

AnyDoc：通过大规模HTML/CSS数据综合和高度感知强化优化提升文档生成

Authors: Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25118
Pdf link: https://arxiv.org/pdf/2603.25118
Abstract Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
中文摘要 文档生成在AI驱动的内容创作领域日益受到关注。在本项工作中，我们突破了其边界，推出了AnyDoc，这是一个能够处理跨广泛文档类别的多生成任务的框架，所有任务均以统一的HTML/CSS格式表示。为了克服现有人工文档数据集的有限覆盖和规模，AnyDoc 首先建立了可扩展的数据综合流水线，能够自动生成 HTML/CSS 格式的文档。该流程产生了DocHTML，一个包含265,206个文档样本的大型数据集，涵盖111个类别和32种不同风格。此外，所有文档都配备了全面的元数据，包括设计意图、HTML/CSS源代码、视觉素材和渲染截图。基于精心策划的数据集，AnyDoc 微调多模态大型语言模型（MLLM），以实现三项实用文档生成任务：意图文档生成、文档导出和元素文档生成。为了解决微调过程中观察到的内容溢出问题，AnyDoc进一步引入了高度感知强化学习（HARL）训练后流程。通过根据预测与目标文档高度的差值定义奖励函数，溢出在HARL期间受到惩罚并逐步缓解，从而提升整体表现。定性和定量实验表明，AnyDoc在这三种任务中都优于通用MLLm和任务特定基线。
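The AnyDoc abstract above states that the HARL reward is based on the difference between predicted and target document heights, with overflow penalized. The exact formula is not given, so the sketch below assumes a relative height error with a heavier penalty for overflow; `height_reward` and `overflow_weight` are hypothetical.

```python
def height_reward(pred_height: float, target_height: float,
                  overflow_weight: float = 2.0) -> float:
    """Hypothetical height-aware reward; 1.0 means a perfect height match."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    # Signed relative error: positive when the rendered page is taller than the target.
    rel_err = (pred_height - target_height) / target_height
    # Assumption: overflow is penalized more strongly than underflow.
    penalty = overflow_weight * rel_err if rel_err > 0 else -rel_err
    return 1.0 - penalty
```

Under these assumptions a page 20% too tall scores lower than one 20% too short, which reflects the stated goal of mitigating overflow.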

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

在移动的边缘训练：面向大型推理模型高效强化学习训练的在线验证提示选择

Authors: Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Li Qing, Ke Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.25184
Pdf link: https://arxiv.org/pdf/2603.25184
Abstract Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.
中文摘要 强化学习（RL）已成为大语言模型（LLM）在推理任务上进行后训练的关键手段。虽然增加rollout数量可以稳定训练并提升性能，但计算开销是一个关键问题。在GRPO等算法中，每个提示需要多次rollout，成本高昂，而大部分提示提供的梯度可以忽略不计，因此效用较低。为了解决这个问题，我们研究如何在rollout阶段之前选择高效用的提示。我们的实验分析显示，样本效用是非均匀且不断演变的：最强的学习信号集中在“学习边缘”，即中等难度与高不确定性的交汇处，并且这一边缘会随着训练的进行而移动。基于此，我们提出了HIVE（History-Informed and online-VErified prompt selection，基于历史信息并经在线验证的提示选择），这是一个面向数据高效强化学习的两阶段框架。HIVE利用历史奖励轨迹进行粗选，并以提示熵作为实时代理指标，剔除效用已经过时的样本。通过在多个数学推理基准和模型上评估HIVE，我们证明HIVE能够在不损失性能的前提下显著提升rollout效率。
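To illustrate why many prompts carry negligible gradient in GRPO-style training, and what a two-stage filter of the kind described in the HIVE abstract could look like, here is a minimal sketch. The group-normalized advantage is a common formulation; the selection thresholds, the `history` and `entropy` bookkeeping, and the function names are assumptions for illustration, not the HIVE algorithm itself.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of one prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def select_prompts(history, entropy, low=0.2, high=0.8, min_entropy=0.5):
    """Two-stage filter: history-based coarse pick, then an entropy check.

    history: prompt id -> list of past mean rewards (hypothetical bookkeeping)
    entropy: prompt id -> current prompt-entropy estimate (hypothetical proxy)
    """
    # Stage 1: keep prompts of intermediate difficulty according to past rewards.
    coarse = [p for p, past in history.items() if low <= mean(past) <= high]
    # Stage 2: drop prompts whose current uncertainty is too low to be useful.
    return [p for p in coarse if entropy.get(p, 0.0) >= min_entropy]
```

When every rollout of a prompt gets the same reward, all advantages are zero and the prompt contributes no learning signal, which is the waste that selecting prompts before the rollout phase aims to avoid.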

AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

AnyID：从任何视觉参考生成超保真通用身份视频

Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang, Caixia Yan, Bing Deng, Jieping Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25188
Pdf link: https://arxiv.org/pdf/2603.25188
Abstract Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.
中文摘要 保持身份的视频生成提供了强大的创意表达工具，允许用户自定义包含心爱角色的视频。然而，主流方法通常针对单一恒等引用设计和优化。这一潜在假设限制了创造灵活性，无法充分适应多样的现实输入格式。依赖单一来源也构成了一个不合适的情景，导致环境本身模糊，使模型难以在新颖语境中忠实地再现身份。为解决这些问题，我们介绍了AnyID，一个超保真身份保护视频生成框架，具有两个核心贡献。首先，我们引入了一种可扩展的全指称架构，有效统一异构身份输入（如人脸、肖像和视频）为一个连贯的表示。其次，我们提出了一种主引用生成范式，将一个引用指定为典范锚点，并采用一种新颖的差分提示符，实现精确的属性级可控性。我们在大规模、精心策划的数据集上进行训练，以确保数据集的鲁棒性和高保真度，然后通过强化学习进行最终的微调阶段。该过程利用由人工评估构建的偏好数据集，标注者基于两个关键标准：身份忠实度和提示可控性，对视频进行成对比较。大量评估验证了AnyID在不同任务设置下实现了超高身份忠实度和优异的属性级可控性。

Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem

神经组合优化中的离线决策变换器：超越旅行推销员问题的启发式方法

Authors: Hironori Ohigashi, Shinichiro Hamada
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.25241
Pdf link: https://arxiv.org/pdf/2603.25241
Abstract Combinatorial optimization problems like the Traveling Salesman Problem are critical in industry yet NP-hard. Neural Combinatorial Optimization has shown promise, but its reliance on online reinforcement learning (RL) hampers deployment and underutilizes decades of algorithmic knowledge. We address these limitations by applying the offline RL framework, Decision Transformer, to learn superior strategies directly from datasets of heuristic solutions; it aims not only to imitate these heuristics but also to synthesize and outperform them. Concretely, we (i) integrate a Pointer Network to handle the instance-dependent, variable action space of node selection, and (ii) employ expectile regression for optimistic conditioning of Return-to-Go, which is crucial for instances with widely varying optimal values. Experiments show that our method consistently produces higher-quality tours than the four classical heuristics it is trained on, demonstrating the potential of offline RL to unlock and exceed the performance embedded in existing domain knowledge.
中文摘要 旅行商问题等组合优化问题在工业中至关重要，但属于NP难问题。神经组合优化已展现出潜力，但其对在线强化学习（RL）的依赖阻碍了部署，也未能充分利用数十年积累的算法知识。为解决这些局限，我们应用离线强化学习框架Decision Transformer，直接从启发式解的数据集中学习更优的策略；其目标不仅是模仿这些启发式方法，还要综合并超越它们。具体来说，我们（i）集成指针网络（Pointer Network），以处理节点选择中依赖于实例的可变动作空间；（ii）采用期望分位数回归（expectile regression）对Return-to-Go进行乐观的条件设定，这对于最优值差异很大的实例至关重要。实验表明，我们的方法生成的路径质量始终高于其训练所用的四种经典启发式方法，展示了离线强化学习释放并超越现有领域知识中所蕴含性能的潜力。
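Expectile regression, which the abstract above uses for optimistic Return-to-Go conditioning, replaces the symmetric squared error with an asymmetric one. The sketch below shows the standard expectile loss and a small gradient-descent fit on scalar values; how the paper wires this into the Decision Transformer is not stated here, so `fit_expectile`, its step size, and its iteration count are illustrative assumptions.

```python
def expectile_loss(pred: float, target: float, tau: float = 0.9) -> float:
    """Asymmetric squared error; tau > 0.5 penalizes under-prediction more."""
    diff = target - pred
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff

def fit_expectile(values, tau=0.9, steps=2000, lr=0.05):
    """Estimate the tau-expectile of `values` by gradient descent on the mean loss."""
    estimate = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(steps):
        grad = 0.0
        for v in values:
            diff = v - estimate
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of weight * diff^2 w.r.t. estimate
        estimate -= lr * grad / len(values)
    return estimate
```

With `tau = 0.5` the fit returns the mean; raising `tau` pulls the estimate toward the upper end of the observed returns, which is one way to read "optimistic" conditioning.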

Macroscopic Characteristics of Mixed Traffic Flow with Deep Reinforcement Learning Based Automated and Human-Driven Vehicles

基于深度强化学习的自动化和人驾驶车辆混合交通流的宏观特性

Authors: Pankaj Kumar, Pranamesh Chakraborty, Subrahmanya Swamy Peruru
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.25328
Pdf link: https://arxiv.org/pdf/2603.25328
Abstract Automated Vehicle (AV) control in mixed traffic, where AVs coexist with human-driven vehicles, poses significant challenges in balancing safety, efficiency, comfort, fuel efficiency, and compliance with traffic rules while capturing heterogeneous driver behavior. Traditional car-following models, such as the Intelligent Driver Model (IDM), often struggle to generalize across diverse traffic scenarios and typically do not account for fuel efficiency, motivating the use of learning-based approaches. Although Deep Reinforcement Learning (DRL) has shown strong microscopic performance in car-following conditions, its macroscopic traffic flow characteristics remain underexplored. This study focuses on analyzing the macroscopic traffic flow characteristics and fuel efficiency of DRL-based models in mixed traffic. A Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is implemented for AVs' control and trained using the NGSIM highway dataset, enabling realistic interaction with human-driven vehicles. Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles. A macroscopic level comparison of fuel efficiency between the RL-based AV model and the IDM is also conducted. Results show that traffic performance is sensitive to the distribution of safe time gaps and the proportion of RL vehicles. Transitioning from fully human-driven to fully RL-controlled traffic can increase road capacity by approximately 7.52%. Further, RL-based AVs also improve average fuel efficiency by about 28.98% at higher speeds (above 50 km/h), and by 1.86% at lower speeds (below 50 km/h) compared to the IDM. Overall, the DRL framework enhances traffic capacity and fuel efficiency without compromising safety.
中文摘要 在自动驾驶车辆（AV）与人类驾驶车辆共存的混合交通中，AV控制需要在安全性、效率、舒适性、燃油效率和交通规则合规性之间取得平衡，同时还要刻画驾驶员行为的异质性，这带来了重大挑战。传统的跟驰模型，如智能驾驶员模型（IDM），往往难以泛化到多种交通场景，且通常不考虑燃油效率，这促使人们采用基于学习的方法。尽管深度强化学习（DRL）在跟驰条件下表现出强劲的微观性能，但其宏观交通流特性仍未被充分研究。本研究重点分析基于DRL的模型在混合交通中的宏观交通流特性和燃油效率。我们实现了双延迟深度确定性策略梯度（TD3）算法用于AV控制，并利用NGSIM高速公路数据集进行训练，从而实现与人类驾驶车辆的真实交互。在不同的驾驶员异质性、异质时间间隔渗透水平以及不同的RL控制车辆比例下，使用基本图（FD）评估交通性能。此外，还在宏观层面比较了基于RL的AV模型与IDM的燃油效率。结果显示，交通性能对安全时间间隔的分布和RL车辆的比例非常敏感。从完全由人类驾驶过渡到完全由RL控制的交通，道路通行能力可提升约7.52%。此外，与IDM相比，基于RL的AV在较高速度（50公里/小时以上）下平均燃油效率提升约28.98%，在较低速度（50公里/小时以下）下提升1.86%。总体而言，DRL框架在不牺牲安全性的前提下提升了通行能力和燃油效率。
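For readers unfamiliar with the IDM baseline named in the abstract above, the following sketch computes the standard Intelligent Driver Model acceleration. The parameter defaults are illustrative values only, not those used in the paper, and clamping the dynamic part of the desired gap at zero is a common implementation choice rather than part of the original formula.

```python
import math

def idm_acceleration(v, gap, dv, v0=33.3, T=1.5, s0=2.0,
                     a_max=1.0, b=2.0, delta=4.0):
    """IDM acceleration in m/s^2.

    v: own speed (m/s); gap: bumper-to-bumper distance to the leader (m);
    dv: approach rate, v minus leader speed (m/s).
    v0: desired speed; T: safe time gap; s0: minimum gap;
    a_max: maximum acceleration; b: comfortable deceleration.
    """
    # Desired gap grows with speed and with the rate of approach to the leader.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term minus interaction term.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

The safe time gap `T` enters the desired gap directly, which is why the abstract's finding that capacity depends on the distribution of safe time gaps is plausible at the level of this model.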

DRL-Based Spectrum Sharing for RIS-Aided Local High-Quality Wireless Networks

基于DRL的频谱共享，用于RIS辅助的本地高质量无线网络

Authors: Hamid Reza Hashempour, Mina Khadem, Eduard A. Jorswieck
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.25332
Pdf link: https://arxiv.org/pdf/2603.25332
Abstract This paper investigates a smart spectrum-sharing framework for reconfigurable intelligent surface (RIS)-aided local high-quality wireless networks (LHQWNs) within a mobile network operator (MNO) ecosystem. Although RISs are often considered potentially harmful due to interference, this work shows that properly controlled RISs can enhance the quality of service (QoS). The proposed system enables temporary spectrum access for multiple vertical service providers (VSPs) by dynamically allocating radio resources according to traffic demand. The spectrum is divided into dedicated subchannels assigned to individual VSPs and reusable subchannels shared among multiple VSPs, while RIS is employed to improve propagation conditions. We formulate a multi-VSP utility maximization problem that jointly optimizes subchannel assignment, transmit power, and RIS phase configuration while accounting for spectrum access costs, RIS leasing costs, and QoS constraints. The resulting mixed-integer non-linear program (MINLP) is intractable using conventional optimization methods. To address this challenge, the problem is modeled as a Markov decision process (MDP) and solved using deep reinforcement learning (DRL). Specifically, deep deterministic policy gradient (DDPG) and soft actor-critic (SAC) algorithms are developed and compared. Simulation results show that SAC outperforms DDPG in convergence speed, stability, and achievable utility, reaching up to 96% of the exhaustive search benchmark and demonstrating the potential of RIS to improve overall utility in multi-VSP scenarios.
中文摘要 本文探讨了一种智能频谱共享框架，用于移动网络运营商（MNO）生态系统内可重构智能表面（RIS）辅助的本地高质量无线网络（LHQWN）。尽管RIS常被认为可能因干扰而有害，但这项研究表明，妥善控制的RIS可以提升服务质量（QoS）。该系统通过根据流量需求动态分配无线资源，使多个垂直服务提供商（VSP）能够临时接入频谱。频谱被划分为分配给单个VSP的专用子信道和多个VSP共享的可重复使用子信道，同时采用RIS技术改善传播条件。我们提出了一个多VSP效用最大化问题，共同优化子信道分配、发射功率和RIS相位配置，同时考虑频谱接入成本、RIS租赁成本和服务质量约束。由此产生的混合整数非线性规划（MINLP）在传统优化方法下是难以处理的。为应对这一挑战，问题被建模为马尔可夫决策过程（MDP），并通过深度强化学习（DRL）进行求解。具体来说，开发并比较了深度确定性策略梯度（DDPG）和软演员-批判者（SAC）算法。模拟结果显示，SAC在收敛速度、稳定性和可实现效用方面均优于DDPG，达到穷尽搜索基准的96%，展示了RIS在多VSP场景下提升整体效用的潜力。

Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics

在移动机器人中将深度强化学习和贝叶斯推断应用于ObjectNav应用

Authors: João Castelo-Branco, José Santos-Victor, Alexandre Bernardino
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25366
Pdf link: https://arxiv.org/pdf/2603.25366
Abstract Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable object-search behavior under partial observability.
中文摘要 由于部分可观察性、感知不确定性以及探索与导航效率之间权衡，自主物体搜索对室内环境中的移动机器人来说具有挑战性。经典概率方法明确表示不确定性，但通常依赖手工设计的动作选择启发式，而深度强化学习则支持自适应策略，但通常存在收敛缓慢和解释性有限的问题。本文提出了一种混合对象-搜索框架，将贝叶斯推理与深度强化学习相结合。该方法维护目标位置的空间信念图，通过校准对象检测通过贝叶斯推断在线更新，并训练强化学习策略，直接从该概率表示中选择导航动作。该方法在使用Habitat 3.0的真实室内模拟中进行评估，并与已开发的基线策略进行比较。在两种室内环境中，该方法提高了成功率，同时减少了搜索工作量。总体而言，结果支持将贝叶斯信念估计与习得动作选择结合，以实现部分可观测性下更高效、更可靠的对象搜索行为的价值。
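The belief-map update mentioned in the abstract above can be illustrated with a small Bayes-rule sketch over grid cells. The observation model here (a single target, one detection rate `p_hit`, one false-alarm rate `p_false`) is a simplifying assumption for illustration, and `update_belief` is a hypothetical helper, not the paper's implementation.

```python
def update_belief(belief, observed, detected=None, p_hit=0.9, p_false=0.05):
    """Bayes update of a target-location belief.

    belief: dict cell -> prior probability (should sum to 1)
    observed: set of cells currently inside the sensor's field of view
    detected: cell where the detector fired, or None if nothing was detected
    """
    posterior = {}
    for cell, prior in belief.items():
        if detected is None:
            # No detection: visible cells become less likely to hold the target.
            likelihood = (1.0 - p_hit) if cell in observed else (1.0 - p_false)
        else:
            # Detection: the reported cell is favored, others only via false alarms.
            likelihood = p_hit if cell == detected else p_false
        posterior[cell] = likelihood * prior
    total = sum(posterior.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {cell: p / total for cell, p in posterior.items()}
```

A policy that reads this map as input, as the abstract describes, sees probability mass drain from searched areas and concentrate on unexplored or detected cells.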

TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

TAPO：多语言数学推理的翻译增强策略优化

Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.25419
Pdf link: https://arxiv.org/pdf/2603.25419
Abstract Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
中文摘要 大语言模型（LLM）在英语数学推理方面表现出色，但在多语言环境中仍存在显著的性能差距，这主要归因于语言理解能力的不足。为弥合这一差距，我们提出了翻译增强策略优化（TAPO），这是一种建立在GRPO之上的新型强化学习框架。TAPO采用显式对齐策略，让模型以英语为枢纽，遵循先理解后推理的范式。关键在于，我们采用了步骤级相对优势机制，将理解与推理解耦，从而能够引入翻译质量奖励而不造成优化冲突。大量实验表明，TAPO有效地协同了语言理解与推理能力，并兼容多种模型。它在多语言数学推理和翻译任务上均优于基线方法，同时能很好地泛化到未见过的语言和域外任务。

Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning

面向Sim2Real零样本强化学习的最大熵行为探索

Authors: Jiajun Hu, Nuria Armengol Urpi, Jin Cheng, Stelian Coros
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.25464
Pdf link: https://arxiv.org/pdf/2603.25464
Abstract Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study online zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it renders natural policies that can be seamlessly deployed to hardware without further finetuning. Videos and code are available on our website.
中文摘要 零样本强化学习（RL）算法旨在从无奖励数据集中学习一族策略，并在测试时直接恢复任意奖励函数下的最优策略。自然地，预训练数据集的质量决定了恢复出的策略在各任务上的表现。然而，在对感兴趣的下游任务没有先验了解的情况下，预先收集相关且多样化的数据集仍然是一个挑战。在本工作中，我们基于前向-后向（Forward-Backward，FB）算法，研究真实机器人系统上四足控制的在线零样本强化学习。我们观察到，无定向的探索会产生多样性较低的数据，导致下游性能较差，使策略难以直接部署到硬件上。因此，我们提出了FB-MEBE，一种将无监督行为探索策略与正则化评论家（critic）相结合的在线零样本强化学习算法。FB-MEBE通过最大化所达成行为分布的熵来促进探索。此外，正则化评论家将恢复出的策略塑造为更自然、物理上更合理的行为。我们通过实验证明，在一系列模拟下游任务中，FB-MEBE相比其他探索策略取得了更好的性能，并且其生成的自然策略无需进一步微调即可无缝部署到硬件上。视频和代码可在我们的网站上获取。

Cooperative Deep Reinforcement Learning for Fair RIS Allocation

公平RIS分配的合作深度强化学习

Authors: Martin Mark Zan, Stefan Schwarz
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.25572
Pdf link: https://arxiv.org/pdf/2603.25572
Abstract The deployment of reconfigurable intelligent surfaces (RISs) introduces new challenges for resource allocation in multi-cell wireless networks, particularly when user loads are uneven across base stations. In this work, we consider RISs as shared infrastructure that must be dynamically assigned among competing base stations, and we address this problem using a simultaneous ascending auction mechanism. To mitigate performance imbalances between cells, we propose a fairness-aware collaborative multi-agent reinforcement learning approach in which base stations adapt their bidding strategies based on both expected utility gains and relative service quality. A centrally computed performance-dependent fairness indicator is incorporated into the agents' observations, enabling implicit coordination without direct inter-base-station communication. Simulation results show that the proposed framework effectively redistributes RIS resources toward weaker-performing cells, substantially improving the rates of the worst-served users while preserving overall throughput. The results demonstrate that fairness-oriented RIS allocation can be achieved through cooperative learning, providing a flexible tool for balancing efficiency and equity in future wireless networks.
中文摘要 可重构智能表面（RIS）的部署为多单元无线网络中的资源分配带来了新的挑战，尤其是在基站用户负载不均的情况下。在本研究中，我们将RIS视为必须动态分配在竞争基站之间的共享基础设施，并通过同步升拍卖机制解决了这一问题。为减少单元间性能不平衡，我们提出一种公平意识的协作多智能体强化学习方法，基站根据预期效用收益和相对服务质量调整竞价策略。一个由中央计算的性能依赖公平性指标被整合进代理的观察中，实现了无需直接基站间通信的隐式协调。模拟结果表明，所提框架有效地将RIS资源重新分配给性能较差的单元，显著提升了服务最差用户的速度，同时保持整体吞吐量。结果表明，通过协作学习可以实现公平导向的RIS分配，为未来无线网络中平衡效率与公平性提供了灵活的工具。

LanteRn: Latent Visual Structured Reasoning

LanteRn：潜在视觉结构化推理

Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.25629
Pdf link: https://arxiv.org/pdf/2603.25629
Abstract While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.
中文摘要 虽然语言推理模型在许多任务中表现出色，但视觉推理对于当前大型多模态模型（LMM）来说仍然具有挑战性。因此，大多数LMM默认将感知内容口头化为文本，这对需要细致空间和视觉理解的任务来说是个强项限制。虽然近期方法通过调用工具或生成中间图像来思考图像，但它们要么依赖外部模块，要么通过直接在像素空间推理而产生不必要的计算。本文介绍了LanteRn框架，使LMM能够将语言与紧凑的潜在视觉表征交错，从而直接在潜在空间中进行视觉推理。LanteRn增强了视觉语言转换器，使其能够在推理过程中生成并关注连续的视觉思维嵌入。我们将模型分为两个阶段进行训练：监督微调以适应潜在状态下的基础视觉特征，随后进行强化学习，使潜在推理与任务层级效用对齐。我们基于三个以感知为中心的基准测试（VisCoT、V*和Blink）评估LanteRn，观察到视觉基础和细致推理的持续提升。这些结果表明，内部潜在表示为更高效的多模态推理提供了有前景的方向。

Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

持久的机器人世界模型：通过强化学习稳定多步rollout

Authors: Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25685
Pdf link: https://arxiv.org/pdf/2603.25685
Abstract Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
中文摘要 动作条件机器人世界模型在给定机器人动作序列后生成被操控场景的未来视频帧，为模拟传统物理引擎难以建模的任务提供了有前景的替代方案。然而，这些模型针对短期预测进行了优化，在自回归部署时会失效：每个预测片段都会反馈为下一个片段的上下文，导致误差累积，视觉质量迅速下降。我们通过以下贡献来应对这一问题。首先，我们引入一种强化学习（RL）后训练方案，该方案基于世界模型自身的自回归推演（rollout）而非真实历史来训练模型。为此，我们将近期一种面向扩散模型的对比强化学习目标适配到我们的设定中，并证明其收敛保证可完全沿用。其次，我们设计了一个训练协议，从同一推演状态生成并比较多个可变长度的候选未来，使高保真预测相对于低保真预测得到强化。第三，我们开发了高效的多视角视觉保真度奖励，将不同摄像头视角的互补感知指标结合起来，并在片段层面聚合，以提供密集、低方差的训练信号。第四，我们证明我们的方法在DROID数据集的推演保真度上达到了新的最先进水平，在所有指标上都优于最强基线（例如，外部摄像头的LPIPS降低了14%，腕部摄像头的SSIM提升了9.1%），赢得了98%的配对比较，并在人类盲评研究中获得了80%的偏好率。

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

R-C2：循环一致性强化学习提升多模态推理能力

Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25720
Pdf link: https://arxiv.org/pdf/2603.25720
Abstract Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
中文摘要 鲁棒的感知和推理需要在感官模态之间保持一致性。然而，当前的多模态模型常常违反这一原则，对同一概念的视觉表示和文本表示给出相互矛盾的预测。标准投票机制可能放大系统性偏差，我们没有用它来掩盖这些失败，而是表明跨模态不一致性为学习提供了丰富且自然的信号。我们介绍RC2，一种强化学习框架，通过强制跨模态循环一致性来解决内部冲突。通过要求模型进行逆向推断、切换模态，并通过前向推断可靠地重建答案，我们获得了密集且无需标签的奖励。这种循环约束促使模型自主对齐其内部表示。针对该结构进行优化可减少特定模态的错误，并将推理准确率最多提升7.6个点。我们的结果表明，高级推理不仅源于数据规模的扩展，还源于强制形成对世界结构一致的理解。

Keyword: diffusion policy

FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions

FODMP：快速一步扩散运动原语生成，用于时间依赖机器人动作

Authors: Xirui Shi, Arya Ebrahimi, Yi Hu, Jun Jin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24806
Pdf link: https://arxiv.org/pdf/2603.24806
Abstract Diffusion models are increasingly used for robot learning, but current designs face a clear trade-off. Action-chunking diffusion policies like ManiCM are fast to run, yet they only predict short segments of motion. This makes them reactive, but unable to capture time-dependent motion primitives, such as following a spring-damper-like behavior with built-in dynamic profiles of acceleration and deceleration. Recently, Movement Primitive Diffusion (MPD) partially addresses this limitation by parameterizing full trajectories using Probabilistic Dynamic Movement Primitives (ProDMPs), thereby enabling the generation of temporally structured motions. Nevertheless, MPD integrates the motion decoder directly into a multi-step diffusion process, resulting in prohibitively high inference latency that limits its applicability in real-time control settings. We propose FODMP (Fast One-step Diffusion of Movement Primitives), a new framework that distills diffusion models into the ProDMPs trajectory parameter space and generates motion using a single-step decoder. FODMP retains the temporal structure of movement primitives while eliminating the inference bottleneck through single-step consistency distillation. This enables robots to execute time-dependent primitives at high inference speed, suitable for closed-loop vision-based control. On standard manipulation benchmarks (MetaWorld, ManiSkill), FODMP runs up to 10 times faster than MPD and 7 times faster than action-chunking diffusion policies, while matching or exceeding their success rates. Beyond speed, by generating fast acceleration-deceleration motion primitives, FODMP allows the robot to intercept and securely catch a fast-flying ball, whereas action-chunking diffusion policy and MPD respond too slowly for real-time interception.
中文摘要 扩散模型越来越多地被用于机器人学习，但当前设计面临明显的权衡。像ManiCM这样的动作分块扩散策略运行迅速，但只能预测短的运动片段。这使得它们具有反应性，却无法捕捉时间依赖的运动原语，例如带有内置加减速动态特性、类似弹簧阻尼器的行为。最近，运动原语扩散（MPD）通过概率动态运动原语（ProDMPs）参数化完整轨迹，部分解决了这一限制，从而实现了时间结构化运动的生成。然而，MPD将运动解码器直接集成到多步扩散过程中，导致推理延迟极高，限制了其在实时控制环境中的适用性。我们提出了FODMP（快速单步扩散运动原语），这是一种新框架，将扩散模型蒸馏到ProDMPs的轨迹参数空间中，并利用单步解码器生成运动。FODMP保留了运动原语的时间结构，同时通过单步一致性蒸馏消除了推理瓶颈。这使得机器人能够以高推理速度执行时间依赖的原语，适合基于视觉的闭环控制。在标准操控基准测试（MetaWorld、ManiSkill）上，FODMP的运行速度最高可达MPD的10倍、动作分块扩散策略的7倍，同时成功率与二者相当甚至更高。除了速度外，FODMP通过生成快速加减速的运动原语，使机器人能够拦截并稳固接住高速飞行的球，而动作分块扩散策略和MPD响应速度过慢，无法实现实时拦截。