Arxiv Papers of Today

Generated: 2026-03-26 17:00:25 (UTC+8); arXiv release time: 2026-03-26 20:00 EDT (2026-03-27 08:00 UTC+8)

33 relevant papers today

Keyword: reinforcement learning

Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

Authors: Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Dingqiao Wen, Pan Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23550
Pdf link: https://arxiv.org/pdf/2603.23550
Abstract Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves better convergence than existing baselines. Elaborate trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at this https URL.
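The turn-wise reward idea in this abstract can be illustrated with a minimal sketch. It assumes the implicit process reward takes the commonly used form of a scaled log-likelihood ratio per token, summed over the tokens of each turn; the function names, the `beta` value, and the zero-mean, unit-variance normalization are illustrative assumptions and are not taken from the paper.

```python
import math

def turn_rewards(token_log_ratios, turn_ids, beta=0.1):
    """Sum beta * log-ratio over the tokens of each turn (assumed reward form)."""
    totals = {}
    for ratio, turn in zip(token_log_ratios, turn_ids):
        totals[turn] = totals.get(turn, 0.0) + beta * ratio
    return [totals[t] for t in sorted(totals)]

def normalize(rewards, eps=1e-8):
    """Shift and scale turn rewards to zero mean and unit variance (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

# Hypothetical per-token log-ratios for a two-turn dialogue.
raw = turn_rewards([1.0, 2.0, -1.0, 0.5], [0, 0, 1, 1], beta=0.5)
scaled = normalize(raw)
```

Aggregating at the turn level averages out token-level noise, which is one plausible reading of why the abstract calls turn-level signals more robust.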

Safe Reinforcement Learning with Preference-based Constraint Inference

Authors: Chenglin Li, Guangchun Ruan, Hua Geng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23565
Pdf link: https://arxiv.org/pdf/2603.23565
Abstract Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which is not realistic in many real-world applications. How to cheaply and reliably learn these constraints is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we find that the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. The impact of BT models on downstream policy learning also remains poorly understood in the literature. To address the above knowledge gaps, we propose a novel approach, namely Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, a two-stage training strategy is deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms the state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, which has great potential in a range of safety-critical applications.
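For readers unfamiliar with the Bradley-Terry baseline that this abstract criticizes, here is a minimal sketch of a standard BT preference model applied to trajectory costs. The sign convention (lower cost means "preferred as safer") and the function names are assumptions; the paper's dead zone mechanism is not described in the abstract and is not reproduced here.

```python
import math

def bt_prob_safer(cost_a, cost_b):
    """Standard Bradley-Terry: P(a preferred as safer than b) = sigmoid(cost_b - cost_a)."""
    return 1.0 / (1.0 + math.exp(cost_a - cost_b))

def bt_nll(pairs):
    """Mean negative log-likelihood over (cost_of_preferred, cost_of_rejected) pairs."""
    return -sum(math.log(bt_prob_safer(ca, cb)) for ca, cb in pairs) / len(pairs)

# Hypothetical cost estimates for two labeled preference pairs.
loss = bt_nll([(0.2, 1.5), (0.0, 3.0)])
```

Because the sigmoid depends only on the cost difference and is symmetric around zero, this model treats over- and under-estimation alike, which is consistent with the abstract's remark that it fails to capture asymmetric, heavy-tailed costs.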

Utilizing Adversarial Training for Robust Voltage Control: An Adaptive Deep Reinforcement Learning Method

Authors: Sungjoo Chung, Ying Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.23648
Pdf link: https://arxiv.org/pdf/2603.23648
Abstract Adversarial training is a defense method that trains machine learning models on intentionally perturbed attack inputs, so they learn to be robust against adversarial examples. This paper develops a robust voltage control framework for distribution networks with high penetration of distributed energy resources (DERs). Conventional voltage control methods are vulnerable to strategic cyber attacks, as they typically consider only random or black-box perturbations. To address this, we formulate white-box adversarial attacks using Projected Gradient Descent (PGD) and train a deep reinforcement learning (DRL) agent adversarially. The resulting policy adapts in real time to high-impact, strategically optimized perturbations. Simulations on DER-rich networks show that the approach maintains voltage stability and operational efficiency under realistic attack scenarios, highlighting the effectiveness of gradient-based adversarial DRL in enhancing robustness and adaptability in modern distribution system control.
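The Projected Gradient Descent attack named in this abstract can be sketched generically as follows. This is a minimal L-infinity PGD loop on a toy loss, assuming the attacker can query the gradient of the loss with respect to the observation (the white-box setting); the linear "voltage estimate", the weights `W`, and all parameter values are hypothetical and are not the paper's setup.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_attack(x, grad_fn, epsilon=0.1, step=0.02, iters=20):
    """L-infinity PGD: take signed gradient ascent steps, then project
    back into the epsilon-ball around the clean observation x."""
    x_adv = list(x)
    for _ in range(iters):
        grad = grad_fn(x_adv)
        x_adv = [xi + step * sign(g) for xi, g in zip(x_adv, grad)]
        x_adv = [min(max(xi, x0 - epsilon), x0 + epsilon)
                 for xi, x0 in zip(x_adv, x)]
    return x_adv

# Toy loss: squared deviation of a hypothetical linear estimate from 1.0.
W = [0.5, -0.3, 0.2]

def loss(obs):
    return (sum(w * o for w, o in zip(W, obs)) - 1.0) ** 2

def loss_grad(obs):
    err = sum(w * o for w, o in zip(W, obs)) - 1.0
    return [2.0 * err * w for w in W]

clean = [1.0, 1.0, 1.0]
adv = pgd_attack(clean, loss_grad)
```

In adversarial training, perturbed observations such as `adv` would be fed to the agent during learning so that the resulting policy stays stable under such inputs.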

Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL

Authors: Igor Jankowski
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23722
Pdf link: https://arxiv.org/pdf/2603.23722
Abstract While Multi-Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro-frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of depending on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state-value divergence in a Twin-Critic architecture). To formalize this, we structure the environment as a Semi-Markov Decision Process (SMDP) and build the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off-ball execution without deteriorating centralized task dominance.
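The two uncertainty signals named in this abstract can be combined into a simple trigger, sketched below. This assumes the gate fires when either signal exceeds a threshold; the OR rule, the threshold values, and the function names are illustrative assumptions, since the abstract does not specify how the two gates are combined.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_infer(probs, v1, v2, entropy_threshold=0.5, divergence_threshold=0.2):
    """Run a fresh network inference only when the policy is uncertain
    (high entropy) or the twin critics disagree (large value gap)."""
    aleatoric = shannon_entropy(probs)
    epistemic = abs(v1 - v2)
    return aleatoric > entropy_threshold or epistemic > divergence_threshold

confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
```

When `should_infer` returns False, an agent in this sketch would keep executing its previous action and skip the forward pass, which is where the compute savings would come from.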

BXRL: Behavior-Explainable Reinforcement Learning

Authors: Ram Rachum, Yotam Amitai, Yonatan Nakar, Reuth Mirsky, Cameron Allen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23738
Pdf link: https://arxiv.org/pdf/2603.23738
Abstract A major challenge of Reinforcement Learning is that agents often learn undesired behaviors that seem to defy the reward structure they were given. Explainable Reinforcement Learning (XRL) methods can answer queries such as "explain this specific action", "explain this specific trajectory", and "explain the entire policy". However, XRL lacks a formal definition for behavior as a pattern of actions across many episodes. We provide such a definition, and use it to enable a new query: "Explain this behavior". We present Behavior-Explainable Reinforcement Learning (BXRL), a new problem formulation that treats behaviors as first-class objects. BXRL defines a behavior measure as any function $m : \Pi \to \mathbb{R}$, allowing users to precisely express the pattern of actions that they find interesting and measure how strongly the policy exhibits it. We define contrastive behaviors that reduce the question "why does the agent prefer $a$ to $a'$?" to "why is $m(\pi)$ high?" which can be explored with differentiation. We do not implement an explainability method; we instead analyze three existing methods and propose how they could be adapted to explain behavior. We present a port of the HighwayEnv driving environment to JAX, which provides an interface for defining, measuring, and differentiating behaviors with respect to the model parameters.

Self Paced Gaussian Contextual Reinforcement Learning

Authors: Mohsen Sahraei Ardakani, Rui Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23755
Pdf link: https://arxiv.org/pdf/2603.23755
Abstract Curriculum learning improves reinforcement learning (RL) efficiency by sequencing tasks from simple to complex. However, many self-paced curriculum methods rely on computationally expensive inner-loop optimizations, limiting their scalability in high-dimensional context spaces. In this paper, we propose Self-Paced Gaussian Curriculum Learning (SPGL), a novel approach that avoids costly numerical procedures by leveraging a closed-form update rule for Gaussian context distributions. SPGL maintains the sample efficiency and adaptability of traditional self-paced methods while substantially reducing computational overhead. We provide theoretical guarantees on convergence and validate our method across several contextual RL benchmarks, including the Point Mass, Lunar Lander, and Ball Catching environments. Experimental results show that SPGL matches or outperforms existing curriculum methods, especially in hidden context scenarios, and achieves more stable context distribution convergence. Our method offers a scalable, principled alternative for curriculum generation in challenging continuous and partially observable domains.

Human, AI, and Hybrid Ensembles for Detection of Adaptive, RL-based Social Bots

Authors: Valerio La Gatta, Nathan Subrahmanian, Kaitlyn Wang, Larry Birnbaum, V.S. Subrahmanian
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2603.23796
Pdf link: https://arxiv.org/pdf/2603.23796
Abstract The use of reinforcement learning to dynamically adapt and evade detection is now well-documented in several cybersecurity settings including Covert Social Influence Operations (CSIOs), in which bots try to spread disinformation. While AI bot detectors have improved greatly, they are largely limited to detecting static bots that do not adapt dynamically. We present the first systematic study comparing the ability of humans, AI models, and hybrid Human-AI ensembles in detecting adaptive bots powered by reinforcement learning. Using data from a controlled, IRB-approved, five-day experiment with participants interacting on a social media platform infiltrated by RL-trained bots spreading disinformation to influence participants on 4 topics, we examine factors potentially shaping human detection capabilities: demographic characteristics, temporal learning effects, social network position, engagement patterns, and collective intelligence mechanisms. We first test 13 hypotheses comparing human bot detection performance against state-of-the-art AI approaches utilizing both traditional machine learning and large language models. We further investigate several aggregation strategies that combine human reports of bots with AI predictions, as well as retraining protocols that leverage human supervision. Our findings challenge intuitive assumptions about bot detection, reveal unexpected patterns in how humans identify bots, and show that combining human bot reports with AI predictions outperforms humans alone and AI alone. We conclude with a discussion of the practical implications of these results for industry.

Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

Authors: Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.23838
Pdf link: https://arxiv.org/pdf/2603.23838
Abstract Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.
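Classical Prioritized Planning, the backbone named in this abstract, can be sketched as follows: agents are planned one at a time in a given priority order, each avoiding the paths already reserved by higher-priority agents. This is a simplified one-shot illustration on a small grid, not the paper's rolling-horizon lifelong planner; the priority `order` is passed in by hand where the paper would use a learned policy, and all function names are hypothetical.

```python
from collections import deque

def at(path, t):
    """Position of an agent at time t; agents wait at their last cell."""
    return path[min(t, len(path) - 1)]

def conflicts(other, cell, nxt, t):
    """Vertex or swap conflict when moving cell -> nxt between t and t + 1."""
    vertex = at(other, t + 1) == nxt
    swap = at(other, t) == nxt and at(other, t + 1) == cell
    return vertex or swap

def plan_single(start, goal, grid, reserved, horizon=50):
    """Breadth-first search over (cell, time) that respects reserved paths."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    while queue:
        cell, t, path = queue.popleft()
        if cell == goal:
            return path
        if t >= horizon:
            continue
        for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue  # off the grid or blocked by an obstacle
            if (nxt, t + 1) in seen:
                continue
            if any(conflicts(other, cell, nxt, t) for other in reserved):
                continue
            seen.add((nxt, t + 1))
            queue.append((nxt, t + 1, path + [nxt]))
    return None

def prioritized_planning(starts, goals, grid, order):
    """Plan agents sequentially in priority order; return None on failure."""
    paths = {}
    for agent in order:
        path = plan_single(starts[agent], goals[agent], grid, list(paths.values()))
        if path is None:
            return None
        paths[agent] = path
    return paths

# Two agents swapping ends of a 2 x 3 open grid (0 = free cell).
grid = [[0, 0, 0], [0, 0, 0]]
paths = prioritized_planning({0: (0, 0), 1: (0, 2)}, {0: (0, 2), 1: (0, 0)}, grid, order=[0, 1])
```

The sketch omits some checks a full solver needs, such as conflicts with an agent that is already parked at its goal when a later reservation passes through it. The quality of the result depends strongly on `order`, which is the decision the abstract proposes to learn.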

HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Authors: Ken Ding
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23871
Pdf link: https://arxiv.org/pdf/2603.23871
Abstract Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.
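The vanishing-gradient problem on "cliff" prompts described in this abstract can be seen in a small sketch. It assumes a group-relative advantage of the kind used in GRPO-style training (reward minus group mean, divided by group standard deviation); the abstract does not name the underlying RL algorithm, so that choice, the function names, and the example prompts are assumptions.

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def cliff_prompts(rollout_rewards):
    """Prompts whose rollouts all failed (every reward is 0)."""
    return [p for p, rs in rollout_rewards.items() if all(r == 0 for r in rs)]

# Hypothetical binary rewards for four rollouts per prompt.
batch = {"prompt_a": [0, 0, 0, 0], "prompt_b": [0, 1, 0, 1]}
cliffs = cliff_prompts(batch)
```

For `prompt_a` every advantage is exactly zero, so a policy-gradient update weighted by these advantages contributes nothing. Those are the prompts the abstract routes to privileged self-distillation instead.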

The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search

Authors: Forest Agostinelli
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23873
Pdf link: https://arxiv.org/pdf/2603.23873
Abstract DeepXube is a free and open-source Python package and command-line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited-horizon Bellman-based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer-set programming. A robust multiple-inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A* and Q* search and beam search, are easily employed to solve pathfinding problems through command-line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at this https URL.

ProcureGym: A Multi-Agent Markov Game Framework for Modeling National Volume-based Drug Procurement

Authors: Jia Wang, Qian Xu, Xuanwen Ding, Zhuangqi Li, Chao He, Bao Liu, Zhongyu Wei
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2603.23880
Pdf link: https://arxiv.org/pdf/2603.23880
Abstract In this paper, we introduce ProcureGym, a data-driven multi-agent simulation platform that models China's National Volume-Based drug Procurement (NVBP) as a Markov Game. Based on real-world data from 7 rounds of NVBP (covering 325 drugs and 2,267 firms), the platform establishes a high-fidelity simulation environment. Within this framework, we evaluate diverse agent models, including Reinforcement Learning (RL), Large Language Model (LLM), and Rule-based algorithms. Experimental results demonstrate that RL agents achieve superior winner alignment and profits. Further analyses show that maximum valid bidding price and procurement volume dominate strategic outcomes. ProcureGym thus serves as a rigorous instrument for assessing policy impacts and formulating future procurement strategies.

Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

Authors: Guopeng Li, Matthijs T.J. Spaan, Julian F.P. Kooij
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.23889
Pdf link: https://arxiv.org/pdf/2603.23889
Abstract When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.
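The truncated quantile critics mentioned in this abstract can be illustrated with the basic truncation step: pool the quantile estimates of several critics, sort them, and discard a fixed number from one tail before averaging. The sketch below drops the largest quantiles, as the original truncated-quantile-critics method does for reward values; which tail COX-Q truncates for cost values is not stated in the abstract, so the direction, the function name, and the numbers are assumptions.

```python
def truncated_quantile_target(critic_quantiles, drop_per_critic=2):
    """Pool quantiles from all critics, drop the largest
    drop_per_critic * num_critics of them, and average the rest."""
    pooled = sorted(q for critic in critic_quantiles for q in critic)
    keep = len(pooled) - drop_per_critic * len(critic_quantiles)
    kept = pooled[:keep]
    return sum(kept) / len(kept)

# Hypothetical quantile estimates from two critics for one state-action pair.
quantiles = [[1.0, 2.0, 3.0, 10.0], [0.0, 2.0, 4.0, 12.0]]
target = truncated_quantile_target(quantiles, drop_per_critic=1)
```

The spread of the pooled quantiles, or the disagreement between critics, could also serve as an uncertainty signal, which matches the abstract's remark that quantile critics guide exploration.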

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Authors: Guy Zamir, Matthew Zurek, Yudong Chen
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.23926
Pdf link: https://arxiv.org/pdf/2603.23926
Abstract Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning

Authors: Yankai Wang, Yiding Sun, Qirui Wang, Pengbo Li, Chaoyi Lu, Dongxu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.23957
Pdf link: https://arxiv.org/pdf/2603.23957
Abstract Understanding spatial dynamics and semantics in point clouds is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

Authors: Lijing Luo, Yiben Luo, Alexey Gorbatovski, Sergey Kovalchuk, Xiaodan Liang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23964
Pdf link: https://arxiv.org/pdf/2603.23964
Abstract The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a "Semantic Prior" ecosystem dominated by Large Language Models (LLMs) and a "Domain-Specific Generalization" ecosystem. Furthermore, we characterize the "cognitive fingerprints" of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage

策略引导威胁狩猎：一个支持大型语言模型的框架，支持 Splunk SOC 分流

Authors: Rishikesh Sahay, Bell Eapen, Weizhi Meng, Md Rasel Al Mamun, Nikhil Kumar Dora, Manjusha Sumasadan, Sumit Kumar Tetarave, Rod Soto
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23966
Pdf link: https://arxiv.org/pdf/2603.23966
Abstract With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security solutions have become inadequate for organizational threat hunting. Moreover, Security Operations Center (SOC) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset. The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus contributes to the cybersecurity and threat hunting literature by presenting a novel threat hunting framework for security decision-making and by promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.
中文摘要 随着网络空间中高级持续威胁（APT）不断演变，传统的安全解决方案已不足以应对组织的威胁猎杀。此外，安全运营中心（SOC）分析师常常感到不堪重负，难以分析来自组织中各种设备接收的大量日志。为应对这些挑战，我们提出了一个自动化且动态的威胁狩猎框架，用于监控不断演变的威胁，适应不断变化的网络状况，并基于风险进行优先级处理，以缓解可疑和恶意流量。通过将代理人工智能与成熟的SIEM平台Splunk集成，我们开发了一个独特的威胁狩猎框架。该框架系统且无缝地整合了不同的威胁狩猎模块，涵盖从流量摄取到使用基于重建的自编码器进行异常评估、带有两层初始分流的深度强化学习（DRL）以及用于上下文分析的大型语言模型（LLM）。我们对该框架进行了公开基准数据集和模拟数据集的评估。实验结果表明，该框架能够自主地有效适应不同的 SOC 目标，并识别可疑和恶意流量。该框架通过支持SOC分析师在决策中阻断、允许或监控网络流量，提升了运营效率。本研究通过提出新的威胁狩猎框架，促进了网络安全和威胁狩猎文献的完善，同时推动了积累性研究，以开发更有效的框架以应对不断演变的网络威胁。

PCHC: Enabling Preference Conditioned Humanoid Control via Multi-Objective Reinforcement Learning

PCHC：通过多目标强化学习实现偏好条件人形控制

Authors: Huanyu Li, Dewei Wang, Xinmiao Wang, Xinzhe Liu, Peng Liu, Chenjia Bai, Xuelong Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24047
Pdf link: https://arxiv.org/pdf/2603.24047
Abstract Humanoid robots often need to balance competing objectives, such as maximizing speed while minimizing energy consumption. While current reinforcement learning (RL) methods can master complex skills like fall recovery and perceptive locomotion, they are constrained by fixed weighting strategies that produce a single suboptimal policy, rather than providing a diverse set of solutions for sophisticated multi-objective control. In this paper, we propose a novel framework leveraging Multi-Objective Reinforcement Learning (MORL) to achieve Preference-Conditioned Humanoid Control (PCHC). Unlike conventional methods that require training a series of policies to approximate the Pareto front, our framework enables a single, preference-conditioned policy to exhibit a wide spectrum of diverse behaviors. To effectively integrate these requirements, we introduce a Beta distribution-based alignment mechanism based on preference vectors modulating a Mixture-of-Experts (MoE) module. We validated our approach on two representative humanoid tasks. Extensive simulations and real-world experiments demonstrate that the proposed framework allows the robot to adaptively shift its objective priorities in real-time based on the input preference condition.
中文摘要 类人机器人常常需要在最大化速度和最小化能耗的同时，平衡两个竞争目标。虽然当前强化学习（RL）方法能够掌握诸如跌倒恢复和感知移动等复杂技能，但它们受限于固定权重策略，导致单一次优策略，而非提供多样化的复杂多目标控制解决方案。本文提出了一个利用多目标强化学习（MORL）实现偏好条件人形控制（PCHC）的新框架。与需要训练一系列策略以近似帕累托前沿的方法不同，我们的框架使单一的偏好条件策略能够展现广泛的多样行为。为有效整合这些需求，我们引入基于偏好向量的基于Beta分布的对齐机制，调制专家混合模块（Mixture-of-Experts，MoE模块）。我们在两个具有代表性的类人生物任务上验证了我们的方法。大量模拟和真实世界实验表明，所提框架允许机器人根据输入偏好条件实时自适应地调整其目标优先级。
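
The idea of conditioning one policy on a preference vector can be illustrated with a small sketch. It assumes two objectives (for example speed and energy), a preference weight drawn from a Beta distribution, and a plain weighted sum of the per-objective rewards. The function names, the Beta parameters, and the linear scalarisation are all assumptions made for illustration; the abstract does not spell out how PCHC combines objectives or how its Beta-based alignment mechanism works.

```python
import random


def sample_preference(alpha=1.0, beta=1.0, rng=random):
    """Draw a two-objective preference vector (w, 1 - w) with w ~ Beta(alpha, beta).

    The Beta parameters are illustrative assumptions, not values from the paper.
    """
    w = rng.betavariate(alpha, beta)
    return (w, 1.0 - w)


def scalarise(rewards, preference):
    """Weighted sum of per-objective rewards (a common MORL baseline, assumed here)."""
    if len(rewards) != len(preference):
        raise ValueError("rewards and preference must have the same length")
    return sum(w * r for w, r in zip(preference, rewards))


# Hypothetical per-objective rewards: (speed reward, energy penalty).
step_rewards = (2.0, -1.0)
fixed_preference = (0.25, 0.75)
combined = scalarise(step_rewards, fixed_preference)

# A preference sampled per episode would be fed to the policy as an extra input.
sampled_preference = sample_preference(rng=random.Random(0))
```

In a setup like this, changing the preference vector at run time changes which objective the same policy favours, which is the behaviour the abstract describes.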

Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

迈向有效体验式学习：双重指导的应用与内化

Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24093
Pdf link: https://arxiv.org/pdf/2603.24093
Abstract Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model's internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.
中文摘要 近年来，强化学习（RL）已成为提升大型语言模型（LLMs）能力的重要方法。特别是，基于可验证奖励的强化学习（RLVR）已成为推理任务中一个有前景的范式。然而，现有基于强化学习的训练仍只是对人类学习的粗略近似。人类学习者利用外部和内部经验来引导探索，逐步将有用的轨迹内化为稳定的知识。基于这一空白，我们提出问题：LLM如何在RLVR训练中更好地利用和内化经验？为回答这个问题，我们提出了双重引导优化（Dual Guidance Optimization，DGO），这是一个结合外部经验和内部经验来提升训练效果的统一框架。具体来说，DGO 首先从先前探索的轨迹构建一个经验库。随后，策略在经验库与模型内部知识的联合指导下进行探索。由此产生的轨迹进一步用于完善经验库和优化模型参数，形成经验利用与内化的闭环。实验显示，DGO持续优于基线方法，表明更好的经验利用和内化能带来更有效的推理。

Likelihood hacking in probabilistic program synthesis

概率程序合成中的似然黑客

Authors: Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton
Subjects: Machine Learning (cs.LG); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2603.24126
Pdf link: https://arxiv.org/pdf/2603.24126
Abstract When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$'s conditions as $\texttt{SafeStan}$, an LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.
中文摘要 当语言模型通过强化学习（RL）训练来编写概率程序时，它们可能会通过生成数据分布无法归一化的程序来人为抬高边际似然奖励，而非更好地拟合数据。我们将这种失效称为似然黑客（LH）。我们在一种核心概率编程语言（PPL）中形式化了LH，并给出防范它的充分语法条件，证明满足这些条件的安全语言片段 $\mathcal{L}_{\text{safe}}$ 无法产生似然黑客程序。通过实证，我们表明，生成PyMC代码的GRPO训练模型在最初几步训练中就发现了LH漏洞，导致违规率远高于未训练模型基线。我们将 $\mathcal{L}_{\text{safe}}$ 的条件实现为 $\texttt{SafeStan}$，这是 Stan 的抗 LH 修改版，并通过实证证明它能在优化压力下防止 LH。这些结果表明，语言层级的安全约束在自动化贝叶斯模型发现中既有理论基础，又在实践中有效。
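
A toy numerical illustration of why a distribution that fails to normalise can inflate a likelihood-based reward. This is only a sketch under simplifying assumptions: it uses a discrete two-outcome "model" and a plain log-likelihood rather than the marginal likelihood of a probabilistic program, and the names `honest` and `hacked` are hypothetical. The runtime mass check shown here is for illustration only; the paper's remedy is a set of syntactic conditions on the language, not a check of this kind.

```python
import math


def log_likelihood(mass, data):
    """Sum of log 'probabilities' that the model assigns to the observed data."""
    return sum(math.log(mass[x]) for x in data)


def total_mass(mass):
    """Total mass assigned across all outcomes; equals 1 for a proper distribution."""
    return sum(mass.values())


def is_normalised(mass, tol=1e-9):
    """True when the assigned masses sum to 1 within a tolerance."""
    return abs(total_mass(mass) - 1.0) <= tol


data = [0, 1, 1, 0]

# A proper distribution over {0, 1}.
honest = {0: 0.5, 1: 0.5}

# An improper "distribution": it treats both outcomes identically, so it fits
# the data no better, but its masses sum to 1.8 and every log term is larger.
hacked = {0: 0.9, 1: 0.9}

honest_score = log_likelihood(honest, data)
hacked_score = log_likelihood(hacked, data)
```

Here `hacked_score` exceeds `honest_score` even though the two models treat the outcomes identically, which is the kind of reward inflation the abstract calls likelihood hacking.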

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

导师-学生强化学习：一套动态课程，用于强健的深度伪造检测

Authors: Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.24139
Pdf link: https://arxiv.org/pdf/2603.24139
Abstract Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at this https URL.
中文摘要 标准的深度伪造检测监督训练对所有样本的统一重要性，这对于学习稳健且可推广特征来说可能不理想。本研究提出一种新的导师-学生强化学习（TSRL）框架，以动态优化培训课程。我们的方法将训练过程建模为马尔可夫决策过程，其中“导师”代理学习引导“学生”（深度伪造检测器）。Tutor作为一个近端策略优化（PPO）代理实现，观察每个训练样本丰富的状态表示，不仅封装其视觉特征，还包含其历史学习动态，如EMA丢失和遗忘计数。基于该状态，导师通过对样本丢失分配连续权重（0-1）来执行一个动作，从而动态地重新加权训练批次。导师的奖励基于学生的即时表现变化，特别是从错误到正确预测的转变。这种策略鼓励导师学习优先考虑高价值样本的课程，比如难但可学的例子，从而实现更高效、更有效的培训过程。我们证明，这种自适应课程相比传统训练方法提升了学生对隐形操控技术的泛化能力。代码可在此 https URL 访问。
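
The re-weighting loop described in the abstract can be sketched with three small helpers: an exponential moving average (EMA) of a sample's loss, a batch loss weighted by the Tutor's per-sample actions, and a reward that counts incorrect-to-correct transitions. The decay value, the mean normalisation, and the +1 per transition reward are assumptions made for illustration; the abstract does not give these details, and the PPO Tutor itself is not shown.

```python
def ema_update(previous, value, decay=0.9):
    """Exponential moving average of a per-sample loss (decay is an assumed value)."""
    return decay * previous + (1.0 - decay) * value


def weighted_batch_loss(losses, weights):
    """Mean of per-sample losses scaled by Tutor-assigned weights in [0, 1]."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    if not losses:
        raise ValueError("batch must be non-empty")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights must lie in [0, 1]")
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


def tutor_reward(correct_before, correct_after):
    """Count samples that moved from an incorrect to a correct prediction."""
    return sum(
        1.0
        for before, after in zip(correct_before, correct_after)
        if (not before) and after
    )


# Hypothetical batch of two samples: the Tutor down-weights the first one.
batch_loss = weighted_batch_loss([2.0, 4.0], [0.5, 1.0])

# One of three samples flips from incorrect to correct after the update.
reward = tutor_reward([False, True, False], [True, True, False])
```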

A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

深入探讨基于合成数据和课程的代码生成扩展强化学习

Authors: Cansu Sancaktar, David Zhang, Gabriel Synnaeve, Taco Cohen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24202
Pdf link: https://arxiv.org/pdf/2603.24202
Abstract Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e. easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code and in most cases out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.
中文摘要 强化学习（RL）已成为改进大型语言模型超越监督微调的强大范式，但大规模持续提升性能仍是一个开放挑战，因为数据多样性和结构，而非仅仅是体积，成为限制因素。我们通过引入可扩展的多回合合成数据生成流程来解决这个问题，教师模型基于上下文中的学生表现总结迭代细化问题，生成结构化难度递进，无需教师微调。与单回合生成相比，这种多回合方法显著提升了有效合成问题的产出率，并自然产生了基于课程的基础训练，即同一核心任务的简单和困难变体。我们系统地研究了任务难度、课程安排和环境多样性在强化学习培训中如何相互作用，涵盖Llama3.1-8B Instruct和Qwen3-8B基础模型家族，并对Qwen2.5-32B进行了额外的扩展实验。我们的结果表明，合成增强持续提升了领域内代码，在大多数情况下提升了域外数学表现，并提供了关于课程设计与数据多样性如何共同塑造强化学习训练动态的实证见解。

SumRank: Aligning Summarization Models for Long-Document Listwise Reranking

SumRank：对齐长文档列表重排序的总结模型

Authors: Jincheng Feng, Wenhan Liu, Zhicheng Dou
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.24204
Pdf link: https://arxiv.org/pdf/2603.24204
Abstract Large Language Models (LLMs) have demonstrated superior performance in the listwise passage reranking task. However, directly applying them to rank long-form documents introduces both effectiveness and efficiency issues due to the substantially increased context length. To address this challenge, we propose a pointwise summarization model SumRank, aligned with downstream listwise reranking, to compress long-form documents into concise rank-aligned summaries before the final listwise reranking stage. To obtain our summarization model SumRank, we introduce a three-stage training pipeline comprising cold-start Supervised Fine-Tuning (SFT), specialized RL data construction, and rank-driven alignment via Reinforcement Learning. This paradigm aligns SumRank with downstream ranking objectives to preserve relevance signals. We conduct extensive experiments on five benchmark datasets from the TREC Deep Learning tracks (TREC DL 19-23). Results show that our lightweight SumRank model achieves state-of-the-art (SOTA) ranking performance while significantly improving efficiency by reducing both summarization overhead and reranking complexity.
中文摘要 大型语言模型（LLMs）在列表式文章重排序任务中表现出优异性能。然而，直接应用于长篇文档的排名会带来效率和效率问题，因为上下文长度大幅增加。为应对这一挑战，我们提出了一个逐点摘要模型SumRank，配合下游的列表重新排序，将长篇文档压缩成简明且对齐的排序摘要，以备最终列表排序阶段。为了获得我们的总结模型SumRank，我们引入了三阶段训练流程，包括冷启动监督微调（SFT）、专门的强化学习数据构建，以及通过强化学习实现秩驱动的对齐。该范式使SumRank与下游排名目标保持一致，以保持相关性信号。我们对TREC深度学习轨道（TREC DL 19-23）中的五个基准数据集进行了大量实验。结果显示，我们的轻量级SumRank模型实现了最先进的（SOTA）排名性能，同时通过降低摘要开销和重新排序复杂度，显著提升了效率。

Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning

通过深度强化学习进行预测时空观察的去中心化端到端多AAV追踪

Authors: Yude Li, Zhexuan Zhou, Huizhe Li, Yanke Sun, Yenan Wu, Yichen Lai, Yiming Wang, Youmin Gong, Jie Mei
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24238
Pdf link: https://arxiv.org/pdf/2603.24238
Abstract Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state-of-the-art learning-based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.
中文摘要 在杂乱环境中实现分散式合作追踪对于自主空中群体来说具有挑战性，尤其是在部分且噪声感知的情况下。现有方法通常依赖抽象的几何特征或特权的地面真实状态，因此在现实环境中规避了感知的不确定性。我们提出了一个去中心化的端到端多智能体强化学习（MARL）框架，将原始LiDAR观测数据直接映射到连续控制指令。该框架的核心是预测时空观察（PSTO），这是一种以自我为中心的网格表示，能够将障碍几何与预测的对抗意图和队友运动对齐，形成统一的固定分辨率投影。基于PSTO，单一去中心化策略使智能体能够通过静态障碍、拦截动态目标并维持协同包围。模拟表明，所提方法相比依赖特权障碍信息的先进基于学习的方法，在捕获效率和竞争成功率上更为优越。此外，统一政策能够无缝扩展到不同团队规模，无需重新培训。最后，完全自主的户外实验在仅依靠机载感测和计算的四旋翼群上验证了该框架。

C-STEP: Continuous Space-Time Empowerment for Physics-informed Safe Reinforcement Learning of Mobile Agents

C-STEP：基于物理学的实时空间赋能，实现移动智能体的安全强化学习

Authors: Guihlerme Daubt, Adrian Redder
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.24241
Pdf link: https://arxiv.org/pdf/2603.24241
Abstract Safe navigation in complex environments remains a central challenge for reinforcement learning (RL) in robotics. This paper introduces Continuous Space-Time Empowerment for Physics-informed (C-STEP) safe RL, a novel measure of agent-centric safety tailored to deterministic, continuous domains. This measure can be used to design physics-informed intrinsic rewards by augmenting positive navigation reward functions. The reward incorporates the agent's internal states (e.g., initial velocity) and forward dynamics to differentiate safe from risky behavior. By integrating C-STEP with navigation rewards, we obtain an intrinsic reward function that jointly optimizes task completion and collision avoidance. Numerical results demonstrate fewer collisions, reduced proximity to obstacles, and only marginal increases in travel time. Overall, C-STEP offers an interpretable, physics-informed approach to reward shaping in RL, contributing to safety for agentic mobile robotic systems.
中文摘要 在复杂环境中的安全导航仍然是机器人强化学习（RL）面临的核心挑战。本文介绍了连续时空赋能用于物理知情（C-STEP）安全强化学习，这是一种针对确定性连续领域量身定制的以代理为中心安全的新型衡量方法。该指标可用于设计基于物理的内在奖励，通过增强正向导航奖励函数。奖励包含了主体的内部状态（例如初始速度）和前向动态，以区分安全与风险行为。通过将C-STEP与导航奖励集成，我们获得了内在奖励函数，能够共同优化任务完成和碰撞避免。数值结果显示碰撞减少，障碍物距离减少，行驶时间仅略有增加。总体而言，C-STEP提供了一种可解释、基于物理的强化学习奖励塑造方法，有助于智能移动机器人系统的安全性。

Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

在逆境条件下的启发式自进度学习领域自适应语义分割

Authors: Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.24322
Pdf link: https://arxiv.org/pdf/2603.24322
Abstract The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network's focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
中文摘要 语义类的学习顺序显著影响语义分割的无监督领域适应，尤其是在恶劣天气条件下。大多数现有课程依赖手工设计的启发式方法（如固定的不确定性指标），遵循静态的时间表，无法适应模型不断演变的高维训练动态，导致类别偏倚。受强化学习启发，我们将课程学习定位为顺序决策问题，并提出了自主课程调度器。该调度器由两个部分组成：（i）高维状态编码器，将模型的训练状态映射到潜在空间，并提取显示进展的关键特征;（ii）一个类别公平的策略梯度目标，确保各类间的均衡改进。结合混合源-目标监督，所学班级排名将网络重点指向每个阶段最具信息量的班级，从而实现更具适应性和动态性的学习。值得注意的是，我们的方法在三个广泛使用的基准测试（如ACDC、Dark Zurich和Nighttime Driving）上达到了最先进的性能，并在合成到真实语义分割中展现出泛化能力。

LATS: Large Language Model Assisted Teacher-Student Framework for Multi-Agent Reinforcement Learning in Traffic Signal Control

LATS：大语言模型辅助师生框架，用于交通信号控制中的多智能体强化学习

Authors: Yifeng Zhang, Peizhuo Li, Tingguang Zhou, Mingfeng Fan, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24361
Pdf link: https://arxiv.org/pdf/2603.24361
Abstract Adaptive Traffic Signal Control (ATSC) aims to optimize traffic flow and minimize delays by adjusting traffic lights in real time. Recent advances in Multi-agent Reinforcement Learning (MARL) have shown promise for ATSC, yet existing approaches still suffer from limited representational capacity, often leading to suboptimal performance and poor generalization in complex and dynamic traffic environments. On the other hand, Large Language Models (LLMs) excel at semantic representation, reasoning, and analysis, yet their propensity for hallucination and slow inference speeds often hinder their direct application to decision-making tasks. To address these challenges, we propose a novel learning paradigm named LATS that integrates LLMs and MARL, leveraging the former's strong prior knowledge and inductive abilities to enhance the latter's decision-making process. Specifically, we introduce a plug-and-play teacher-student learning module, where a trained embedding LLM serves as a teacher to generate rich semantic features that capture each intersection's topology structures and traffic dynamics. A much simpler (student) neural network then learns to emulate these features through knowledge distillation in the latent space, enabling the final model to operate independently from the LLM for downstream use in the RL decision-making process. This integration significantly enhances the overall model's representational capacity across diverse traffic scenarios, thus leading to more efficient and generalizable control strategies. Extensive experiments across diverse traffic datasets empirically demonstrate that our method enhances the representation learning capability of RL models, thereby leading to improved overall performance and generalization over both traditional RL and LLM-only approaches. [...]
中文摘要 自适应交通信号控制（ATSC）旨在通过实时调整红绿灯，优化交通流并最大限度减少延误。多智能体强化学习（MARL）的最新进展显示出ATSC的前景，但现有方法仍受表征能力有限的困扰，常常导致在复杂动态的流量环境中表现不佳且泛化能力不佳。另一方面，大型语言模型（LLMs）在语义表示、推理和分析方面表现出色，但其幻觉倾向和推理速度缓慢常常阻碍其直接应用于决策任务。为应对这些挑战，我们提出了一种名为LATS的新型学习范式，整合了大型语言模型和MARL，利用前者扎实的先验知识和归纳能力，提升后者的决策过程。具体来说，我们引入了一个即插即用的师生学习模块，其中一台训练有素的嵌入大型语言模型（LLM）作为教师，生成丰富的语义特征，捕捉每个路口的拓扑结构和交通动态。一个更简单的（学生）神经网络通过潜伏空间的知识蒸馏学习模拟这些特征，使最终模型能够独立于大型语言模型（LLM）运行，用于强化学习的后续决策过程。这种整合显著提升了模型在多种交通场景下的表现能力，从而实现更高效、更通用的控制策略。在多种流量数据集上的大量实验实证表明，我们的方法增强了强化学习模型的表征学习能力，从而提升了整体性能和泛化性，优于传统的强化学习和仅限大型语言模型的方法。[...]
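
The teacher-student step in the abstract, where a small network learns to emulate the LLM's features in latent space, can be sketched as a feature-matching loss. The mean squared error used here is an assumption chosen for illustration (the abstract does not name the distillation loss), and the feature vectors are hypothetical stand-ins for the teacher embedding and the student output.

```python
def latent_distillation_loss(student_features, teacher_features):
    """Mean squared error between student and teacher feature vectors."""
    if len(student_features) != len(teacher_features):
        raise ValueError("feature vectors must have the same length")
    if not student_features:
        raise ValueError("feature vectors must be non-empty")
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_features, teacher_features)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical features for one intersection.
teacher = [1.0, 4.0]   # produced once by the (frozen) teacher LLM
student = [1.0, 2.0]   # produced by the lightweight student network
loss = latent_distillation_loss(student, teacher)
```

Once a loss of this kind is small, the student can stand in for the teacher at decision time, which is why the final model no longer needs to call the LLM.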

CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control

CoordLight：学习去中心化协调，实现全网络交通信号控制

Authors: Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24366
Pdf link: https://arxiv.org/pdf/2603.24366
Abstract Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at this https URL
中文摘要 自适应交通信号控制（ATSC）对于缓解拥堵、最大化通行量以及促进不断扩张城市中的可持续出行至关重要。多智能体强化学习（MARL）最近展现出解决复杂交通动态的巨大潜力，但在去中心化环境中，部分可观测性和协调的复杂性仍然是制定可扩展高效控制策略的关键挑战。为应对这些挑战，我们提出了CoordLight，这是一个基于MARL的框架，旨在通过增强单个交汇点（代理）的决策能力以及与邻近代理的协调，提升邻里内交通，从而实现网络层面的交通优化。具体来说，我们介绍了队列动态状态编码（QDSE），这是一种基于车辆排队模型的新型状态表示，增强了代理分析、预测和响应本地交通动态的能力。我们还提出了一种先进的MARL算法，名为邻居感知策略优化（NAPO）。它集成了一种注意力机制，能够识别相邻代理之间的状态和动作依赖关系，旨在促进更协调的决策，并通过稳健的优势计算提升策略学习的更新。这使得代理能够识别并优先处理与有影响力邻居的关键交互，从而增强代理间的有针对性协调与协作。通过对三个由多达196个路口组成的真实交通数据集中，基于最先进的交通信号控制方法进行全面评估，我们实证显示CoordLight在多样化交通网络及不同交通流中始终表现出优越性能。代码可在此 https URL 获取

Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

通过循环一致性微调改进 Lean4 自动形式化

Authors: Arsen Shebzukhov
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.24372
Pdf link: https://arxiv.org/pdf/2603.24372
Abstract Autoformalization - automatically translating natural language mathematical texts into formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL' loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
中文摘要 自形式化——自动将自然语言数学文本翻译成如Lean4等形式证明语言——可以帮助加速AI辅助的数学研究，无论是通过证明验证还是证明搜索。我用LoRA微调Qwen3.5-2B，将自然语言优化到FineLeanCorpus上的Lean4形式化，并考虑三种训练方案：带课程学习的监督微调（SFT）（难度1到10）、无课程排序的SFT，以及使用组相对策略优化（GRPO）进行循环一致性奖励的强化学习。循环一致性衡量一个陈述在NL到Lean4再到NL'循环中保持意义的良好程度，该循环通过现成句子嵌入的余弦相似度计算。在FineLeanCorpus（FLC）和PutnamBench的未见子集上，RL显著优于两种SFT变体（FLC的平均周期一致性为0.669对0.513;PutnamBench为0.561对0.422），同时交叉熵损失仅增加0.011 nats，对形式化质量影响极小。课程顺序相较于洗牌式培训没有明显的优势。
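
A minimal sketch of the cycle consistency score described in the abstract, assuming the original statement (NL) and its round-trip version (NL') have already been turned into embedding vectors. The abstract only says that off-the-shelf sentence embeddings are used, so the vectors below are hypothetical placeholders; the cosine similarity step is the part taken from the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero embeddings
    return dot / (norm_u * norm_v)


def cycle_consistency(embedding_nl, embedding_nl_prime):
    """Score for one NL -> Lean4 -> NL' loop, given the two sentence embeddings."""
    return cosine_similarity(embedding_nl, embedding_nl_prime)


# Hypothetical embeddings: a faithful round trip and an unrelated one.
original = [1.0, 0.0, 1.0]
faithful_round_trip = [1.0, 0.0, 1.0]
unrelated_round_trip = [0.0, 1.0, 0.0]

high_score = cycle_consistency(original, faithful_round_trip)
low_score = cycle_consistency(original, unrelated_round_trip)
```

A score near 1 means the meaning survived the loop; in the GRPO regime the abstract describes, a score of this kind serves as the reward.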

Composer 2 Technical Report

Composer 2 技术报告

Authors: Cursor Research: Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, Lee Danilek, Less Wright, Lujing Cen, Luke Melas-Kyriazi, Michael Truell, Michiel de Jong, Naman Jain, Nate Schmidt, Nathan Wang, Niklas Muennighoff, Oleg Rybkin, Paul Loh, Phillip Kravtsov, Rishabh Yadav, Sahil Shah, Sam Kottler, Alexander M Rush, Shengtong Zhang, Shomil Jain, Sriram Sankar, Stefan Heule, Stuart H. Sul, Sualeh Asif, Victor Rong, Wanqi Zhu, William Lin, Yuchen Wu, Yuri Volkov, Yury Zemlyanskiy, Zack Holbrook, Zhiyuan Zhang
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.24477
Pdf link: https://arxiv.org/pdf/2603.24477
Abstract Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.
中文摘要 Composer 2 是一个专门为代理软件工程设计的模型。该模型展现了强大的长期规划和编码智能，同时保持高效解决问题以供交互使用的能力。该模型分为两个阶段进行训练：首先，持续的预训练以提升模型的知识和潜在编码能力;随后进行大规模强化学习，通过更强的推理、准确的多步执行以及在长期现实编码问题上的一致性，提升端到端编码性能。我们开发基础设施支持在部署模型中使用的同一光标框架下的培训，配备同等工具和结构，并使用与真实问题紧密匹配的环境。为了衡量模型在日益困难任务中的能力，我们引入了一个基于大型代码库（包括我们自身代码库）真实软件工程问题的基准测试。Composer 2 是一个前沿级编码模型，展示了训练强领域专用模型的过程。在我们的CursorBench评估中，该模型相比之前的Composer模型（61.3）实现了显著的准确性提升。在公开基准测试中，该模型在终端-Bench测试中得分为61.7，在SWE-bench多语言系统中得分为73.7，与最先进的系统相当。

Completeness of Unbounded Best-First Minimax and Descent Minimax

无界最佳优先极小极大和下降极小极大的完备性

Authors: Quentin Cohen-Solal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24572
Pdf link: https://arxiv.org/pdf/2603.24572
Abstract In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.
中文摘要 本文重点介绍双人完全信息博弈的搜索算法，其目标是确定最佳策略，理想情况下是必胜策略。遗憾的是，文献中的一些博弈搜索算法即使搜索时间无限，也无法总能确定必胜策略。例如，以下算法就是这样：无界最佳优先极小极大和下降极小极大，它们是最先进的无知识强化学习的核心算法。随后，这些算法采用所谓的补全技术（completion technique）得到了改进。然而，该技术是否足以改进这些算法，使其能够始终确定必胜策略，至今仍是一个悬而未决的问题。为回答这个问题，我们推广了这两种算法（使用补全技术的版本），并证明该类算法中的任何算法都能计算出最佳策略。最后，我们通过实验证明补全技术能提升获胜表现。

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Authors: Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.24575
Pdf link: https://arxiv.org/pdf/2603.24575
Abstract Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
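Chinese abstract (translated) Scalable Vector Graphics (SVG) is an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, the original vector source files are often lost or inaccessible, leaving only "flat" rasterized versions (such as PNG or JPEG) that are hard to modify or scale. Reconstructing these figures by hand is extremely labor-intensive and requires specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of vision-language models trained for complex, high-fidelity figure-to-SVG conversion. Although the task is inherently data-driven, existing datasets are typically small and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs curated from a diverse mix of real-world figures from papers and procedurally generated diagrams. Because SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that starts with supervised fine-tuning (SFT) to learn atomic primitives and then moves to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH.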

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.24579
Pdf link: https://arxiv.org/pdf/2603.24579
Abstract Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at this https URL.
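Chinese abstract (translated) Hallucination remains a key bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. Existing hallucination detection methods use an LLM-as-a-judge to verify LLM outputs against retrieved evidence, but they suffer from an inherent confirmation bias in which the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces strict factual alignment through deliberate information asymmetry. MARCH coordinates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level, verifiable atomic propositions. Crucially, the Checker validates these propositions against the retrieved evidence in isolation, without access to the Solver's original output. This carefully designed information asymmetry breaks the cycle of self-confirmation bias. By training the pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Experiments across multiple hallucination benchmarks show that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH performs competitively with powerful closed-source models. MARCH paves a scalable path toward factual self-improvement of LLMs through co-evolution. The code is at this https URL.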
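
The information flow described in this abstract (Solver, then Proposer, then a Checker that never sees the Solver's output) can be pictured with a small sketch. Everything below is a hypothetical stand-in: the three functions are plain Python stubs rather than LLM agents, the sentence split and substring match are toy heuristics, and no reinforcement learning is shown. The only point is the data flow, in which `checker` receives a single claim and the evidence, nothing else.

```python
from typing import Dict, List


def solver(question: str, evidence: List[str]) -> str:
    """Stub for the Solver agent: returns a canned response (hypothetical)."""
    return "Paris is the capital of France. Paris has 90 million residents."


def proposer(response: str) -> List[str]:
    """Stub for the Proposer: split the response into sentence-level claims."""
    return [part.strip() for part in response.split(".") if part.strip()]


def checker(claim: str, evidence: List[str]) -> bool:
    """Stub for the Checker: sees one claim and the evidence only.

    The Solver's full response is deliberately not a parameter here.
    A toy substring match stands in for real verification.
    """
    return any(claim.lower() in doc.lower() for doc in evidence)


def verify_response(question: str, evidence: List[str]) -> Dict[str, bool]:
    """Run Solver -> Proposer -> Checker and return a verdict per claim."""
    response = solver(question, evidence)
    claims = proposer(response)
    return {claim: checker(claim, evidence) for claim in claims}


evidence = ["Paris is the capital of France and its largest city."]
verdicts = verify_response("What is the capital of France?", evidence)
for claim, supported in verdicts.items():
    print(f"{'supported' if supported else 'unsupported'}: {claim}")
```

In this toy run the first claim is found in the evidence and the second is not. In the framework the abstract describes, each stub would be an agent trained jointly with the others, which this sketch does not attempt to model.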

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Authors: Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.24587
Pdf link: https://arxiv.org/pdf/2603.24587
Abstract We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1 - achieving 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.
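Chinese abstract (translated) We introduce DreamerAD, the first latent world model framework to enable efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while preserving visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. Although existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2 s per frame), which prevents high-frequency RL interaction. Our approach exploits denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing, which reduces sampling complexity via recursive multi-resolution step compression; (2) an autoregressive dense reward model that operates directly on latent representations for fine-grained credit assignment; and (3) Gaussian vocabulary sampling for GRPO, which constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.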

Keyword: diffusion policy

No results found.