Generated: 2026-04-27 18:18:46 (UTC+8); arXiv release time: 2026-04-27 20:00 EDT (2026-04-28 08:00 UTC+8)
15 relevant papers today
Keyword: reinforcement learning
Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
结果奖励并不保证可验证或因果重要推理
- Authors: Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2604.22074
- Pdf link: https://arxiv.org/pdf/2604.22074
- Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
- 中文摘要
在思维链推理上进行的基于可验证奖励的强化学习(RLVR)已成为语言模型后训练流程的标准组成部分。一个常见的假设是,通过RLVR训练的推理链可靠地反映了模型如何得出答案。本文提出了两个指标来批判性地审视这一假设:推理因果重要性(CIR),衡量推理词元对最终答案的累积影响;以及推理充分性(SR),衡量验证者能否仅凭推理得出明确答案。通过对Qwen2.5模型系列和ReasoningGym任务的实验,我们发现:(1)虽然RLVR确实提高了任务准确率,但并不能可靠地提升CIR或SR,这使推理在模型表现中的作用受到质疑;(2)在RLVR前进行少量SFT可以缓解CIR和SR偏低的问题;(3)即使没有SFT,也可以通过在基于结果的奖励之上施加辅助CIR/SR奖励来提升CIR和SR。这种联合奖励达到了与RLVR相当的准确率,同时带来了因果上重要且充分的推理。这些结果表明,RLVR并不总能使模型以通常认为的方式依赖推理,但这一问题可以通过对后训练流程的简单修改来解决。
Removing Sandbagging in LLMs by Training with Weak Supervision
通过在弱监督下训练消除大型语言模型中的沙袋策略
- Authors: Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.22082
- Pdf link: https://arxiv.org/pdf/2604.22082
- Abstract
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model's best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.
- 中文摘要
随着人工智能系统开始自动化复杂任务,监督越来越依赖于较弱的模型或有限的人工监督,而这些无法完全验证输出质量。能力强于其监督者的模型可以通过刻意保留实力(sandbagging)利用这一差距,产出看似可接受但未达到其真实能力的工作。即使没有可靠的验证,训练能否激发模型的最佳表现?我们使用经训练而会刻意保留实力的模型生物(model organisms)进行研究,在数学解题、研究生水平科学和竞赛编程任务上测试激发技术。我们发现,当监督微调(SFT)和强化学习(RL)结合时,弱监督训练能够可靠地激发此类模型的能力:在弱演示上进行SFT打破了刻意保留实力的行为,从而使RL能够充分激发性能。两种方法单独使用都不可靠:没有SFT的RL几乎总是导致奖励黑客行为,而非真正的进步。关键在于,这依赖于训练与部署不可区分;当模型能够区分训练和部署时,它们可以在训练期间表现出色,而在之后继续保留实力。我们的结果提供了初步证据,表明训练是缓解sandbagging的可行手段,同时强调了使训练与部署不可区分的重要性。
A Hybrid Reinforcement and Self-Supervised Learning Aided Benders Decomposition Algorithm
一种混合强化学习与自监督学习辅助的Benders分解算法
- Authors: Bernard T. Agyeman, Zhe Li, Ilias Mitrai, Prodromos Daoutidis
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2604.22107
- Pdf link: https://arxiv.org/pdf/2604.22107
- Abstract
We propose a hybrid reinforcement and self-supervised learning framework for accelerating generalized Benders decomposition (GBD). In this framework, a graph-based reinforcement learning agent operates on a bipartite representation of the master problem and, together with a verification mechanism, determines the integer variable assignments that solve the master problem. These assignments are then used as inputs to a KKT-informed neural network, trained via self-supervision to predict primal-dual solutions that approximately satisfy the Karush-Kuhn-Tucker conditions of the subproblem. The predicted solutions are used to construct Benders cuts directly. The framework is evaluated on a mixed-integer nonlinear programming case study, where it achieves a 57.5% reduction in solution time relative to classical GBD while consistently recovering optimal solutions across all test instances.
- 中文摘要
我们提出了一种混合强化与自监督学习框架,用于加速广义本德斯分解(GBD)。在该框架下,基于图的强化学习代理对主问题的二分表示进行操作,并结合验证机制,确定解决主问题的整数变量赋值。这些赋值随后作为输入,输入通过自监督训练的KKT知情神经网络,预测大致满足子问题Karush Kuhn Tucker条件的原始对偶解。预测解被用来直接构造本德割。该框架基于混合整数非线性规划案例研究,相较经典GBD实现了57.5%的求解时间缩短,同时在所有测试实例中持续恢复最优解。
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
不模仿,强化:通过信念细化进行迭代分类
- Authors: Mahdi Kallel, Johannes Tölle, Ahmed Hendawy, Carlo D'Eramo
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.22110
- Pdf link: https://arxiv.org/pdf/2604.22110
- Abstract
Standard supervised classification trains models to imitate the exact labels provided by a perfect oracle. This imitation happens in a single pass, restricting the model to a fixed compute budget even when inputs vary in complexity. Moreover, the rigid training objective forces the model to express absolute certainty on its training data, resulting in overconfident predictions during evaluation. We propose Reinforced Iterative Classification (RIC), which replaces the imitative objective with Reinforcement Learning (RL). RIC deploys a recurrent agent that iteratively updates a predictive distribution over classes, receiving reward for stepwise improvement in prediction quality. The value function provides a natural halting criterion by estimating the remaining scope for improvement. We prove that the iterative formulation recovers the same optimal predictions as cross-entropy while yielding an anytime classifier. On image classification benchmarks, RIC matches the accuracy of supervised baselines with improved calibration and learns to allocate computation adaptively across inputs.
- 中文摘要
标准监督分类训练模型去模仿完美预言机(oracle)提供的精确标签。这种模仿在单次前向传递中完成,即使输入复杂度不同,模型也被限制在固定的计算预算内。此外,僵化的训练目标迫使模型对训练数据表达绝对确定性,导致评估时预测过于自信。我们提出了强化迭代分类(RIC),用强化学习(RL)取代模仿式目标。RIC部署一个循环智能体,迭代更新类别上的预测分布,并因预测质量的逐步提升而获得奖励。价值函数通过估计剩余的改进空间,提供了一个自然的停止准则。我们证明,该迭代表述能恢复与交叉熵相同的最优预测,同时得到一个随时可用(anytime)的分类器。在图像分类基准上,RIC达到了监督基线的准确率,校准更好,并学会在不同输入间自适应地分配计算。
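The RIC abstract describes a reward for stepwise improvement in prediction quality but does not give its exact form. As a minimal sketch, assume the reward at each step is the change in log-probability assigned to the true class. Under that assumption the undiscounted return telescopes to the total improvement between the first and last belief, which is one way to see why an iterative objective could share its optimum with cross-entropy. The function name and the belief trajectory below are hypothetical.

```python
import math

def stepwise_rewards(beliefs, label):
    """Assumed reward: r_t = log p_t(label) - log p_{t-1}(label)."""
    log_probs = [math.log(b[label]) for b in beliefs]
    return [log_probs[t] - log_probs[t - 1] for t in range(1, len(log_probs))]

# Hypothetical belief trajectory over 3 classes; the true class is index 2.
beliefs = [
    [1 / 3, 1 / 3, 1 / 3],   # initial uniform belief
    [0.2, 0.3, 0.5],         # after one refinement step
    [0.1, 0.1, 0.8],         # after two refinement steps
]

rewards = stepwise_rewards(beliefs, label=2)
total_return = sum(rewards)

# The intermediate terms cancel, leaving only final minus initial log-probability.
telescoped = math.log(beliefs[-1][2]) - math.log(beliefs[0][2])
print(rewards, total_return, telescoped)
```

Because the return depends only on the final belief (the initial one is fixed), maximizing it pushes the final distribution toward high log-likelihood of the label, while the per-step rewards give a denser learning signal.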
Optimal sequential decision-making for error propagation mitigation in digital twins
数字孪生中错误传播缓解的最优顺序决策
- Authors: Annice Najafi, Shokoufeh Mirzaei
- Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2604.22168
- Pdf link: https://arxiv.org/pdf/2604.22168
- Abstract
Here, we explore the problem of error propagation mitigation in modular digital twins as a sequential decision process. Building on a companion study that used a Hidden Markov Model (HMM) to infer latent error regimes from surrogate-physics residuals, we develop a Markov Decision Process (MDP) in which the inferred regimes serve as states, corrective interventions serve as actions, and a scalar reward captures the cost-benefit tradeoff between system fidelity and maintenance expense. The baseline transition matrix is extracted from the HMM-learned parameters. We then extend the formulation to a Partially Observable MDP (POMDP) that accounts for the imperfect nature of regime classification by maintaining a belief distribution updated via Bayesian filtering, with the HMM confusion matrix serving as the observation model. Both formulations are solved via dynamic programming and validated through Gillespie stochastic simulation. We then benchmark two model-free reinforcement learning algorithms, Q-learning and REINFORCE, to assess whether effective policies can be learned without explicit model knowledge. A systematic comparison of different intervention policies demonstrates that the MDP policy achieves the highest cumulative reward and fraction of time in nominal operation, while the POMDP recovers approximately 95% of MDP performance under realistic observation noise. Sensitivity analyses across observation quality, repair probability, and discount factor confirm the robustness of these conclusions, and the major gaps in the policy hierarchy are statistically significant at $p < 0.001$. The gap between MDP and POMDP performance quantifies the value of information, providing a principled criterion for investing in improved classification accuracy.
- 中文摘要
本文探讨模块化数字孪生中错误传播缓解问题,作为一种顺序决策过程。基于一项使用隐马尔可夫模型(HMM)从替代物理残差推断潜在误差的配套研究,我们开发了一个马尔可夫决策过程(MDP),其中推断的状态为状态,纠正干预作为动作,并给出一个标量奖励,考虑系统忠实度与维护成本之间的成本效益权衡。基线转移矩阵是从HMM学习的参数中提取的。随后我们将该表述扩展为部分可观测MDP(POMDP),通过保持通过贝叶斯滤波更新的信念分布,以HMM混淆矩阵作为观测模型,以解释体制分类的不完美性。这两种表述都通过动态规划求解,并通过吉莱斯皮随机仿真验证。随后,我们对两种无模型强化学习算法Q-learning和REINFORCE进行了基准测试,以评估是否可以在没有显式模型知识的情况下学习有效策略。系统比较不同干预政策表明,MDP策略在名义操作中实现了最高的累计奖励和最高时间比例,而POMDP在现实观测噪声下恢复了约95%的MDP表现。对观测质量、修复概率和贴现因子的敏感性分析证实了这些结论的稳健性,政策层级中的主要缺口在统计学上显著,$p <0.001$。MDP与POMDP性能的差距量化了信息价值,为投资提升分类准确性提供了原则性标准。
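The two ingredients named in the digital-twin abstract, dynamic programming for the MDP and a Bayesian belief update for the POMDP, can be sketched on a toy problem. Everything below is an assumption for illustration: the two regimes (nominal, degraded), the two actions (do nothing, repair), and all transition, reward, and confusion-matrix numbers are made up and are not taken from the paper.

```python
# P[a][s][t]: probability of moving from state s to state t under action a.
# States: 0 = nominal, 1 = degraded. Actions: 0 = do nothing, 1 = repair.
P = [
    [[0.9, 0.1], [0.0, 1.0]],   # do nothing: a degraded system stays degraded
    [[0.9, 0.1], [0.8, 0.2]],   # repair: a degraded system usually recovers
]
# R[a][s]: immediate reward; repairing costs 0.5 on top of the state reward.
R = [
    [1.0, -1.0],
    [0.5, -1.5],
]
# O[t][o]: probability of observing regime o when the true regime is t.
O = [[0.9, 0.1], [0.2, 0.8]]

def value_iteration(P, R, gamma, tol=1e-10):
    """Standard value iteration; returns optimal values and a greedy policy."""
    n_actions, n_states = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        Q = [[R[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
              for s in range(n_states)] for a in range(n_actions)]
        V_new = [max(Q[a][s] for a in range(n_actions)) for s in range(n_states)]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            policy = [max(range(n_actions), key=lambda a: Q[a][s])
                      for s in range(n_states)]
            return V_new, policy
        V = V_new

def belief_update(belief, action, obs, P, O):
    """Bayes filter: predict with the transition model, correct with O."""
    n = len(belief)
    predicted = [sum(belief[s] * P[action][s][t] for s in range(n)) for t in range(n)]
    unnorm = [predicted[t] * O[t][obs] for t in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

values, policy = value_iteration(P, R, gamma=0.95)
belief = belief_update([0.5, 0.5], action=0, obs=1, P=P, O=O)
print(policy, belief)
```

In this toy setup the greedy policy leaves a nominal system alone and repairs a degraded one, and a single "degraded" observation shifts the belief strongly toward the degraded regime.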
Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning
行为金丝雀:在强化学习微调中审计私有检索上下文的使用情况
- Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz
- Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2604.22191
- Pdf link: https://arxiv.org/pdf/2604.22191
- Abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
- 中文摘要
在代理工作流中,LLM经常处理法律保护、不受进一步训练的检索上下文。然而,审计员目前缺乏可靠的方法来核实提供者是否违反了服务条款,尤其是在强化学习(RL)中将这些数据纳入培训后。虽然标准审计依赖逐字记忆和成员推断,但这些方法对强化学习训练的模型效果不佳,因为强化学习主要影响模型的行为风格,而非具体事实的保留。为弥合这一空白,我们引入了行为金丝雀,这是一种RLFT管道的新审计机制。该框架通过将文档触发与反馈配对,奖励独特的风格反应,从而对偏好数据进行工具化处理,如果这些数据被用于训练,则会诱导潜在的触发条件偏好。实证结果表明,这些行为信号能够检测未经授权的文档条件训练,在1%金丝雀注入率下,在10%的假阳性率下,检测率达到67%(AUROC = 0.756)。更广泛地说,我们的结果确立了行为金丝雀作为RLFT管道的新审计机制,使审计员能够测试训练时间的影响,即使这种影响表现为分布式行为变化而非记忆。
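The Behavioral Canaries abstract reports a detection rate at a fixed false-positive rate. The sketch below shows how such a number is typically computed from audit scores; it assumes a hypothetical scalar score per model (higher means more canary-like behavior), which the abstract does not describe, and the score values are invented.

```python
def tpr_at_fpr(clean_scores, canary_scores, max_fpr):
    """Detection rate when at most max_fpr of clean models may be flagged.

    Assumes 0 <= max_fpr < 1 and a non-empty list of clean scores.
    """
    clean = sorted(clean_scores, reverse=True)
    allowed = int(max_fpr * len(clean))   # clean scores allowed above threshold
    threshold = clean[allowed]            # flag only scores strictly above this
    flagged = sum(s > threshold for s in canary_scores)
    return flagged / len(canary_scores)

# Hypothetical audit scores for models trained without and with canary data.
clean_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
canary_scores = [4, 8, 9.5, 11, 12, 3]

rate = tpr_at_fpr(clean_scores, canary_scores, max_fpr=0.1)
print(rate)
```

With ten clean scores and a 10% budget, exactly one clean model may exceed the threshold, so the threshold sits at the second-highest clean score.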
Learning Control Policies to Provably Satisfy Hard Affine Constraints for Black-Box Hybrid Dynamical Systems
学习控制策略以可证明满足黑盒混合动力系统中的硬仿射约束
- Authors: Aayushi Shrivastava, Kartik Nagpal, Sairam Jinkala, Jean-Baptiste Bouvier, Negar Mehr
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.22244
- Pdf link: https://arxiv.org/pdf/2604.22244
- Abstract
Ensuring safety for black-box hybrid dynamical systems presents significant challenges due to their instantaneous state jumps and unknown explicit nonlinear dynamics. Existing solutions for strict safety constraint satisfaction, like control barrier functions (CBFs) and reachability analysis, rely on direct knowledge of the dynamics. Similarly, safe reinforcement learning (RL) approaches often rely on known system dynamics or merely discourage safety violations through reward shaping. In this work, we want to learn RL policies which provably satisfy affine state constraints in closed loop for black-box hybrid dynamical systems with affine reset maps. Our key insight is forcing the RL policy to be affine and repulsive near the constraint boundaries for the unknown nonlinear dynamics of the system, providing guarantees that the trajectories will not violate the constraint. We further account for constraint violation due to instantaneous state jumps that occur due to impacts or reset maps in the hybrid system by introducing a second repulsive affine region before the reset that prevents post-reset states from violating the constraint. We derive sufficient conditions under which these policies satisfy safety constraints in closed loop. We also compare our approach with state-of-the-art reward shaping and learned-CBF methods on hybrid dynamical systems like the constrained pendulum and paddle juggler environments. In both scenarios, we show that our methodology learns higher quality policies while always satisfying the safety constraints.
- 中文摘要
由于其瞬时状态跳跃和未知的显式非线性动力学,确保黑箱混合动力系统的安全性面临重大挑战。现有的严格安全约束满足方案,如控制障碍函数(CBF)和可达性分析,依赖于对动力学的直接了解。同样,安全强化学习(RL)方法通常依赖已知的系统动态,或仅通过奖励塑造来阻止安全违规。本研究旨在学习能够在闭环中满足仿射状态约束的强化学习策略,适用于具有仿射复位映射的黑箱混合动力系统。我们的关键见解是,强迫强化学习策略在系统未知非线性动力学的约束边界附近呈现仿射和排斥,从而保证轨迹不会违反约束。我们进一步考虑了混合系统中因撞击或复位映射而产生的瞬时状态跳跃,通过在复位前引入第二个排斥仿射区域,防止复位后状态违反约束。我们推导出足够的条件,使这些政策在闭环中满足安全约束。我们还将我们的方法与最先进的奖励塑造法和学习的CBF方法进行比较,适用于如受限摆锤和桨式杂耍环境等混合动力系统。在这两种情景中,我们证明我们的方法学到更高质量的策略,同时始终满足安全约束。
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
超越思维链:重写作为生成多模态嵌入的通用接口
- Authors: Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.22280
- Pdf link: https://arxiv.org/pdf/2604.22280
- Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
- 中文摘要
多模态大型语言模型(MLLM)已成为通用多模态嵌入的有前景基础。最新研究表明,推理驱动的生成多模态嵌入在多个嵌入任务中优于判别性嵌入。然而,思维链(CoT)推理往往会产生冗余的思考步骤,并在更广泛的检索场景中引入总结答案的语义歧义。为解决这一限制,我们提出了重写驱动多模态嵌入(RIME)的统一框架,通过有利于检索的重写共同优化生成和嵌入。同时,我们介绍了交叉模式对齐(CMA),以桥接生成嵌入和判别嵌入空间,实现灵活的互检索,以在效率和准确性之间取得平衡。基于此,我们还引入了精炼强化学习(Refine-RL),将判别性嵌入视为稳定的语义锚点,以指导重写优化。对MMEB-V2、MRMR和UVRB的广泛实验表明,RIME在显著优于以往生成嵌入模型的表现,同时显著缩短了思考时间。
Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
用硬否定的目标塑造:基于强化学习的大型语言模型推荐器的窗口部分AUC优化
- Authors: Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, Xu Liu, Wanli Ma, Junfeng Pan, Linhong Zhu, Fuli Feng
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2604.22504
- Pdf link: https://arxiv.org/pdf/2604.22504
- Abstract
Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [$\alpha,\alpha+d$] to more directly align with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
- 中文摘要
强化学习(RL)通过对比正样本和负样本物品,有效优化基于大型语言模型(LLM)的推荐器。从经验上看,使用束搜索负样本训练始终优于使用随机负样本,但其机制尚未被充分理解。我们通过分析诱导出的优化目标来弥补这一空白,并证明:(i)在二元奖励反馈下,利用群体相对策略优化(GRPO)优化LLM推荐器,理论上等价于最大化ROC曲线下面积(AUC),而AUC常与Top-$K$推荐不一致;(ii)用束搜索负样本替代随机负样本,会使目标向部分AUC重塑,从而改善与Top-$K$指标的对齐度。基于这一观点,我们引入了窗口部分AUC(WPAUC),将假阳性率(FPR)限制在窗口[$\alpha,\alpha+d$]内,以更直接地对齐Top-$K$指标。我们还提出了一种高效的阈值调整窗口重加权(TAWin)强化学习方法用于其优化,实现对目标Top-$K$性能的明确控制。在四个真实世界数据集上的实验验证了该理论,并取得了稳定的最先进性能。
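To make the contrast between full AUC and a windowed partial AUC concrete, the sketch below uses a simple empirical estimator: full AUC counts all positive-negative pairs, while the windowed version keeps only the negatives whose rank (highest score first) falls in the FPR window $[\alpha, \alpha+d]$. This estimator is an assumption for illustration; the paper's WPAUC definition and its TAWin reweighting may differ, and the scores are invented.

```python
def auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def windowed_partial_auc(pos, neg, alpha, d):
    """AUC restricted to negatives ranked inside the FPR window [alpha, alpha + d]."""
    ranked = sorted(neg, reverse=True)                 # hardest negatives first
    lo = int(alpha * len(ranked))
    hi = max(lo + 1, int((alpha + d) * len(ranked)))   # keep at least one negative
    return auc(pos, ranked[lo:hi])

# Hypothetical model scores for 2 relevant items and 8 irrelevant ones.
pos = [0.9, 0.6]
neg = [0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

full = auc(pos, neg)
partial = windowed_partial_auc(pos, neg, alpha=0.0, d=0.25)
print(full, partial)
```

Here the full AUC looks healthy because most negatives are easy, while the window over the two hardest negatives exposes that one relevant item is outranked by both, which is the kind of error that matters for Top-$K$ lists.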
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
SOLAR-RL:半在线长期视野作业强化学习
- Authors: Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.22558
- Pdf link: https://arxiv.org/pdf/2604.22558
- Abstract
As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
- 中文摘要
随着多模态大型语言模型(MLLM)的成熟,图形界面代理正从静态交互向复杂导航演进。虽然强化学习(RL)已成为训练MLLM代理处理动态图形界面任务的有前景范式,但其有效应用面临一个难题。标准离线强化学习通常依赖静态的步级数据,忽视了任务完成率和执行质量等全局轨迹语义。相反,在线强化学习捕捉了长期动态,但存在高交互成本和潜在的环境不稳定性。为弥合这一差距,我们提出了SOLAR-RL(半在线长期视野作业强化学习)。我们的框架不再仅依赖昂贵的在线互动,而是将全球发展轨迹的洞察直接整合到线下学习过程中。具体来说,我们从静态数据重建多样化的推广候选方案,利用每步效度信号检测第一个失败点,并追溯性地分配密集的步骤级奖励,配合目标对齐的形状,反映轨迹级执行质量,有效模拟在线反馈且不增加交互成本。大量实验表明,SOLAR-RL相较于强基线显著提升了长视野任务完成率和鲁棒性,为自主图形界面导航提供了样本高效的解决方案。
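The SOLAR-RL abstract mentions detecting the first failure point from per-step validity signals and then assigning dense step-level rewards retroactively. A minimal sketch of that idea follows; the specific shaping rule (progress-scaled bonus before the failure, a penalty at the failure, zero afterwards) and all parameter names are hypothetical and are not taken from the paper.

```python
def assign_step_rewards(validity, success_bonus=1.0, failure_penalty=-1.0):
    """Return (first_failure_index, per-step rewards) for one offline trajectory.

    validity[t] is True when step t is judged valid; the list must be non-empty.
    If no step fails, first_failure_index equals len(validity).
    """
    n = len(validity)
    first_fail = next((t for t, ok in enumerate(validity) if not ok), n)
    progress = first_fail / n   # fraction of the trajectory completed before failing
    rewards = []
    for t in range(n):
        if t < first_fail:
            rewards.append(success_bonus * progress)   # credit scaled by progress
        elif t == first_fail:
            rewards.append(failure_penalty)            # blame the first bad step
        else:
            rewards.append(0.0)                        # later steps carry no signal
    return first_fail, rewards

fail_idx, rewards = assign_step_rewards([True, True, False, True])
print(fail_idx, rewards)
```

The design choice illustrated here is that every step's reward depends on a trajectory-level quantity (how far the rollout got), which is how step-level offline data can carry some global signal without new environment interaction.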
Learning Evidence Highlighting for Frozen LLMs
冻结大型语言模型的学习证据高亮
- Authors: Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.22565
- Pdf link: https://arxiv.org/pdf/2604.22565
- Abstract
Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
- 中文摘要
大型语言模型(LLMs)推理能力强,但当证据被埋藏在漫长、嘈杂的环境中时,往往会遗漏。我们介绍了HiLight,一种证据重点框架,将证据选择与推理解耦,适用于冻结的LLM求解器。HiLight通过训练一个轻量级的Emphasis Actor,在未修改的上下文中,在关键区间插入最小的高亮标签,避免压缩或重写输入,避免丢弃或扭曲证据。冻结的求解器随后对强调的输入进行下游推理。我们将高亮定位为一个监督较弱的决策问题,并仅利用求解器的任务奖励通过强化学习优化演员,无需证据标签,也无需访问或修改求解器。在顺序推荐和长上下文问答中,HiLight 持续提升性能优于强有力的提示和自动提示优化基线。所学到的重点策略将零样本转移到较小和较大的未见求解器族,包括基于API的求解器,表明演员捕捉的是真实且可复用的证据结构,而非过度拟合到单一骨干网。
Adversarial Co-Evolution of Malware and Detection Models: A Bilevel Optimization Perspective
恶意软件与检测模型的对抗性共进:双层优化视角
- Authors: Olha Jurečková, Martin Jureček, Matouš Kozák, Róbert Lórencz
- Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.22569
- Pdf link: https://arxiv.org/pdf/2604.22569
- Abstract
Machine learning-based malware detectors are increasingly vulnerable to adversarial examples. Traditional defenses, such as one-shot adversarial training, often fail against adaptive attackers who use reinforcement learning to bypass detection. This paper proposes a robust defense framework based on bilevel optimization, explicitly modeling the strategic interaction between a defender and an attacker as an adversarial co-evolutionary process. We evaluate our approach using the MAB-malware framework against three distinct malware families: Mokes, Strab, and DCRat. Our experimental results demonstrate that while standard classifiers and basic adversarial retraining often remain vulnerable, showing evasion rates as high as 90%, the proposed bilevel optimization approach consistently achieves near-total immunity, reducing evasion rates to 0-1.89%. Furthermore, the iterative framework significantly increases the attacker's query complexity, raising the average cost of successful evasion by up to two orders of magnitude. These findings suggest that modeling the iterative cycle of attack and defense through bilevel optimization is essential for developing resilient malware detection systems capable of withstanding evolving adversarial threats.
- 中文摘要
基于机器学习的恶意软件检测器越来越容易受到对抗性案例的影响。传统的防御方法,如一次性对抗训练,常常无法抵挡利用强化学习绕过检测的自适应攻击者。本文提出了基于双层优化的稳健防御框架,明确建模防御者与攻击者之间的战略互动,作为一种对抗性的共进化过程。我们利用MAB恶意软件框架评估了针对三个不同恶意软件家族:Mokes、Strab和DCRat的方法。我们的实验结果表明,虽然标准分类器和基础对抗性再训练常常存在脆弱性,规避率高达90%,但提出的双层优化方法始终能实现近乎完全免疫,将规避率降至0-1.89%。此外,迭代框架显著增加了攻击者的查询复杂度,使成功规避的平均成本提高了多达两个数量级。这些发现表明,通过双层优化建模攻防循环对于开发能够抵御不断演变的对手威胁的韧性恶意软件检测系统至关重要。
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
无言思考:高效的潜在推理与抽象思维链
- Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2604.22709
- Pdf link: https://arxiv.org/pdf/2604.22709
- Abstract
While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose $\textbf{Abstract Chain-of-Thought}$, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to $11.6\times$ fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
- 中文摘要
虽然长而显式的思维链(CoT)在复杂推理任务中已被证明有效,但在推理过程中生成它们的成本较高。非语言推理方法利用连续表征缩短了生成长度,但其表现仍落后于语言化的CoT。我们提出了$\textbf{抽象思维链}$(Abstract Chain-of-Thought),这是一种离散的潜在推理后训练机制:语言模型在生成响应前,从保留词表中生成一小串词元,以代替自然语言的CoT。为了使此前未见的“抽象”词元变得有用,我们引入了一个策略迭代式的预热循环,该循环交替进行:(i)通过掩蔽从语言化CoT形成瓶颈并进行监督微调;(ii)自蒸馏,即训练模型在码本约束解码下仅凭提示生成抽象词元。预热后,我们在约束解码下通过热启动强化学习优化抽象序列的生成。Abstract-CoT 的推理词元最多减少 $11.6\times$,同时在数学推理、指令跟随和多跳推理上表现相当,并能泛化到不同的语言模型家族。我们还发现抽象词表上出现了涌现的幂律分布,类似于自然语言中的分布,并随训练阶段演变。我们的发现凸显了训练后潜在推理机制的潜力,这类机制可通过习得的抽象推理语言实现高效推理。
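The Abstract-CoT abstract relies on constrained decoding, where generation is restricted to a reserved vocabulary. One generic way to do this is to mask the logits of every token outside the allowed set before picking the next token. The greedy single-step sketch below illustrates that masking only; the vocabulary size, token ids, and logit values are hypothetical, and the paper's actual decoding procedure may differ.

```python
import math

def constrained_greedy_step(logits, allowed_ids):
    """Pick the highest-scoring token among allowed_ids, ignoring all others."""
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical 6-token vocabulary; ids 4 and 5 are the reserved "abstract" tokens.
logits = [2.0, 3.5, 0.1, 1.2, 0.7, 1.9]
reserved = {4, 5}

unconstrained = max(range(len(logits)), key=logits.__getitem__)
token = constrained_greedy_step(logits, reserved)
print(unconstrained, token)
```

Without the mask the ordinary token 1 would win; with it, the model is forced to choose between the reserved tokens and emits token 5.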
ATRS: Adaptive Trajectory Re-splitting via a Shared Neural Policy for Parallel Optimization
ATRS:通过共享神经策略进行自适应轨迹重拆以实现并行优化
- Authors: Jiajun Yu, Guodong Liu, Li Wang, Pengxiang Zhou, Wentao Liu, Yin He, Chao Xu, Fei Gao, Yanjun Cao
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.22715
- Pdf link: https://arxiv.org/pdf/2604.22715
- Abstract
Parallel trajectory optimization via the Alternating Direction Method of Multipliers (ADMM) has emerged as a scalable approach to long-horizon motion planning. However, existing frameworks typically decompose the problem into parallel subproblems based on a predefined fixed structure. Such structural rigidity often causes optimization stagnation in highly constrained regions, where a few lagging subproblems delay global convergence. A natural remedy is to adaptively re-split these stagnating segments online. Yet, deciding when, where, and how to split exceeds the capability of rule-based heuristics. To this end, we propose ATRS, a novel framework that embeds a shared Deep Reinforcement Learning policy into the parallel ADMM loop. We formulate this adaptive adjustment as a Multi-Agent Shared-Policy Markov Decision Process, where all trajectory segments act as homogeneous agents and share a unified neural policy network. This parameter-sharing architecture endows the system with size invariance, enabling it to handle dynamically changing segment counts during re-splitting and generalize to arbitrary trajectory lengths. Furthermore, our formulation inherently supports zero-shot generalization to unseen environments, as our network relies solely on the internal states of the numerical solver rather than on the geometric features of the environment. To ensure solver stability, a Confidence-Based Election mechanism selects only the most stagnating segment for re-splitting at each step. Extensive simulations demonstrate that ATRS accelerates convergence, reducing the number of iterations by up to 26.0% and the computation time by up to 19.1%. Real-world experiments further confirm its applicability to both large-scale offline global planning and real-time onboard replanning within 35 ms per cycle, with no sim-to-real degradation.
- 中文摘要
通过交替方向乘法(ADMM)进行并行轨迹优化,已成为一种可扩展的长视野运动规划方法。然而,现有框架通常会根据预定义的固定结构将问题分解为并行子问题。这种结构刚性常导致高度受限区域的优化停滞,少数滞后子问题延迟全局收敛。一种自然的解决办法是适应性地重新分配这些停滞的网络细分。然而,决定何时、在哪里、如何拆分,超出了基于规则的启发式方法的能力。为此,我们提出了ATRS这一新框架,将共享的深度强化学习策略嵌入并行ADMM循环中。我们将这种自适应调整表述为多智能体共享策略马尔可夫决策过程,所有轨迹段作为同质代理,共享统一的神经策略网络。这种参数共享架构赋予系统大小不变性,使其能够在重新拆分过程中动态变化的段数,并推广到任意轨迹长度。此外,我们的表述本质上支持零样本推广到看不见的环境,因为我们的网络仅依赖数值求解器的内部状态,而非环境的几何特征。为确保求解器的稳定性,基于置信度的选择机制仅在每步选择最停滞的部分进行重新拆分。大量模拟表明,ATRS加速收敛,最多减少26.0%的迭代次数,计算时间缩短19.1%。真实世界实验进一步证实,其适用于大规模离线全球规划和每周期35毫秒内的实时车载重新规划,且无模拟到真实的劣化。
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
智能世界建模:基础、能力、定律及其延伸
- Authors: Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.22748
- Pdf link: https://arxiv.org/pdf/2604.22748
- Abstract
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
- 中文摘要
随着人工智能系统从生成文本转向通过持续交互实现目标,对环境动态进行建模的能力成为核心瓶颈。操控物体、操作软件、与他者协作或设计实验的智能体需要预测性的环境模型,但“世界模型”一词在不同研究社区中含义各异。我们引入了沿两个轴组织的“层级 × 定律”(levels x laws)分类法。第一个轴定义了三个能力层级:L1 Predictor(预测器),学习单步局部转移算子;L2 Simulator(模拟器),将其组合成遵守领域定律的多步、以动作为条件的推演;以及L3 Evolver(演化器),当预测与新证据不符时自主修正自身模型。第二个轴确定了四种支配定律体系:物理、数字、社会和科学。这些体系决定了世界模型必须满足哪些约束,以及最可能在何处失效。利用该框架,我们综合了400多项工作,总结了100多个代表性系统,涵盖基于模型的强化学习、视频生成、网页和GUI智能体、多智能体社会模拟以及人工智能驱动的科学发现。我们分析了各层级-体系组合下的方法、失效模式和评估实践,提出以决策为中心的评估原则和一个最小可复现评估包,并概述了架构指导、开放问题和治理挑战。由此得到的路线图连接了此前彼此孤立的社区,并规划了一条从被动的下一步预测通向能够模拟并最终重塑智能体所处环境的世界模型的路径。
Keyword: diffusion policy
There is no result