Arxiv Papers of Today

Generated: 2025-10-24 16:28:09 (UTC+8); Arxiv release time: 2025-10-24 20:00 EDT (2025-10-25 08:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

An Integrated Approach to Neural Architecture Search for Deep Q-Networks

Authors: Iman Rahmani, Saman Yazdannik, Morteza Tayefi, Jafar Roshanian
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19872
Pdf link: https://arxiv.org/pdf/2510.19872
Abstract The performance of deep reinforcement learning agents is fundamentally constrained by their neural network architecture, a choice traditionally made through expensive hyperparameter searches and then fixed throughout training. This work investigates whether online, adaptive architecture optimization can escape this constraint and outperform static designs. We introduce NAS-DQN, an agent that integrates a learned neural architecture search controller directly into the DRL training loop, enabling dynamic network reconfiguration based on cumulative performance feedback. We evaluate NAS-DQN against three fixed-architecture baselines and a random search control on a continuous control task, conducting experiments over multiple random seeds. Our results demonstrate that NAS-DQN achieves superior final performance, sample efficiency, and policy stability while incurring negligible computational overhead. Critically, the learned search strategy substantially outperforms both undirected random architecture exploration and poorly-chosen fixed designs, indicating that intelligent, performance-guided search is the key mechanism driving success. These findings establish that architecture adaptation is not merely beneficial but necessary for optimal sample efficiency in online deep reinforcement learning, and suggest that the design of RL agents need not be a static offline choice but can instead be seamlessly integrated as a dynamic component of the learning process itself.

FairGRPO: Fair Reinforcement Learning for Equitable Clinical Reasoning

Authors: Shiqi Dai, Wei Dai, Jiaee Cheong, Paul Pu Liang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19893
Pdf link: https://arxiv.org/pdf/2510.19893
Abstract Medical artificial intelligence systems have achieved remarkable diagnostic capabilities, yet they consistently exhibit performance disparities across demographic groups, causing real-world harm to underrepresented populations. While recent multimodal reasoning foundation models have advanced clinical diagnosis through integrated analysis of diverse medical data, reasoning training via reinforcement learning inherits and often amplifies biases present in training datasets dominated by majority populations. We introduce Fairness-aware Group Relative Policy Optimization (FairGRPO), a hierarchical reinforcement learning approach that promotes equitable learning across heterogeneous clinical populations. FairGRPO employs adaptive importance weighting of advantages based on representation, task difficulty, and data source. To address the common issue of missing demographic labels in the clinical domain, we further employ unsupervised clustering, which automatically discovers latent demographic groups when labels are unavailable. Through comprehensive experiments across 7 clinical diagnostic datasets spanning 5 clinical modalities (X-ray, CT scan, dermoscopy, mammography, and ultrasound), we demonstrate that FairGRPO reduces predictive parity by 27.2% against all vanilla and bias mitigated RL baselines, while improving F1 score by 12.49%. Furthermore, training dynamics analysis reveals that FairGRPO progressively improves fairness throughout optimization, while baseline RL methods exhibit deteriorating fairness as training progresses. Based on FairGRPO, we release FairMedGemma-4B, a fairness-aware clinical VLLM that achieves state-of-the-art performance while demonstrating significantly reduced disparities across demographic groups.

Large Language Model enabled Mathematical Modeling

Authors: Guoyun Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.19895
Pdf link: https://arxiv.org/pdf/2510.19895
Abstract The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets

Authors: Shaocong Ma, Heng Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.19950
Pdf link: https://arxiv.org/pdf/2510.19950
Abstract In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. However, during deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. This mismatch between training and deployment environments can significantly degrade performance. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties, but typically rely on symmetric structures that fail to capture the directional nature of market impact. To address this issue, we develop a novel class of elliptic uncertainty sets. We establish both implicit and explicit closed-form solutions for the worst-case uncertainty under these sets, enabling efficient and tractable robust policy evaluation. Experiments on single-asset and multi-asset trading tasks demonstrate that our method achieves superior Sharpe ratio and remains robust under increasing trade volumes, offering a more faithful and scalable approach to RL in financial markets.

Simultaneous learning of state-to-state minimum-time planning and control

Authors: Swati Dantu, Robert Pěnička, Martin Saska
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.20008
Pdf link: https://arxiv.org/pdf/2510.20008
Abstract This paper tackles the challenge of learning a generalizable minimum-time flight policy for UAVs, capable of navigating between arbitrary start and goal states while balancing agile flight and stable hovering. Traditional approaches, particularly in autonomous drone racing, achieve impressive speeds and agility but are constrained to predefined track layouts, limiting real-world applicability. To address this, we propose a reinforcement learning-based framework that simultaneously learns state-to-state minimum-time planning and control and generalizes to arbitrary state-to-state flights. Our approach leverages Point Mass Model (PMM) trajectories as proxy rewards to approximate the true optimal flight objective and employs curriculum learning to scale the training process efficiently and to achieve generalization. We validate our method through simulation experiments, comparing it against Nonlinear Model Predictive Control (NMPC) tracking PMM-generated trajectories and conducting ablation studies to assess the impact of curriculum learning. Finally, real-world experiments confirm the robustness of our learned policy in outdoor environments, demonstrating its ability to generalize and operate on a small ARM-based single-board computer.

SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph

Authors: Jiazheng Li, Yawei Wang, David Yan, Yijun Tian, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20022
Pdf link: https://arxiv.org/pdf/2510.20022
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards, a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms, requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.
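The limitation the abstract describes can be seen in a minimal sketch of group-relative advantages as commonly used in GRPO-style training: outcome rewards are normalized within the group of trajectories sampled for one prompt, and every step of a trajectory inherits the same scalar. The function names and the `eps` constant are illustrative assumptions; this shows the uniform baseline behavior, not SALT's graph-based assignment.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one prompt's group of trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantage, num_steps):
    """Every step of a trajectory receives the same trajectory-level advantage."""
    return [advantage] * num_steps

# Four trajectories for the same prompt, with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)
# A three-step successful trajectory: good and bad steps get the same credit.
step_advs = broadcast_to_steps(advs[0], num_steps=3)
```

Because `step_advs` holds one repeated value, a harmful step inside a successful trajectory is rewarded as much as a helpful one, which is the entanglement that step-level assignment aims to remove.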

Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards

Authors: Yuwei Cheng, Zifeng Zhao, Haifeng Xu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.20055
Pdf link: https://arxiv.org/pdf/2510.20055
Abstract Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of $\tilde{O}(dH^2\sqrt{T})$, where $d$ is the contextual dimension, $H$ is the number of rounds, and $T$ is the number of customers. Our theoretical findings are validated by simulation experiments.

Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Authors: Mehrdad Ghassabi, Sadra Hakim, Hamidreza Baradaran Kashani, Pedram Rostami
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.20059
Pdf link: https://arxiv.org/pdf/2510.20059
Abstract Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
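For readers unfamiliar with why DPO needs rejected-preferred pairs, here is a minimal sketch of the standard per-pair DPO loss. It takes summed log-probabilities of the preferred and rejected answers under the trained policy and under a frozen reference model. The argument names and the default `beta` are illustrative assumptions, not values taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy already favors the preferred answer
# more strongly than the reference model does, so the loss falls below log(2).
loss_untrained = dpo_loss(0.0, 0.0, 0.0, 0.0)
loss_improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
```

The loss depends only on how much more the policy prefers the chosen answer than the reference does, which is why each training example must contain both a preferred and a rejected response.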

StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

Authors: Jiho Park, Sieun Choi, Jaeyoon Seo, Jihie Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20093
Pdf link: https://arxiv.org/pdf/2510.20093
Abstract Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.

Competition is the key: A Game Theoretic Causal Discovery Approach

Authors: Amartya Roy, Souvik Chakraborty
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20106
Pdf link: https://arxiv.org/pdf/2510.20106
Abstract Causal discovery remains a central challenge in machine learning, yet existing methods face a fundamental gap: algorithms like GES and GraN-DAG achieve strong empirical performance but lack finite-sample guarantees, while theoretically principled approaches fail to scale. We close this gap by introducing a game-theoretic reinforcement learning framework for causal discovery, where a DDQN agent directly competes against a strong baseline (GES or GraN-DAG), always warm-starting from the opponent's solution. This design yields three provable guarantees: the learned graph is never worse than the opponent, warm-starting strictly accelerates convergence, and most importantly, with high probability the algorithm selects the true best candidate graph. To the best of our knowledge, our result makes a first-of-its-kind progress in explaining such finite-sample guarantees in causal discovery: on synthetic SEMs (30 nodes), the observed error probability decays with n, tightly matching theory. On real-world benchmarks including Sachs, Asia, Alarm, Child, Hepar2, Dream, and Andes, our method consistently improves upon GES and GraN-DAG while remaining theoretically safe. Remarkably, it scales to large graphs such as Hepar2 (70 nodes), Dream (100 nodes), and Andes (220 nodes). Together, these results establish a new class of RL-based causal discovery algorithms that are simultaneously provably consistent, sample-efficient, and practically scalable, marking a decisive step toward unifying empirical performance with rigorous finite-sample theory.

BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

Authors: Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.20151
Pdf link: https://arxiv.org/pdf/2510.20151
Abstract As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
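The boundary-generation idea can be illustrated with a small sketch: the model emits only a label and a short starting snippet per segment, and the full segments are recovered by locating those snippets in the source text. This version works on character strings rather than model tokens, and `reconstruct_segments`, the labels, and the example prompt are hypothetical names chosen for illustration, not the paper's implementation.

```python
def reconstruct_segments(text, boundaries):
    """Rebuild (label, segment) pairs from (label, start_snippet) pairs in document order."""
    positions = []
    cursor = 0
    for label, snippet in boundaries:
        idx = text.find(snippet, cursor)
        if idx == -1:
            continue  # a snippet that is absent from the source is dropped, not invented
        positions.append((label, idx))
        cursor = idx + len(snippet)
    segments = []
    for i, (label, start) in enumerate(positions):
        # A segment runs until the next located boundary, or to the end of the text.
        end = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        segments.append((label, text[start:end]))
    return segments

prompt = "You are a helpful assistant.\nContext: {context}\nAnswer in JSON."
segments = reconstruct_segments(
    prompt,
    [("role", "You are"), ("context", "Context:"), ("format", "Answer in")],
)
```

Since every segment is sliced from the original text, the output cannot contain content that is not in the source, and the generated sequence stays short no matter how long the segments are.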

Soft Switching Expert Policies for Controlling Systems with Uncertain Parameters

Authors: Junya Ikemoto
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.20152
Pdf link: https://arxiv.org/pdf/2510.20152
Abstract This paper proposes a simulation-based reinforcement learning algorithm for controlling systems with uncertain and varying system parameters. While simulators are useful for safely learning control policies for physical systems, mitigating the reality gap remains a major challenge. To address the challenge, we propose a two-stage algorithm. In the first stage, multiple control policies are learned for systems with different parameters in a simulator. In the second stage, for a real system, the control policies learned in the first stage are smoothly switched using an online convex optimization algorithm based on observations. Our proposed algorithm is demonstrated through numerical experiments.
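The second stage can be pictured with a minimal sketch of soft switching. The abstract does not name the online convex optimization algorithm, so this example substitutes an exponentiated-gradient (Hedge-style) update as an assumption: mixture weights over the pre-trained expert policies are updated from per-expert losses, and the applied action is the weighted combination. The loss values, the scalar actions, and `eta` are hypothetical.

```python
import math

def soft_switch_action(weights, expert_actions):
    """Blend the experts' proposed actions using the current mixture weights."""
    return sum(w * a for w, a in zip(weights, expert_actions))

def update_weights(weights, losses, eta=1.0):
    """Exponentiated-gradient step: down-weight experts with larger observed loss."""
    unnorm = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three experts trained in simulation under different system parameters.
weights = [1 / 3, 1 / 3, 1 / 3]
losses = [0.9, 0.1, 0.5]  # hypothetical mismatch of each expert with real observations
for _ in range(20):
    weights = update_weights(weights, losses)
action = soft_switch_action(weights, [-1.0, 0.5, 2.0])
```

The weights move continuously toward the best-matching expert instead of jumping, so the applied control signal changes smoothly as evidence about the real system accumulates.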

Reinforcement Learning-based Robust Wall Climbing Locomotion Controller in Ferromagnetic Environment

Authors: Yong Um, Young-Ha Shin, Joon-Ha Kim, Soonpyo Kwon, Hae-Won Park
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.20174
Pdf link: https://arxiv.org/pdf/2510.20174
Abstract We present a reinforcement learning framework for quadrupedal wall-climbing locomotion that explicitly addresses uncertainty in magnetic foot adhesion. A physics-based adhesion model of a quadrupedal magnetic climbing robot is incorporated into simulation to capture partial contact, air-gap sensitivity, and probabilistic attachment failures. To stabilize learning and enable reliable transfer, we design a three-phase curriculum: (1) acquire a crawl gait on flat ground without adhesion, (2) gradually rotate the gravity vector to vertical while activating the adhesion model, and (3) inject stochastic adhesion failures to encourage slip recovery. The learned policy achieves a high success rate, strong adhesion retention, and rapid recovery from detachment in simulation under degraded adhesion. Compared with a model predictive control (MPC) baseline that assumes perfect adhesion, our controller maintains locomotion when attachment is intermittently lost. Hardware experiments with the untethered robot further confirm robust vertical crawling on steel surfaces, maintaining stability despite transient misalignment and incomplete attachment. These results show that combining curriculum learning with realistic adhesion modeling provides a resilient sim-to-real framework for magnetic climbing robots in complex environments.
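The three-phase curriculum can be sketched as a configuration function. The frame convention (z normal to the climbing surface, x pointing up along it), the linear tilt schedule, the `failure_prob` value, and all names are assumptions made for illustration; the abstract specifies only the order of the phases.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def gravity_in_surface_frame(tilt_rad):
    """Gravity in a frame fixed to the surface: tilt 0 is flat ground, pi/2 is a vertical wall."""
    return (-G * math.sin(tilt_rad), 0.0, -G * math.cos(tilt_rad))

def curriculum_config(phase, progress=0.0, failure_prob=0.1):
    """Return simulator settings for a curriculum phase; progress in [0, 1] is used in phase 2."""
    if phase == 1:  # crawl gait on flat ground, no adhesion
        return {"gravity": gravity_in_surface_frame(0.0), "adhesion": False, "failure_prob": 0.0}
    if phase == 2:  # rotate gravity toward vertical with the adhesion model active
        tilt = min(max(progress, 0.0), 1.0) * math.pi / 2
        return {"gravity": gravity_in_surface_frame(tilt), "adhesion": True, "failure_prob": 0.0}
    if phase == 3:  # vertical wall with randomly injected adhesion failures
        return {"gravity": gravity_in_surface_frame(math.pi / 2), "adhesion": True,
                "failure_prob": failure_prob}
    raise ValueError("phase must be 1, 2, or 3")

flat = curriculum_config(1)
halfway = curriculum_config(2, progress=0.5)
wall = curriculum_config(3)
```

Rotating the gravity vector rather than the terrain keeps the robot's contact geometry unchanged while the load on the magnetic feet shifts gradually from normal to tangential.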

Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Authors: Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwi Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20176
Pdf link: https://arxiv.org/pdf/2510.20176
Abstract Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.

Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.20187
Pdf link: https://arxiv.org/pdf/2510.20187
Abstract We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
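The contrast between a binary correctness reward and a value-aware one can be sketched as follows. The abstract does not give the exact reward formula, so the multiplicative form below is an assumption, and the function names and point values are hypothetical.

```python
def binary_reward(is_correct):
    """Correctness-only reward: every question counts the same."""
    return 1.0 if is_correct else 0.0

def value_weighted_reward(is_correct, value):
    """Assumed value-aware reward: a correct answer earns the question's stated value."""
    return value if is_correct else 0.0

def value_weighted_accuracy(results):
    """Share of total available value earned; results is a list of (is_correct, value)."""
    total = sum(v for _, v in results)
    earned = sum(v for ok, v in results if ok)
    return earned / total

# Exam-style example: the model misses the single most valuable question.
results = [(True, 1.0), (False, 5.0), (True, 4.0)]
plain_accuracy = sum(binary_reward(ok) for ok, _ in results) / len(results)
weighted_accuracy = value_weighted_accuracy(results)
```

Here plain accuracy is two out of three while value-weighted accuracy is one half, so a policy optimized for the weighted objective has a stronger incentive to get the five-point question right.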

Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

具有优化确定性等价物的风险规避约束强化学习

Authors: Jane H. Lee, Baturay Saglam, Spyridon Pougkakiotis, Amin Karbasi, Dionysis Kalogerias
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20199
Pdf link: https://arxiv.org/pdf/2510.20199
Abstract Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.
中文摘要 约束优化为处理强化学习（RL）中的冲突目标提供了一个通用框架。在大多数这些环境中，目标（和约束）是通过预期的累积奖励来表达的。然而，这种公式忽略了奖励分布尾部的风险甚至可能的灾难性事件，并且通常不足以用于异常值所涉及的风险至关重要的高风险应用程序。在这项工作中，我们提出了一个风险感知约束RL框架，该框架使用优化的确定性等价物（OCE）在奖励值和时间方面共同表现出每个阶段的鲁棒性属性。我们的框架确保在适当的约束限定条件下，在参数化强拉格朗日对偶性框架内与原始约束问题完全等效，并产生一个简单的算法配方，可以围绕标准 RL 求解器（例如 PPO）进行包装。最后，在共同假设下建立了所提算法的收敛性，并通过几个数值实验验证了该方法的风险感知特性。
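
As background for the OCE terminology, the sketch below evaluates one well-known OCE on a sample of rewards. It uses the standard definition OCE_u(X) = sup_λ { λ + E[u(X − λ)] }, and with the utility u(t) = min(t, 0)/α this recovers the conditional value-at-risk (the mean of the worst α-fraction of rewards). The function name and the sample-based evaluation are assumptions for illustration; this is not the paper's algorithm, which the abstract does not spell out.

```python
def oce_cvar(rewards, alpha):
    """CVaR of a reward sample at level alpha in (0, 1], written as an OCE.

    Uses OCE_u(X) = sup_lam { lam + E[u(X - lam)] } with u(t) = min(t, 0) / alpha.
    The objective is concave and piecewise linear in lam with breakpoints at the
    samples, so the supremum is attained at one of the sample points.
    """
    if not rewards or not 0.0 < alpha <= 1.0:
        raise ValueError("need a non-empty sample and alpha in (0, 1]")
    n = len(rewards)

    def objective(lam):
        return lam + sum(min(x - lam, 0.0) for x in rewards) / (alpha * n)

    return max(objective(lam) for lam in rewards)
```

With alpha = 1 the value reduces to the ordinary sample mean, so the risk-neutral expected-reward objective is the special case at one end of this family.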

High-order Interactions Modeling for Interpretable Multi-Agent Q-Learning

用于可解释多智能体 Q 学习的高阶交互建模

Authors: Qinyu Xu, Yuanyang Zhu, Xuefei Wu, Chunlin Chen
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20218
Pdf link: https://arxiv.org/pdf/2510.20218
Abstract The ability to model interactions among agents is crucial for effective coordination and understanding their cooperation mechanisms in multi-agent reinforcement learning (MARL). However, previous efforts to model high-order interactions have been primarily hindered by the combinatorial explosion or the opaque nature of their black-box network structures. In this paper, we propose a novel value decomposition framework, called Continued Fraction Q-Learning (QCoFr), which can flexibly capture arbitrary-order agent interactions with only linear complexity $\mathcal{O}\left({n}\right)$ in the number of agents, thus avoiding the combinatorial explosion when modeling rich cooperation. Furthermore, we introduce the variational information bottleneck to extract latent information for estimating credits. This latent information helps agents filter out noisy interactions, thereby significantly enhancing both cooperation and interpretability. Extensive experiments demonstrate that QCoFr not only consistently achieves better performance but also provides interpretability that aligns with our theoretical analysis.
中文摘要 在多智能体强化学习（MARL）中，对智能体之间的交互进行建模的能力对于有效协调和理解其合作机制至关重要。然而，以往对高阶交互进行建模的工作主要受阻于组合爆炸或黑盒网络结构的不透明性。在本文中，我们提出了一种新的价值分解框架，称为连分数 Q 学习（QCoFr），它能以仅随智能体数量线性增长的复杂度 $\mathcal{O}\left({n}\right)$ 灵活地捕获任意阶的智能体交互，从而避免了在建模丰富合作时的组合爆炸。此外，我们还引入了变分信息瓶颈，以提取用于估计信用（credit）的潜在信息。这些潜在信息有助于智能体过滤掉有噪声的交互，从而显著增强协作和可解释性。大量实验表明，QCoFr 不仅始终取得更好的性能，而且还提供了与我们的理论分析相符的可解释性。

Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach

基于Max-Min准则的多目标强化学习：博弈论方法

Authors: Woohyeon Byeon, Giseung Park, Jongseong Chae, Amir Leshem, Youngchul Sung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20235
Pdf link: https://arxiv.org/pdf/2510.20235
Abstract In this paper, we propose a provably convergent and practical framework for multi-objective reinforcement learning with a max-min criterion. From a game-theoretic perspective, we reformulate max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game and introduce an efficient algorithm based on mirror descent. Our approach simplifies the policy update while ensuring global last-iterate convergence. We provide a comprehensive theoretical analysis of our algorithm, including iteration complexity under both exact and approximate policy evaluations, as well as sample complexity bounds. To further enhance performance, we modify the proposed algorithm with adaptive regularization. Our experiments demonstrate the convergence behavior of the proposed algorithm in tabular settings, and our implementation for deep reinforcement learning significantly outperforms previous baselines in many MORL environments.
中文摘要 在本文中，我们为采用 max-min 准则的多目标强化学习提出了一个可证明收敛且实用的框架。从博弈论的角度出发，我们将 max-min 多目标强化学习重新表述为双人零和正则化连续博弈，并引入了一种基于镜像下降的高效算法。我们的方法简化了策略更新，同时确保了全局末次迭代（last-iterate）收敛。我们对算法进行了全面的理论分析，包括精确和近似策略评估下的迭代复杂度，以及样本复杂度界。为了进一步提高性能，我们使用自适应正则化修改了所提出的算法。我们的实验展示了所提出的算法在表格型（tabular）设置中的收敛行为，并且我们的深度强化学习实现在许多 MORL 环境中明显优于以前的基线。

Optimistic Task Inference for Behavior Foundation Models

行为基础模型的乐观任务推理

Authors: Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20264
Pdf link: https://arxiv.org/pdf/2510.20264
Abstract Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at this https URL.
中文摘要 行为基础模型（BFM）能够为测试时直接指定的任何奖励函数检索高性能策略，这通常称为零样本强化学习（RL）。虽然这在计算方面是一个非常高效的过程，但在数据方面可能不太高效：作为一个标准假设，BFM 需要在一个规模不可忽略的推理数据集上计算奖励，这要么假设可以获得奖励的函数形式，要么需要大量的标注工作。为了缓解这些限制，我们纯粹通过测试时与环境的交互来解决任务推理问题。我们提出了 OpTI-BFM，这是一种乐观决策准则，它直接对奖励函数的不确定性进行建模，并指导 BFM 为任务推理收集数据。在理论上，我们通过与线性老虎机（linear bandit）的置信上界算法建立直接联系，为训练良好的 BFM 给出了遗憾界（regret bound）。在实验上，我们在既有的零样本基准上评估了 OpTI-BFM，并观察到它使基于后继特征的 BFM 能够在少数几个回合（episode）内识别并优化未见过的奖励函数，且计算开销极小。代码可在此 https URL 中找到。
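
For readers unfamiliar with the upper-confidence algorithms for linear bandits that the abstract refers to, the following is a minimal, generic sketch of such a rule in plain Python. It is not OpTI-BFM itself: the class name `LinUCB`, the fixed bonus scale `beta`, and the ridge parameter `reg` are illustrative assumptions. The idea is to keep a ridge-regression estimate of a linear reward and to score each candidate feature vector by its estimated reward plus an uncertainty bonus.

```python
import math


class LinUCB:
    """Ridge-regression estimate of a linear reward plus an optimism bonus."""

    def __init__(self, dim, beta=1.0, reg=1.0):
        self.dim = dim
        self.beta = beta
        # Inverse of the regularized design matrix A = reg * I + sum(x x^T).
        self.a_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * feature

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.a_inv]

    def theta(self):
        """Current ridge estimate A^{-1} b of the reward weights."""
        return self._matvec(self.b)

    def score(self, x):
        """Upper confidence bound: x^T theta + beta * sqrt(x^T A^{-1} x)."""
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, self._matvec(x))))
        return mean + self.beta * width

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of A^{-1}, then accumulate b."""
        ax = self._matvec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, ax))
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

After one observation along a direction, the bonus in that direction shrinks while unexplored directions keep their full bonus, which is what drives data collection toward informative interactions.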

ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows

ResearchGPT：面向端到端计算机科学研究工作流程的 LLM 基准测试与训练

Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Ziheng Qin, Bangyuan Zhu, Shengbin Huang, Xuanlei Zhao, Panpan Zhang, Xiaojiang Peng, Yuzhang Shang, Jianfei Yang, Zheng Zhu, Tianlong Chen, Zhangyang Wang, Kai Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20279
Pdf link: https://arxiv.org/pdf/2510.20279
Abstract As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for evaluating AI's ability to assist scientific research, and CS-50k, a large-scale training dataset. Extensive experiments demonstrate that CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k with supervised training and reinforcement learning demonstrate substantial improvements. Even 7B-scale models, when properly trained, outperform many larger proprietary systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that making AI models better research assistants relies more on domain-aligned training with high-quality data than on pretraining scale or general benchmark performance. We release CS-4k and CS-50k in the hope of fostering AI systems as reliable collaborators in CS research.
中文摘要 随着大型语言模型（LLM）的进步，它们在科学中所扮演角色的终极愿景正在浮现：我们可以构建一个人工智能合作者，在整个科学研究过程中有效地协助人类。我们将这个设想的系统称为 ResearchGPT。鉴于科学研究要经历多个相互依赖的阶段，实现这一愿景需要严格的基准来评估端到端工作流程，而不是孤立的子任务。为此，我们贡献了 CS-54k，这是一个由 14k 篇 CC 许可论文构建的计算机科学领域科学问答对高质量语料库。它通过一个可扩展的、以论文为依据的流水线构建，该流水线将检索增强生成（RAG）与多阶段质量控制相结合，以确保内容有事实依据。从这个统一的语料库中，我们得出了两个互补的子集：CS-4k，一个精心策划的基准，用于评估人工智能辅助科学研究的能力；以及 CS-50k，一个大规模训练数据集。大量实验表明，CS-4k 将最先进的 LLM 划分为不同的能力层级。在 CS-50k 上经过监督训练和强化学习训练的开放模型显示出显著的改进。即使是 7B 规模的模型，如果训练得当，性能也优于许多更大的专有系统，例如 GPT-4.1、GPT-4o 和 Gemini 2.5 Pro。这表明，使人工智能模型成为更好的研究助理更多地依赖于使用高质量数据的领域对齐训练，而不是预训练规模或通用基准性能。我们发布 CS-4k 和 CS-50k，希望将人工智能系统培养成 CS 研究的可靠合作者。

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

UI-Ins：通过多视角指令即推理增强 GUI 定位（grounding）

Authors: Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20286
Pdf link: https://arxiv.org/pdf/2510.20286
Abstract GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released in this https URL.
中文摘要 GUI 定位（grounding）将自然语言指令映射到可操作的 UI 元素，是 GUI 智能体的核心能力。先前的工作在很大程度上将指令视为用户意图的静态代理，忽略了指令多样性和质量对定位性能的影响。通过对现有定位数据集的仔细研究，我们发现其指令的缺陷率为 23.3%，并表明在推理时利用指令多样性可带来高达 76% 的相对性能提升。在本文中，我们引入了指令即推理（Instruction-as-Reasoning）范式，将指令视为提供不同视角的动态分析路径，并使模型能够在推理过程中选择最有效的路径。为了实现这一目标，我们提出了一个两阶段的训练框架：先对合成的多样化指令进行监督微调（SFT）以灌输多视角推理，然后通过强化学习（RL）优化路径的选择和组合。我们得到的模型 UI-Ins-7B 和 UI-Ins-32B 在五个具有挑战性的定位基准上取得了最先进的结果，并表现出涌现推理能力，能在推理时有选择地组合和合成新的指令路径。特别是，UI-Ins-32B 获得了最佳的定位精度，在 UI-I2E-Bench 上得分为 87.3%，在 ScreenSpot-Pro 上得分为 57.0%，在 MMBench-GUI L2 上得分为 84.9%。此外，我们的模型展示了强大的智能体潜力，使用 UI-Ins-7B 作为执行器在 AndroidWorld 上实现了 74.1% 的成功率。我们的深入分析揭示了其他见解，例如如何设计推理以增强而不是阻碍定位性能，以及我们的方法如何缓解 SFT+RL 框架中的策略崩溃。所有代码和模型检查点都将在此 https URL 中公开发布。

Moving or Predicting? RoleAware-MAPP: A Role-Aware Transformer Framework for Movable Antenna Position Prediction to Secure Wireless Communications

移动还是预测？RoleAware-MAPP：用于可移动天线位置预测的角色感知 Transformer 框架，以保障无线通信安全

Authors: Wenxu Wang, Xiaowu Liu, Wei Gong, Yujia Zhao, Kaixuan Li, Qixun Zhang, Zhiyong Feng, Kan Yu
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2510.20293
Pdf link: https://arxiv.org/pdf/2510.20293
Abstract Movable antenna (MA) technology provides a promising avenue for actively shaping wireless channels through dynamic antenna positioning, thereby enabling electromagnetic radiation reconstruction to enhance physical layer security (PLS). However, its practical deployment is hindered by two major challenges: the high computational complexity of real-time optimization and a critical temporal mismatch between slow mechanical movement and rapid channel variations. Although data-driven methods have been introduced to alleviate online optimization burdens, they are still constrained by suboptimal training labels derived from conventional solvers or high sample complexity in reinforcement learning. More importantly, existing learning-based approaches often overlook communication-specific domain knowledge, particularly the asymmetric roles and adversarial interactions between legitimate users and eavesdroppers, which are fundamental to PLS. To address these issues, this paper reformulates the MA positioning problem as a predictive task and introduces RoleAware-MAPP, a novel Transformer-based framework that incorporates domain knowledge through three key components: role-aware embeddings that model user-specific intentions, physics-informed semantic features that encapsulate channel propagation characteristics, and a composite loss function that strategically prioritizes secrecy performance over mere geometric accuracy. Extensive simulations under 3GPP-compliant scenarios show that RoleAware-MAPP achieves an average secrecy rate of 0.3569 bps/Hz and a strictly positive secrecy capacity of 81.52%, outperforming the strongest baseline by 48.4% and 5.39 percentage points, respectively, while maintaining robust performance across diverse user velocities and noise conditions.
中文摘要 可移动天线（MA）技术为通过动态天线定位主动塑造无线信道提供了一条有前途的途径，从而实现电磁辐射重构，以增强物理层安全性（PLS）。然而，其实际部署受到两大挑战的阻碍：实时优化的高计算复杂性，以及缓慢的机械运动与快速信道变化之间的严重时间失配。尽管已经引入了数据驱动的方法来减轻在线优化负担，但它们仍然受限于来自传统求解器的次优训练标签或强化学习中的高样本复杂度。更重要的是，现有的基于学习的方法往往忽视了通信领域特有的知识，特别是合法用户和窃听者之间的不对称角色和对抗交互，而这正是 PLS 的基础。为了解决这些问题，本文将 MA 定位问题重新表述为预测任务，并引入了 RoleAware-MAPP，这是一个基于 Transformer 的新型框架，它通过三个关键组件整合了领域知识：对用户特定意图进行建模的角色感知嵌入、封装信道传播特征的物理信息语义特征，以及战略性地优先考虑保密性能而不是单纯几何精度的复合损失函数。在符合 3GPP 的场景下进行的大量仿真表明，RoleAware-MAPP 实现了 0.3569 bps/Hz 的平均保密速率和 81.52% 的严格正保密容量，分别比最强基线高出 48.4% 和 5.39 个百分点，同时在不同的用户速度和噪声条件下保持稳健的性能。
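
The two reported metrics can be read with the help of a small sketch. It assumes the textbook definition of the secrecy rate as the positive part of the gap between the legitimate user's and the eavesdropper's channel capacities, and it reads "strictly positive secrecy capacity" as the fraction of channel realizations with a positive secrecy rate. Both readings, and the function names, are assumptions; the abstract does not define the metrics.

```python
import math


def secrecy_rate(snr_user: float, snr_eve: float) -> float:
    """Secrecy rate in bps/Hz: [log2(1 + SNR_user) - log2(1 + SNR_eve)]^+."""
    return max(0.0, math.log2(1.0 + snr_user) - math.log2(1.0 + snr_eve))


def summarize(snr_pairs):
    """Return (average secrecy rate, fraction of realizations with rate > 0).

    snr_pairs: list of (snr_user, snr_eve) tuples in linear scale, not dB.
    """
    rates = [secrecy_rate(user, eve) for user, eve in snr_pairs]
    average = sum(rates) / len(rates)
    positive_fraction = sum(1 for r in rates if r > 0.0) / len(rates)
    return average, positive_fraction
```

For example, `summarize([(3.0, 1.0), (1.0, 3.0)])` averages one realization with a 1 bps/Hz advantage and one where the eavesdropper has the better channel.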

Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses

增强深度强化学习的安全性：对抗性攻击和防御的综合调查

Authors: Wu Yichao, Wang Yirui, Ding Panpan, Wang Hailong, Zhu Bingqian, Liu Chun
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20314
Pdf link: https://arxiv.org/pdf/2510.20314
Abstract With the wide application of deep reinforcement learning (DRL) techniques in complex fields such as autonomous driving, intelligent manufacturing, and smart healthcare, how to improve its security and robustness in dynamic and changeable environments has become a core issue in current research. Especially in the face of adversarial attacks, DRL may suffer serious performance degradation or even make potentially dangerous decisions, so it is crucial to ensure its stability in security-sensitive scenarios. In this paper, we first introduce the basic framework of DRL and analyze the main security challenges faced in complex and changing environments. In addition, this paper proposes an adversarial attack classification framework based on perturbation type and attack target and reviews the mainstream adversarial attack methods against DRL in detail, including attacks that perturb the state space, action space, reward function and model space. To effectively counter the attacks, this paper systematically summarizes various current robustness training strategies, including adversarial training, competitive training, robust learning, adversarial detection, defense distillation and other related defense techniques. We also discuss the advantages and shortcomings of these methods in improving the robustness of DRL. Finally, this paper looks into the future research direction of DRL in adversarial environments, emphasizing the research needs in terms of improving generalization, reducing computational complexity, and enhancing scalability and explainability, aiming to provide valuable references and directions for researchers.
中文摘要 随着深度强化学习（DRL）技术在自动驾驶、智能制造、智慧医疗等复杂领域的广泛应用，如何在动态多变的环境中提高其安全性和鲁棒性已成为当前研究的核心课题。特别是在面对对抗性攻击时，DRL可能会遭受严重的性能下降，甚至做出潜在的危险决策，因此确保其在安全敏感场景中的稳定性至关重要。在本文中，我们首先介绍了DRL的基本框架，并分析了在复杂多变的环境中面临的主要安全挑战。此外，提出了一种基于扰动类型和攻击目标的对抗攻击分类框架，并详细综述了针对DRL的主流对抗攻击方法，包括扰动状态空间、动作空间、奖励函数和模型空间等多种攻击方式。为了有效应对攻击，本文系统地总结了当前各种鲁棒性训练策略，包括对抗训练、竞争训练、鲁棒学习、对抗检测、防御蒸馏等相关防御技术，并讨论了这些方法在提高DRL鲁棒性方面的优缺点。最后，本文展望了DRL在对抗性环境中的未来研究方向，强调了DRL在提高泛化性、降低计算复杂度、增强可扩展性和可解释性等方面的研究需求，旨在为研究人员提供有价值的参考和方向。

Teaching Language Models to Reason with Tools

教语言模型使用工具进行推理

Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20342
Pdf link: https://arxiv.org/pdf/2510.20342
Abstract Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this https URL.
中文摘要 像 OpenAI-o1 这样的大型推理模型（LRM）在自然语言推理方面显示出令人印象深刻的能力。然而，这些模型在处理复杂的数学运算时经常表现出低效或不准确。虽然集成代码解释器（CI）等计算工具提供了一个有前途的解决方案，但它带来了一个关键挑战：模型的内部概率推理与 CI 提供的外部确定性知识之间的冲突，这通常会导致模型进行无效的反复思考。为了克服这个问题，我们引入了 CoRT（Code-Optimized Reasoning Training，代码优化推理训练），这是一个后训练框架，旨在教 LRM 有效利用 CI。我们提出了 Hint-Engineering，这是一种新的数据合成策略，它战略性地在推理路径内的最佳位置注入多样化的提示。这种方法生成专门为优化 LRM-CI 交互而定制的高质量、代码集成的推理数据。使用这种方法，我们合成了 30 个高质量样本，通过监督微调对参数规模从 1.5B 到 32B 的模型进行后训练。CoRT 通过采用拒绝采样和强化学习，进一步完善了外部 CI 使用和内部思考的多轮交错。我们的实验评估证明了 CoRT 的有效性，在五个具有挑战性的数学推理数据集上，DeepSeek-R1-Distill-Qwen-32B 和 DeepSeek-R1-Distill-Qwen-1.5B 分别获得了 4% 和 8% 的绝对提升。此外，CoRT 显著提高了效率，与纯自然语言推理基线相比，32B 模型的 token 使用量减少了约 30%，1.5B 模型的 token 使用量减少了约 50%。模型和代码可在以下位置获得：此 https URL。

Multi-Modal Decentralized Reinforcement Learning for Modular Reconfigurable Lunar Robots

模块化可重构月球机器人的多模态分散强化学习

Authors: Ashutosh Mishra, Shreya Santra, Elian Neppel, Edoardo M. Rossi Lombardi, Shamistan Karimov, Kentaro Uno, Kazuya Yoshida
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.20347
Pdf link: https://arxiv.org/pdf/2510.20347
Abstract Modular reconfigurable robots suit task-specific space operations, but the combinatorial growth of morphologies hinders unified control. We propose a decentralized reinforcement learning (Dec-RL) scheme where each module learns its own policy: wheel modules use Soft Actor-Critic (SAC) for locomotion and 7-DoF limbs use Proximal Policy Optimization (PPO) for steering and manipulation, enabling zero-shot generalization to unseen configurations. In simulation, the steering policy achieved a mean absolute error of 3.63° between desired and induced angles; the manipulation policy plateaued at 84.6% success on a target-offset criterion; and the wheel policy cut average motor torque by 95.4% relative to baseline while maintaining 99.6% success. Lunar-analogue field tests validated zero-shot integration for autonomous locomotion, steering, and preliminary alignment for reconfiguration. The system transitioned smoothly among synchronous, parallel, and sequential modes for policy execution, without idle states or control conflicts, indicating a scalable, reusable, and robust approach for modular lunar robots.
中文摘要 模块化可重构机器人适合特定任务的空间操作，但形态的组合式增长阻碍了统一控制。我们提出了一种去中心化强化学习（Dec-RL）方案，其中每个模块学习自己的策略：车轮模块使用 Soft Actor-Critic（SAC）进行运动，7-DoF 肢体使用近端策略优化（PPO）进行转向和操纵，从而实现对未见过的配置的零样本泛化。在仿真中，转向策略在期望角和诱导角之间实现了 3.63° 的平均绝对误差；操纵策略在目标偏移标准上的成功率稳定在 84.6%；车轮策略将平均电机扭矩相对于基线降低了 95.4%，同时保持了 99.6% 的成功率。月球模拟场地测试验证了自主运动、转向以及重构初步对准的零样本集成。该系统在同步、并行和顺序的策略执行模式之间平稳过渡，没有空闲状态或控制冲突，表明这是一种适用于模块化月球机器人的可扩展、可重用且稳健的方法。

Ask a Strong LLM Judge when Your Reward Model is Uncertain

当您的奖励模型不确定时，请问一位强大的 LLM 评委

Authors: Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20369
Pdf link: https://arxiv.org/pdf/2510.20369
Abstract The reward model (RM) plays a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
中文摘要 奖励模型（RM）在用于对齐大型语言模型（LLM）的人类反馈强化学习（RLHF）中发挥着关键作用。然而，根据人类偏好训练的经典 RM 容易受到奖励黑客攻击（reward hacking），并且对分布外（OOD）输入的泛化能力很差。相比之下，具备推理能力的强大 LLM 评判者即使没有额外训练也表现出卓越的泛化能力，但推理成本明显更高，限制了它们在在线 RLHF 中的适用性。在这项工作中，我们提出了一种基于不确定性的路由框架，用强大但成本高昂的 LLM 评判者高效地补充快速的 RM。我们的方法将策略梯度（PG）方法中的优势估计表述为成对偏好分类，从而能够以有原则的不确定性量化来指导路由。不确定的样本对被转发给 LLM 评判者，而置信度高的样本对则由 RM 评估。在 RM 基准上的实验表明，在相同成本下，我们基于不确定性的路由策略显著优于随机调用评判者，下游对齐结果也展示了其在改进在线 RLHF 方面的有效性。
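
The routing idea can be sketched in a few lines of Python. The sketch assumes a Bradley-Terry style preference probability computed from two scalar RM scores and uses the distance of that probability from 0.5 as a stand-in uncertainty measure; the paper's actual uncertainty quantification is not described in the abstract. The `judge` callable, the `threshold` value, and the function names are hypothetical.

```python
import math


def preference_prob(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred, from the RM score gap."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def route_pair(score_a, score_b, judge, threshold=0.1):
    """Return (a_is_preferred, source) for one response pair.

    judge: zero-argument callable standing in for a costly LLM judge call;
    it returns True when A is preferred. It is only invoked when the RM's
    preference probability lies within `threshold` of 0.5.
    """
    p = preference_prob(score_a, score_b)
    if abs(p - 0.5) < threshold:
        return judge(), "judge"
    return p > 0.5, "rm"
```

Raising `threshold` sends more pairs to the judge, so it acts as the knob that trades label quality against inference cost.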

NeuralTouch: Neural Descriptors for Precise Sim-to-Real Tactile Robot Control

NeuralTouch：用于精确模拟到真实触觉机器人控制的神经描述符

Authors: Yijiong Lin, Bowen Deng, Chenghua Lu, Max Yang, Efi Psomopoulou, Nathan F. Lepora
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.20390
Pdf link: https://arxiv.org/pdf/2510.20390
Abstract Grasping accuracy is a critical prerequisite for precise object manipulation, often requiring careful alignment between the robot hand and object. Neural Descriptor Fields (NDF) offer a promising vision-based method to generate grasping poses that generalize across object categories. However, NDF alone can produce inaccurate poses due to imperfect camera calibration, incomplete point clouds, and object variability. Meanwhile, tactile sensing enables more precise contact, but existing approaches typically learn policies limited to simple, predefined contact geometries. In this work, we introduce NeuralTouch, a multimodal framework that integrates NDF and tactile sensing to enable accurate, generalizable grasping through gentle physical interaction. Our approach leverages NDF to implicitly represent the target contact geometry, from which a deep reinforcement learning (RL) policy is trained to refine the grasp using tactile feedback. This policy is conditioned on the neural descriptors and does not require explicit specification of contact types. We validate NeuralTouch through ablation studies in simulation and zero-shot transfer to real-world manipulation tasks--such as peg-out-in-hole and bottle lid opening--without additional fine-tuning. Results show that NeuralTouch significantly improves grasping accuracy and robustness over baseline methods, offering a general framework for precise, contact-rich robotic manipulation.
中文摘要 抓取精度是精确物体操作的关键先决条件，通常需要机器人手和物体之间仔细对齐。神经描述符场（NDF）提供了一种很有前途的基于视觉的方法，用于生成可跨物体类别泛化的抓取位姿。然而，由于相机标定不完善、点云不完整和物体的多变性，仅使用 NDF 可能会产生不准确的位姿。同时，触觉传感可以实现更精确的接触，但现有方法学到的策略通常仅限于简单的、预定义的接触几何形状。在这项工作中，我们介绍了 NeuralTouch，这是一个多模态框架，它集成了 NDF 和触觉传感，通过轻柔的物理交互实现准确、可泛化的抓取。我们的方法利用 NDF 隐式表示目标接触几何形状，并据此训练深度强化学习（RL）策略，利用触觉反馈来细化抓取。此策略以神经描述符为条件，不需要显式指定接触类型。我们通过仿真中的消融研究以及向现实世界操作任务（例如 peg-out-in-hole 和打开瓶盖）的零样本迁移来验证 NeuralTouch，而无需额外的微调。结果表明，与基线方法相比，NeuralTouch 显著提高了抓取精度和鲁棒性，为精确的、接触丰富的机器人操作提供了通用框架。

Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control

平衡专业化与集中化：面向顺序工业控制的多智能体强化学习基准

Authors: Tom Maus, Asma Atamna, Tobias Glasmachers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.20408
Pdf link: https://arxiv.org/pdf/2510.20408
Abstract Autonomous control of multi-stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real-world applications. This study introduces an enhanced industry-inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi-agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.
中文摘要 多阶段工业过程的自主控制既需要局部专业化，也需要全局协调。强化学习（RL）提供了一种有前途的方法，但由于奖励设计、模块化和动作空间管理等挑战，其在工业中的采用仍然有限。许多学术基准与工业控制问题明显不同，限制了它们向实际应用的可迁移性。本研究引入了一个增强的、受工业启发的基准环境，该环境将两个现有基准（SortingEnv 和 ContainerGym）的任务组合成一个包含分拣和压制操作的顺序回收场景。我们评估了两种控制策略：具有专用智能体的模块化架构和管理整个系统的单体智能体，同时还分析了动作掩码（action masking）的影响。我们的实验表明，如果没有动作掩码，智能体将难以学习有效的策略，其中模块化架构的表现更好。应用动作掩码后，两种架构都有显著改进，并且性能差距大大缩小。这些结果突出了动作空间约束的决定性作用，并表明随着动作复杂性的降低，专业化的优势会减弱。因此，所提出的基准为探索工业自动化中实用且稳健的多智能体 RL 解决方案提供了一个有价值的测试平台，同时也为关于集中化与专业化的持续争论提供了参考。
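
Action masking, which the abstract identifies as decisive, is commonly implemented by removing invalid actions from the policy's distribution before sampling. The sketch below shows one such masked softmax in plain Python; the function name and the boolean-list mask format are assumptions, and the benchmark's own masking implementation may differ.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over valid actions only; invalid actions get probability 0.

    logits: list of floats, one per action.
    valid: list of booleans of the same length, True where the action is allowed.
    """
    if len(logits) != len(valid):
        raise ValueError("logits and valid must have the same length")
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because a masked action has exactly zero probability, the agent never spends exploration on it, which is one plausible reading of why masking narrows the gap between the modular and monolithic architectures.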

Why DPO is a Misspecified Estimator and How to Fix It

为什么 DPO 是一个错误指定的估计器以及如何修复它

Authors: Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20413
Pdf link: https://arxiv.org/pdf/2510.20413
Abstract Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
中文摘要 直接偏好优化（DPO）等直接对齐算法根据偏好数据微调模型，仅使用监督学习，而不是两阶段的人类反馈强化学习（RLHF）。我们表明，DPO 编码了一个关于由参数化策略类诱导的奖励函数的统计估计问题。当产生偏好的真实奖励函数无法通过该策略类实现时，DPO 就会出现模型误设（misspecification），从而导致偏好顺序反转、策略奖励恶化以及对输入偏好数据分布高度敏感等失效模式。另一方面，我们研究了参数化策略类下两阶段 RLHF 的局部行为，并将其与策略空间中的自然梯度步联系起来。我们的细粒度几何刻画使我们能够提出 AuxDPO，它在 DPO 损失函数中引入了额外的辅助变量，以有原则的方式帮助向 RLHF 解靠拢，并缓解 DPO 中的误设问题。我们通过实验证明了 AuxDPO 在教学性的老虎机（bandit）设置以及 LLM 对齐任务中的优越性能。
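
As a reference point for the discussion of misspecification, the standard per-pair DPO loss can be written in a few lines. This sketch covers plain DPO only; it does not include the auxiliary variables of AuxDPO, whose exact form is not given in the abstract. Inputs are sequence log-probabilities of the chosen response (`w`) and the rejected response (`l`) under the policy and under a frozen reference model, and the argument names are illustrative.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The term in square brackets is the implicit reward gap induced by the policy. The loss can only fit preferences through reward functions of this form, which is the restriction behind the misspecification argument in the abstract.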

LM-mixup: Text Data Augmentation via Language Model based Mixup

LM-mixup：通过基于语言模型的混合进行文本数据增强

Authors: Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.20449
Pdf link: https://arxiv.org/pdf/2510.20449
Abstract Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, obtained by first performing supervised fine-tuning on MIXTURE and then optimizing the model with reinforcement learning. This process uses three complementary reward signals (quality, semantic alignment, and format compliance) via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
中文摘要 指令调整对于调整大型语言模型（LLM）至关重要，但指令遵循数据的质量差异很大。虽然高质量的数据至关重要，但它往往很稀缺;相反，大量低质量数据经常被丢弃，导致大量信息丢失。现有的数据增强方法难以有效地增强这些低质量的数据，并且对此类技术的评估仍然定义不明确。为了解决这个问题，我们正式定义了指令蒸馏任务：将多个低质量和冗余的输入提炼成高质量和连贯的指令-输出对。具体来说，我们引入了一个全面的数据构建管道来创建 MIXTURE，这是一个 144K 样本的数据集，将低质量或语义冗余的不完美指令集群与其高质量的蒸馏配对。然后，我们引入了 LM-Mixup，首先对 MIXTURE 进行监督微调，然后通过强化学习对其进行优化。此过程通过组相对策略优化（GRPO）使用三个互补的奖励信号：质量、语义对齐和格式合规性。我们证明，LM-Mixup 有效地增强了不完美的数据集：对其蒸馏数据进行微调 LLM（仅占整个数据集的 3% 左右）不仅超越了全数据集训练，而且可以与跨多个基准的最先进的高质量数据选择方法竞争。我们的工作表明，当使用 LM-Mixup 进行适当提炼和增强时，低质量数据是一种宝贵的资源，从而显着提高指令调整的 LLM 的效率和性能。
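The reward setup described in this abstract can be pictured with a small sketch. GRPO scores each sampled response relative to the other responses drawn for the same prompt by normalizing rewards within the group. The sketch assumes the three reward signals are combined by a weighted sum; the weights, function names, and example scores are hypothetical and are not taken from the paper, and implementations differ on details such as which standard deviation is used.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three reward signals named in the abstract
# (quality, semantic alignment, format compliance). Not from the paper.
WEIGHTS = {"quality": 0.5, "alignment": 0.3, "format": 0.2}


def combined_reward(scores):
    """Weighted sum of the per-signal scores (an assumed combination rule)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses sampled for the same prompt.
group = [
    {"quality": 0.9, "alignment": 0.8, "format": 1.0},
    {"quality": 0.4, "alignment": 0.7, "format": 1.0},
    {"quality": 0.6, "alignment": 0.2, "format": 0.0},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, so no separate learned value baseline is needed in this sketch.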

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

柯南：在多尺度视觉证据上像侦探一样逐步学习推理

Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.20470
Pdf link: https://arxiv.org/pdf/2510.20470
Abstract Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
中文摘要 视频推理需要跨帧进行多步推理，仍然是多模态大语言模型（MLLM）面临的主要挑战。虽然基于强化学习（RL）的方法增强了推理能力，但它们通常依赖于纯文本链，从而产生毫无根据或幻觉的结论。相反，帧检索方法引入了视觉基础，但仍然难以解决不准确的证据定位问题。为了应对这些挑战，我们提出了柯南，这是一个基于证据的多步骤视频推理框架。柯南识别上下文帧和证据帧，对跨帧线索进行推理，并自适应地决定何时得出结论或进一步探索。为此，我们（1）构建了Conan-91K，这是一个包含帧识别、证据推理和行动决策在内的自动生成推理轨迹的大规模数据集，以及（2）设计一个多阶段渐进式冷启动策略，结合识别-推理-行动（AIR）RLVR训练框架，共同增强多步视觉推理。在六个多步骤推理基准测试上的广泛实验表明，柯南的准确率平均超过基线 Qwen2.5-VL-7B-Instruct 10% 以上，实现了最先进的性能。此外，柯南有效地推广到长视频理解任务，验证了其强大的可扩展性和鲁棒性。

A Unified Framework for Zero-Shot Reinforcement Learning

零样本强化学习的统一框架

Authors: Jacopo Di Ventura, Jan Felix Kleuker, Aske Plaat, Thomas Moerland
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20542
Pdf link: https://arxiv.org/pdf/2510.20542
Abstract Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents in an unsupervised manner, capable of solving downstream tasks without additional training or planning at test-time. Unlike conventional RL, which optimizes policies for a fixed reward, zero-shot RL requires agents to encode representations rich enough to support immediate adaptation to any objective, drawing parallels to vision and language foundation models. Despite growing interest, the field lacks a common analytical lens. We present the first unified framework for zero-shot RL. Our formulation introduces a consistent notation and taxonomy that organizes existing approaches and allows direct comparison between them. Central to our framework is the classification of algorithms into two families: direct representations, which learn end-to-end mappings from rewards to policies, and compositional representations, which decompose the representation leveraging the substructure of the value function. Within this framework, we highlight shared principles and key differences across methods, and we derive an extended bound for successor-feature methods, offering a new perspective on their performance in the zero-shot regime. By consolidating existing work under a common lens, our framework provides a principled foundation for future research in zero-shot RL and outlines a clear path toward developing more general agents.
中文摘要 零样本强化学习（RL）已成为一种以无监督方式开发通用智能体的设定，这类智能体能够在测试时无需额外训练或规划即可解决下游任务。与针对固定奖励优化策略的传统 RL 不同，零样本 RL 要求智能体编码足够丰富的表示，以支持立即适应任何目标，这与视觉和语言基础模型相似。尽管人们的兴趣日益浓厚，但该领域缺乏一个共同的分析视角。我们提出了第一个零样本 RL 的统一框架。我们的表述引入了一致的符号和分类法，组织了现有方法并允许在它们之间进行直接比较。我们框架的核心是将算法分为两个系列：直接表示，它学习从奖励到策略的端到端映射；以及组合表示，它利用价值函数的子结构分解表示。在这个框架内，我们强调了不同方法的共同原则和关键差异，并为后继特征方法推导出了一个扩展界，为它们在零样本设定中的表现提供了新的视角。通过将现有工作整合到一个共同的视角下，我们的框架为零样本 RL 的未来研究提供了原则性基础，并勾勒出开发更通用智能体的清晰路径。

GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

GlobalRAG：通过强化学习增强多跳问答中的全局推理

Authors: Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20548
Pdf link: https://arxiv.org/pdf/2510.20548
Abstract Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains constrained by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
中文摘要 强化学习最近在改进检索增强生成（RAG）方面显示出希望。尽管取得了这些进步，但它在多跳问答（QA）中的有效性仍然受到两个根本局限的制约：（i）缺乏构建多步推理的全局规划，以及（ii）不忠实的执行，这阻碍了有效的查询表述和检索到的证据的一致使用。我们提出了 GlobalRAG，这是一个强化学习框架，旨在增强多跳 QA 中的全局推理。GlobalRAG 将问题分解为子目标，协调检索与推理，并迭代完善证据。为了指导这一过程，我们引入了规划质量奖励和子目标完成奖励，它们鼓励连贯的规划和可靠的子目标执行。此外，渐进式权重退火策略平衡了面向过程的目标和基于结果的目标。在域内和域外基准测试上的广泛实验表明，GlobalRAG 在仅使用 8k 训练数据（占强基线使用的训练数据的 42%）时显着优于强基线，在 EM 和 F1 中实现了 14.2% 的平均改进。

AdaDoS: Adaptive DoS Attack via Deep Adversarial Reinforcement Learning in SDN

AdaDoS：SDN中深度对抗强化学习的自适应DoS攻击

Authors: Wei Shao, Yuhao Wang, Rongguang He, Muhammad Ejaz Ahmed, Seyit Camtepe
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20566
Pdf link: https://arxiv.org/pdf/2510.20566
Abstract Existing defence mechanisms have demonstrated significant effectiveness in mitigating rule-based Denial-of-Service (DoS) attacks, leveraging predefined signatures and static heuristics to identify and block malicious traffic. However, the emergence of AI-driven techniques presents new challenges to SDN security, potentially compromising the efficacy of existing defence mechanisms. In this paper, we introduce AdaDoS, an adaptive attack model that disrupts network operations while evading detection by existing DoS-based detectors through adversarial reinforcement learning (RL). Specifically, AdaDoS models the problem as a competitive game between an attacker, whose goal is to obstruct network traffic without being detected, and a detector, which aims to identify malicious traffic. AdaDoS can solve this game by dynamically adjusting its attack strategy based on feedback from the SDN and the detector. Additionally, recognising that attackers typically have less information than defenders, AdaDoS formulates the DoS-like attack as a partially observed Markov decision process (POMDP), with the attacker having access only to delay information between attacker and victim nodes. We address this challenge with a novel reciprocal learning module, where the student agent, with limited observations, enhances its performance by learning from the teacher agent, who has full observational capabilities in the SDN environment. AdaDoS represents the first application of RL to develop DoS-like attack sequences, capable of adaptively evading both machine learning-based and rule-based DoS-like attack detectors.
中文摘要 现有的防御机制在缓解基于规则的拒绝服务（DoS）攻击方面已证明具有显着的有效性，利用预定义的签名和静态启发式方法来识别和阻止恶意流量。然而，人工智能驱动技术的出现给 SDN 安全带来了新的挑战，可能会损害现有防御机制的有效性。在本文中，我们介绍了 AdaDoS，这是一种自适应攻击模型，通过对抗强化学习（RL）扰乱网络运行，同时逃避现有基于DoS的检测器的检测。具体来说，AdaDoS 将问题建模为攻击者和检测器之间的竞争游戏，攻击者的目标是在不被检测到的情况下阻止网络流量，而检测器旨在识别恶意流量。AdaDoS 可以通过根据 SDN 和检测器的反馈动态调整其攻击策略来解决这个问题。此外，AdaDoS 认识到攻击者通常比防御者拥有更少的信息，因此将类似 DoS 的攻击表述为部分观察到的马尔可夫决策过程（POMDP），攻击者只能访问攻击者和受害者节点之间的延迟信息。我们通过一种新颖的互惠学习模块来应对这一挑战，其中学生代理在有限的观察下通过向在 SDN 环境中具有完整观察能力的教师代理学习来提高其性能。AdaDoS 代表了 RL 开发类 DoS 攻击序列的第一个应用，能够自适应地规避基于机器学习和基于规则的类 DoS 攻击检测器。

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Open-o3 视频：具有明确时空证据的接地视频推理

Authors: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2510.20579
Pdf link: https://arxiv.org/pdf/2510.20579
Abstract Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
中文摘要 大多数视频推理模型仅生成文本推理轨迹，而不指示关键证据出现的时间和地点。最近的模型（例如 OpenAI-o3）引发了人们对以证据为中心的图像推理的广泛兴趣，但将这种能力扩展到视频更具挑战性，因为它需要跨动态场景进行联合时间跟踪和空间定位。我们引入了Open-o3 Video，这是一个将显式时空证据集成到视频推理中的非智能体框架，并仔细收集训练数据并设计训练策略来应对上述挑战。该模型在其答案旁边突出显示了关键时间戳、对象和边界框，从而使推理能够建立在具体的视觉观察基础上。为了实现这一功能，我们首先整理并构建了两个高质量的数据集，即用于SFT的STGR-CoT-30k和用于RL的STGR-RL-36k，并精心构建了时空注释，因为大多数现有数据集要么提供视频的时间跨度，要么提供图像上的空间框，缺乏统一的时空监督和推理轨迹。然后，我们采用冷启动强化学习策略，具有多个专门设计的奖励，共同鼓励答案准确性、时间对齐性和空间精度。在 V-STAR 基准测试中，Open-o3 Video 实现了最先进的性能，在 Qwen2.5-VL 基线上将 mAM 提高了 14.4%，mLGM 提高了 24.2%。在广泛的视频理解基准测试中也观察到持续的改进，包括 VideoMME、WorldSense、VideoMMMU 和 TVGBench。除了准确性之外，Open-o3 Video 产生的推理轨迹还为测试时间缩放提供了有价值的信号，从而实现置信度感知验证并提高答案的可靠性。

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

推理的形状：大语言模型中推理痕迹的拓扑分析

Authors: Xue Wen Tan, Nathaniel Tan, Galen Lee, Stanley Kok
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20665
Pdf link: https://arxiv.org/pdf/2510.20665
Abstract Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated efforts are dominated by graph-based proxies that quantify structural connectivity but do not clarify what constitutes high-quality reasoning; such abstractions can be overly simplistic for inherently complex processes. We introduce a topological data analysis (TDA)-based evaluation framework that captures the geometry of reasoning traces and enables label-efficient, automated assessment. In our empirical study, topological features yield substantially higher predictive power for assessing reasoning quality than standard graph metrics, suggesting that effective reasoning is better captured by higher-dimensional geometric structures rather than purely relational graphs. We further show that a compact, stable set of topological features reliably indicates trace quality, offering a practical signal for future reinforcement learning algorithms.
中文摘要 评估大型语言模型推理痕迹的质量仍然没有得到充分研究、劳动密集型且不可靠：目前的实践依赖于专家评分标准、手动注释和缓慢的成对判断。自动化工作以基于图形的代理为主，这些代理量化了结构连接性，但没有阐明什么是高质量推理;对于固有的复杂过程来说，这种抽象可能过于简单化。我们引入了基于拓扑数据分析（TDA）的评估框架，该框架捕获推理轨迹的几何形状，并实现标签高效的自动化评估。在我们的实证研究中，拓扑特征在评估推理质量方面产生的预测能力比标准图指标高得多，这表明高维几何结构而不是纯粹的关系图可以更好地捕捉有效的推理。我们进一步表明，一组紧凑、稳定的拓扑特征可靠地指示了跟踪质量，为未来的强化学习算法提供了实用的信号。

Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs

计划然后检索：强化学习引导的知识图谱复杂推理

Authors: Yanlin Song, Ben Liu, Víctor Gutiérrez-Basulto, Zhiwei Hu, Qianqian Xie, Min Peng, Sophia Ananiadou, Jeff Z. Pan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20691
Pdf link: https://arxiv.org/pdf/2510.20691
Abstract Knowledge Graph Question Answering aims to answer natural language questions by reasoning over structured knowledge graphs. While large language models have advanced KGQA through their strong reasoning capabilities, existing methods continue to struggle to fully exploit both the rich knowledge encoded in KGs and the reasoning capabilities of LLMs, particularly in complex scenarios. They often assume complete KG coverage and lack mechanisms to judge when external information is needed, and their reasoning remains locally myopic, failing to maintain coherent multi-step planning, leading to reasoning failures even when relevant knowledge exists. We propose Graph-RFT, a novel two-stage reinforcement fine-tuning KGQA framework with a 'plan-KGsearch-and-Websearch-during-think' paradigm, that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions. Graph-RFT introduces a chain-of-thought fine-tuning method with a customized plan-retrieval dataset that activates structured reasoning and resolves the GRPO cold-start problem. It then introduces a novel plan-retrieval guided reinforcement learning process that integrates explicit planning and retrieval actions with a multi-reward design, enabling coverage-aware retrieval scheduling. It employs a Cartesian-inspired planning module to decompose complex questions into ordered subquestions, and logical expressions to guide tool invocation for globally consistent multi-step reasoning. This reasoning-retrieval process is optimized with a multi-reward design combining outcome and retrieval-specific signals, enabling the model to learn when and how to combine KG and web retrieval effectively.
中文摘要 知识图谱问答旨在通过对结构化知识图谱进行推理来回答自然语言问题。虽然大型语言模型通过其强大的推理能力推进了KGQA，但现有方法仍然难以充分利用KG中编码的丰富知识和LLM的推理能力，特别是在复杂场景下。它们通常假设 KG 覆盖完整，缺乏判断何时需要外部信息的机制，并且其推理仍然处于局部短视状态，无法保持连贯的多步骤规划，即使存在相关知识，也会导致推理失败。我们提出了 Graph-RFT，这是一种新型的两阶段强化微调 KGQA 框架，具有“plan-KGsearch-and-Websearch-during-think”范式，使 LLM 能够在不完整的知识条件下跨 KG 和 Web 资源执行自主规划和自适应检索调度。Graph-RFT 引入了一种思维链微调方法，借助定制的计划检索数据集激活结构化推理并解决 GRPO 冷启动问题。然后，它引入了一种新颖的计划检索引导强化学习过程，将显式规划和检索动作与多奖励设计相结合，从而实现覆盖感知检索调度。它采用受笛卡尔启发的规划模块将复杂的问题分解为有序的子问题，并采用逻辑表达式来指导工具调用，以实现全局一致的多步骤推理。这种推理检索过程通过结合结果信号和检索特定信号的多奖励设计进行了优化，使模型能够学习何时以及如何有效地结合 KG 和 Web 检索。

Real-Time Gait Adaptation for Quadrupeds using Model Predictive Control and Reinforcement Learning

基于模型预测控制和强化学习的四足动物实时步态适应

Authors: Ganga Nair B, Prakrut Kotecha, Shishir Kolathaya
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20706
Pdf link: https://arxiv.org/pdf/2510.20706
Abstract Model-free reinforcement learning (RL) has enabled adaptable and agile quadruped locomotion; however, policies often converge to a single gait, leading to suboptimal performance. Traditionally, Model Predictive Control (MPC) has been extensively used to obtain task-specific optimal policies but lacks the ability to adapt to varying environments. To address these limitations, we propose an optimization framework for real-time gait adaptation in a continuous gait space, combining the Model Predictive Path Integral (MPPI) algorithm with a Dreamer module to produce adaptive and optimal policies for quadruped locomotion. At each time step, MPPI jointly optimizes the actions and gait variables using a learned Dreamer reward that promotes velocity tracking, energy efficiency, stability, and smooth transitions, while penalizing abrupt gait changes. A learned value function is incorporated as terminal reward, extending the formulation to an infinite-horizon planner. We evaluate our framework in simulation on the Unitree Go1, demonstrating an average reduction of up to 36.48% in energy consumption across varying target speeds, while maintaining accurate tracking and adaptive, task-appropriate gaits.
中文摘要 无模型强化学习（RL）实现了适应性和敏捷的四足运动;然而，策略往往趋同于单一步态，导致性能不佳。传统上，模型预测控制（MPC）被广泛用于获取特定任务的最优策略，但缺乏适应不同环境的能力。为了解决这些限制，我们提出了一个在连续步态空间中进行实时步态适应的优化框架，将模型预测路径积分（MPPI）算法与Dreamer模块相结合，为四足运动生成自适应和最优策略。在每个时间步长，MPPI使用学习得到的 Dreamer 奖励共同优化动作和步态变量，该奖励促进速度跟踪、能源效率、稳定性和平滑过渡，同时惩罚突然的步态变化。学习到的价值函数被用作终端奖励，将该表述扩展为无限时域规划器。我们在 Unitree Go1 上的模拟中评估了我们的框架，表明在不同的目标速度下，能耗平均最多降低 36.48%，同时保持了准确的跟踪和自适应的、适合任务的步态。
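To make the MPPI step in this abstract more concrete, here is a minimal, generic sketch of one MPPI update over a small decision vector. It is not the paper's planner: the two-dimensional vector (one "action" and one "gait parameter"), the quadratic `toy_cost`, and all parameter values are hypothetical stand-ins for the learned Dreamer reward and value function.

```python
import math
import random


def mppi_step(nominal, cost_fn, num_samples=512, noise_std=1.0,
              temperature=1.0, rng=None):
    """One MPPI update: perturb, score, and take a softmin-weighted average."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    dim = len(nominal)
    # Sample Gaussian perturbations around the nominal decision vector.
    noises = [[rng.gauss(0.0, noise_std) for _ in range(dim)]
              for _ in range(num_samples)]
    costs = [cost_fn([u + e for u, e in zip(nominal, eps)]) for eps in noises]
    # Exponentiated negative cost; subtracting the minimum keeps exp() stable.
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    # Shift the nominal vector by the weighted mean perturbation.
    return [
        u + sum(w * eps[i] for w, eps in zip(weights, noises)) / total
        for i, u in enumerate(nominal)
    ]


def toy_cost(x):
    """Hypothetical cost over [action, gait_parameter]; lower is better."""
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2


start = [0.0, 0.0]
updated = mppi_step(start, toy_cost)
print(updated, toy_cost(updated))
```

Because both entries of the vector are perturbed and weighted together, the action and the gait parameter are optimized jointly in a single update, which is the property the abstract emphasizes.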

No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes

使用高斯过程进行有限视界马尔可夫决策过程的无后悔汤普森采样

Authors: Jasmine Bayrooti, Sattar Vakili, Amanda Prorok, Carl Henrik Ek
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20725
Pdf link: https://arxiv.org/pdf/2510.20725
Abstract Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
中文摘要 汤普森采样（TS）是一种强大且广泛使用的顺序决策策略，其应用范围从贝叶斯优化到强化学习（RL）。尽管取得了成功，但 TS 的理论基础仍然有限，特别是在 RL 等具有复杂时间结构的环境中。我们通过使用具有高斯边际分布的模型为 TS 建立无后悔保证来解决这一差距。具体来说，我们考虑回合式 RL 中的 TS，并对奖励和转移采用联合高斯过程（GP）先验。我们证明了在 $K$ 个时域长度为 $H$ 的回合上的后悔界为 $\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$，其中 $\Gamma(\cdot)$ 刻画了 GP 模型的复杂性。我们的分析解决了几个挑战，包括值函数的非高斯性质和贝尔曼更新的递归结构，并将椭圆势引理等经典工具扩展到多输出设置。这项工作促进了对 RL 中 TS 的理解，并强调了结构假设和模型不确定性如何影响其在有限视界马尔可夫决策过程中的性能。
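For readers less familiar with the terminology, the following worked step shows why a bound of this form is called "no-regret". It uses the standard definition of episodic regret rather than anything specific to the paper. The notation is an assumption, since the abstract does not fix it: $\pi_k$ is the policy played in episode $k$, $s_1^k$ is that episode's initial state, and $V_1^{\pi}$ is the expected return of $\pi$ over $H$ steps.

```latex
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V_1^{*}(s_1^k) - V_1^{\pi_k}(s_1^k) \Big),
\qquad
\frac{\mathrm{Regret}(K)}{K}
\;=\; \tilde{\mathcal{O}}\!\left( \frac{\sqrt{K H \,\Gamma(KH)}}{K} \right)
\;=\; \tilde{\mathcal{O}}\!\left( \sqrt{\frac{H \,\Gamma(KH)}{K}} \right).
\]
```

The average per-episode regret therefore vanishes as $K$ grows whenever $H\,\Gamma(KH)$ grows more slowly than $K$, up to logarithmic factors.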

GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

GSWorld：用于机器人操作的闭环逼真仿真套件

Authors: Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, Xiaolong Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.20813
Pdf link: https://arxiv.org/pdf/2510.20813
Abstract This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: this https URL.
中文摘要 本文介绍了 GSWorld，这是一种强大的、逼真的机器人操作模拟器，它将 3D Gaussian Splatting 与物理引擎相结合。我们的框架提倡在开发操作策略时实现“闭环”，对从真实机器人数据中学到的策略进行可重复的评估，并在不使用真实机器人的情况下进行 sim2real 策略训练。为了实现不同场景的逼真渲染，我们提出了一种新的资产格式，我们称之为 GSDF（高斯场景描述文件），它将网格上的高斯表示与机器人 URDF 和其他对象相结合。通过简化的重建流程，我们策划了一个 GSDF 数据库，其中包含 3 个用于单臂和双臂操作的机器人实施例，以及 40 多个对象。将 GSDF 与物理引擎相结合，我们展示了几个直接有趣的应用：（1）通过逼真的渲染学习零样本 sim2real 像素到动作的操作策略，（2）自动收集高质量 DAgger 数据以使策略适应部署环境，（3）在模拟中对真实机器人操作策略进行可重复的基准测试，（4）通过虚拟遥操作收集模拟数据，以及（5）零样本 sim2real 视觉强化学习。网站：this https URL。

KL-Regularized Reinforcement Learning is Designed to Mode Collapse

KL 正则化强化学习在设计上就会导致模式坍缩

Authors: Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20817
Pdf link: https://arxiv.org/pdf/2510.20817
Abstract It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
中文摘要 人们普遍认为，优化反向 KL 散度会导致“模式寻求”，而优化正向 KL 会导致“质量覆盖”，如果目标是从多种不同模式中采样，则后者是首选。我们从数学和实证上表明，这种直觉不一定能很好地转移到使用反向/正向 KL 正则化（例如，通常用于语言模型）的强化学习中。相反，反向/正向 KL 的选择决定了最佳目标分布族，由正则化系数参数化。模式覆盖率主要取决于其他因素，例如正则化强度以及奖励和参考概率之间的相对尺度。此外，我们表明，常用的设置（例如低正则化强度和相等的可验证奖励）倾向于指定单峰目标分布，这意味着优化目标在构造上是非多样化的。我们利用这些见解来构建一个简单、可扩展且理论合理的算法。它对奖励量级的更改最小，但针对目标分布进行了优化，该分布对所有高质量采样模式都赋予很高的概率。在实验中，这种简单的修改可以对大型语言模型和化学语言模型进行后训练，使其具有更高的解质量和多样性，而无需任何外部多样性信号，并且在直接使用正向或反向 KL 均失败的情况下，对两者都有效。
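The claim about unimodal targets can be illustrated with the standard closed form for reverse-KL-regularized reward maximization, where the optimal policy is proportional to the reference probability times exp(reward / beta). The sketch below is only that textbook formula applied to a toy example, not the algorithm proposed in the paper; the three candidate answers, their reference probabilities, and the value of `beta` are hypothetical.

```python
import math


def reverse_kl_target(ref_probs, rewards, beta):
    """Target distribution pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical setup: two correct answers with equal verifiable reward 1,
# one incorrect answer with reward 0. The reference model strongly prefers
# the first correct answer.
ref_probs = [0.90, 0.09, 0.01]
rewards = [1.0, 1.0, 0.0]

target = reverse_kl_target(ref_probs, rewards, beta=0.1)
print(target)
print(target[0] / target[1])  # ratio between the two equally rewarded answers
```

With equal rewards the exponential factor cancels between the two correct answers, so their ratio in the target stays at the reference ratio (10 to 1 here) for any `beta`. Lowering `beta` removes mass from the incorrect answer but does not rebalance the correct ones, which is one way to see how such a target can remain concentrated on a single mode.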

Keyword: diffusion policy

There is no result