Arxiv Papers of Today

Generated: 2025-10-29 16:30:55 (UTC+8); Arxiv announcement time: 2025-10-29 20:00 EDT (2025-10-30 08:00 UTC+8)

33 relevant papers today

Keyword: reinforcement learning

Logic-based Task Representation and Reward Shaping in Multiagent Reinforcement Learning

Authors: Nishant Doshi
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.23615
Pdf link: https://arxiv.org/pdf/2510.23615
Abstract This paper presents an approach for accelerated learning of optimal plans for a given task represented using Linear Temporal Logic (LTL) in multi-agent systems. Given a set of options (temporally abstract actions) available to each agent, we convert the task specification into the corresponding Büchi automaton and proceed with a model-free approach which collects transition samples and constructs a product Semi-Markov Decision Process (SMDP) on-the-fly. Value-based Reinforcement Learning algorithms can then be used to synthesize a correct-by-design controller without learning the underlying transition model of the multi-agent system. The exponential sample complexity due to multiple agents is dealt with using a novel reward shaping approach. We test the proposed algorithm in a deterministic gridworld simulation for different tasks and find that the reward shaping results in a significant reduction in convergence times. We also infer that using options becomes increasingly relevant as the state and action spaces grow in multi-agent systems.
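The abstract does not say how its reward shaping works, so the sketch below swaps in standard potential-based shaping as an illustration only; it is not the paper's method. It assumes the potential of an automaton state is the negative number of automaton transitions to the nearest accepting state, and that an option lasting `duration` steps is discounted by `gamma ** duration`, as is usual for SMDPs. The toy automaton and all names are hypothetical.

```python
from collections import deque


def distance_to_accepting(transitions, accepting):
    """BFS over reversed automaton edges.

    Returns, for every state that can reach an accepting state, the
    number of automaton transitions to the nearest accepting state.
    """
    reverse = {}
    for src, dsts in transitions.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)

    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in reverse.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def shaped_reward(reward, q, q_next, dist, gamma=0.99, duration=1):
    """Potential-based shaping: r + gamma^duration * phi(q') - phi(q).

    Assumption: states that cannot reach acceptance get a potential
    lower than any reachable state.
    """
    worst = len(dist) + 1

    def phi(state):
        return -float(dist.get(state, worst))

    return reward + (gamma ** duration) * phi(q_next) - phi(q)


# Hypothetical automaton for "eventually a, then eventually b".
TOY_AUTOMATON = {"q0": {"q0", "q1"}, "q1": {"q1", "q2"}, "q2": {"q2"}}
TOY_DIST = distance_to_accepting(TOY_AUTOMATON, {"q2"})
```

With `gamma=1.0`, a transition that moves one step closer to acceptance earns a bonus of 1, and a self-loop earns nothing, so the agent gets a learning signal before the task is completed.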

Debiasing Reward Models by Representation Learning with Guarantees

Authors: Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, Shiva Kasiviswanathan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.23751
Pdf link: https://arxiv.org/pdf/2510.23751
Abstract Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the spurious latent variables is available. This further inspires a practical method that uses variational inference to recover these variables and leverages them to train reward models. Experiments on synthetic and real-world datasets demonstrate that our method effectively mitigates spurious correlation issues and yields more robust reward models.

GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

Authors: Zhichao Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.23868
Pdf link: https://arxiv.org/pdf/2510.23868
Abstract I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
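The loss described in the abstract can be sketched roughly as below. This is a reading of the abstract, not the author's code: it assumes the implicit reward takes the DPO form `beta * (log pi - log pi_ref)`, that normalization is a per-prompt z-score over the sampled group (as in GRPO), and that the "intractable term" is a prompt-dependent constant that cancels under mean subtraction. The function names and `beta` default are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize one group of values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def implicit_rewards(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward per response: beta * (log pi - log pi_ref)."""
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]


def gift_style_loss(logp_policy, logp_ref, explicit_rewards, beta=0.1):
    """MSE between group-normalized implicit and explicit rewards.

    All inputs describe responses sampled for the SAME prompt, so any
    additive per-prompt constant in the implicit reward is removed by
    the mean subtraction inside normalize().
    """
    implicit = normalize(implicit_rewards(logp_policy, logp_ref, beta))
    explicit = normalize(explicit_rewards)
    return mean([(i - e) ** 2 for i, e in zip(implicit, explicit)])


# Toy group of four responses to one prompt (made-up numbers).
LOGP_POLICY = [-1.0, -2.0, -3.0, -4.0]
LOGP_REF = [-2.0, -2.0, -2.0, -2.0]
REWARDS = [4.0, 3.0, 2.0, 1.0]
```

In this toy group the implicit rewards are already an affine function of the explicit ones, so the loss is essentially zero; reversing `REWARDS` makes it clearly positive. Shifting every policy log-probability by the same constant leaves the loss unchanged, which is the cancellation the sketch is meant to show.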

Hybrid Modeling, Sim-to-Real Reinforcement Learning, and Large Language Model Driven Control for Digital Twins

Authors: Adil Rasheed, Oscar Ravik, Omer San
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23882
Pdf link: https://arxiv.org/pdf/2510.23882
Abstract This work investigates the use of digital twins for dynamical system modeling and control, integrating physics-based, data-driven, and hybrid approaches with both traditional and AI-driven controllers. Using a miniature greenhouse as a test platform, four predictive models (Linear, Physics-Based Modeling (PBM), Long Short Term Memory (LSTM), and Hybrid Analysis and Modeling (HAM)) are developed and compared under interpolation and extrapolation scenarios. Three control strategies (Model Predictive Control (MPC), Reinforcement Learning (RL), and Large Language Model (LLM) based control) are also implemented to assess trade-offs in precision, adaptability, and implementation effort. Results show that, in modeling, HAM provides the most balanced performance across accuracy, generalization, and computational efficiency, while LSTM achieves high precision at greater resource cost. Among controllers, MPC delivers robust and predictable performance, RL demonstrates strong adaptability, and LLM-based controllers offer flexible human-AI interaction when coupled with predictive tools.

Stand, Walk, Navigate: Recovery-Aware Visual Navigation on a Low-Cost Wheeled Quadruped

Authors: Jans Solano, Diego Quiroz
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.23902
Pdf link: https://arxiv.org/pdf/2510.23902
Abstract Wheeled-legged robots combine the efficiency of wheels with the obstacle negotiation of legs, yet many state-of-the-art systems rely on costly actuators and sensors, and fall-recovery is seldom integrated, especially for wheeled-legged morphologies. This work presents a recovery-aware visual-inertial navigation system on a low-cost wheeled quadruped. The proposed system leverages vision-based perception from a depth camera and deep reinforcement learning policies for robust locomotion and autonomous recovery from falls across diverse terrains. Simulation experiments show agile mobility with low-torque actuators over irregular terrain and reliable recovery from external perturbations and self-induced failures. We further show goal-directed navigation in structured indoor spaces with low-cost perception. Overall, this approach lowers the barrier to deploying autonomous navigation and robust locomotion policies in budget-constrained robotic platforms.

Secure Control of Connected and Autonomous Electrified Vehicles Under Adversarial Cyber-Attacks

Authors: Shashank Dhananjay Vyas, Satadru Dey
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.23922
Pdf link: https://arxiv.org/pdf/2510.23922
Abstract Connected and Autonomous Electrified Vehicles (CAEVs) are a solution for future smart mobility, offering efficient traffic flow and a cleaner environmental impact. Although CAEVs have advantages, they are still susceptible to adversarial cyber attacks due to their autonomous electric operation and the involved connectivity. To alleviate this issue, we propose a secure control architecture for CAEVs. In particular, we design an additional control input using Reinforcement Learning (RL) to be applied to the vehicle powertrain along with the input commanded by the battery. We present simulation case studies to demonstrate the potential of the proposed approach in keeping the CAEV platoon operating safely without collisions by curbing the effect of adversarial attacks.

Latent Chain-of-Thought for Visual Reasoning

Authors: Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.23925
Pdf link: https://arxiv.org/pdf/2510.23925
Abstract Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.

Reasoning Visual Language Model for Chest X-Ray Analysis

Authors: Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.23968
Pdf link: https://arxiv.org/pdf/2510.23968
Abstract Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.

VOCALoco: Viability-Optimized Cost-aware Adaptive Locomotion

Authors: Stanley Wu, Mohamad H. Danesh, Simon Li, Hanna Yurchyk, Amin Abyaneh, Anas El Houssaini, David Meger, Hsiu-Chin Lin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.23997
Pdf link: https://arxiv.org/pdf/2510.23997
Abstract Recent advancements in legged robot locomotion have facilitated traversal over increasingly complex terrains. Despite this progress, many existing approaches rely on end-to-end deep reinforcement learning (DRL), which poses limitations in terms of safety and interpretability, especially when generalizing to novel terrains. To overcome these challenges, we introduce VOCALoco, a modular skill-selection framework that dynamically adapts locomotion strategies based on perceptual input. Given a set of pre-trained locomotion policies, VOCALoco evaluates their viability and energy consumption by predicting both the safety of execution and the anticipated cost of transport over a fixed planning horizon. This joint assessment enables the selection of policies that are both safe and energy-efficient, given the observed local terrain. We evaluate our approach on staircase locomotion tasks, demonstrating its performance in both simulated and real-world scenarios using a quadrupedal robot. Empirical results show that VOCALoco achieves improved robustness and safety during stair ascent and descent compared to a conventional end-to-end DRL policy.
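The skill-selection step described in the abstract might look like the sketch below. The abstract only says that viability and cost of transport are predicted jointly; the specific rule used here (filter by a viability threshold, then pick the cheapest, with a fall-back to the safest skill) is an assumption, as are the class name, the threshold value, and the example numbers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SkillEstimate:
    name: str
    viability: float          # predicted probability of safe execution over the horizon
    cost_of_transport: float  # predicted energy cost over the same horizon


def select_skill(
    estimates: Sequence[SkillEstimate], min_viability: float = 0.9
) -> Optional[SkillEstimate]:
    """Pick the cheapest skill among those predicted to be safe.

    Hypothetical rule: if no skill clears the viability threshold,
    fall back to the one with the highest predicted viability.
    Returns None only when no estimates are given.
    """
    if not estimates:
        return None
    safe = [e for e in estimates if e.viability >= min_viability]
    if safe:
        return min(safe, key=lambda e: e.cost_of_transport)
    return max(estimates, key=lambda e: e.viability)


# Made-up predictions for three pre-trained policies on one terrain patch.
CANDIDATES = [
    SkillEstimate("walk", viability=0.95, cost_of_transport=1.2),
    SkillEstimate("climb", viability=0.99, cost_of_transport=2.0),
    SkillEstimate("trot", viability=0.50, cost_of_transport=0.8),
]
```

Here `select_skill(CANDIDATES)` returns the `walk` estimate: `trot` is cheaper but is filtered out as unsafe, and `climb` is safe but costs more.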

Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Authors: Hao An, Yang Xu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24020
Pdf link: https://arxiv.org/pdf/2510.24020
Abstract Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on Fine-grained Semantic Confidence Reward, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
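The sample-then-cluster step can be illustrated with the toy sketch below. It is not the paper's pipeline: real semantic clustering would compare meanings, whereas this stand-in groups answers by a normalized string, and it assumes a cluster's confidence is simply its share of the sampled answers. The function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def canonical(answer):
    """Crude stand-in for semantic clustering: lowercase, collapse
    whitespace, and strip surrounding spaces and periods."""
    return " ".join(answer.lower().split()).strip(" .")


def cluster_confidences(answers):
    """Confidence of each sampled answer = share of samples in its cluster."""
    keys = [canonical(a) for a in answers]
    counts = Counter(keys)
    total = len(keys)
    return [counts[k] / total for k in keys]


def keep_or_abstain(answers, threshold=0.5):
    """Label each sample 'keep' (high-confidence cluster) or 'abstain'."""
    confidences = cluster_confidences(answers)
    return [
        (answer, "keep" if conf >= threshold else "abstain")
        for answer, conf in zip(answers, confidences)
    ]


# Four hypothetical samples for one question.
SAMPLES = ["Paris", "paris.", "Paris ", "Lyon"]
```

For `SAMPLES`, the three spellings of Paris form one cluster with confidence 0.75 and are kept, while the lone "Lyon" sample has confidence 0.25 and is marked for abstention. Labels of this kind are what a per-sample reward could be built from.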

Causal-Aware Generative Adversarial Networks with Reinforcement Learning

Authors: Tu Anh Hoang Nguyen, Dang Nguyen, Tri-Nhan Vo, Thuc Duy Le, Sunil Gupta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24046
Pdf link: https://arxiv.org/pdf/2510.24046
Abstract The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationships, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN utilizes a two-step approach: causal graph extraction to learn a robust, comprehensive causal relationship in the data's manifold, followed by a custom Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates exclusively as per the structure of nodes in the causal graph. More importantly, the generator is trained with a new Reinforcement Learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring causal awareness in both training and sampling phases. We demonstrate CA-GAN's superiority over six SOTA methods across 14 tabular datasets. Our evaluations focus on core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.

ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring

Authors: Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen, Nadine Chang, Maying Shen, Jingyu Song, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.24108
Pdf link: https://arxiv.org/pdf/2510.24108
Abstract End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at this https URL.

Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Authors: Vivek Kalyan, Martin Andrews
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.24126
Pdf link: https://arxiv.org/pdf/2510.24126
Abstract Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test time, which show that these agents achieve better results if allowed to operate over longer multi-turn horizons.

BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data

Authors: Bingsen Qiu, Zijian Liu, Xiao Liu, Haoshen Yang, Zeren Gao, Bingjie Wang, Feier Zhang, Yixuan Qin, Chunyan Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24151
Pdf link: https://arxiv.org/pdf/2510.24151
Abstract Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems -- requiring the integration of ambiguous, indirect, and cross-domain cues -- these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions -- where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence -- incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.

PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

Authors: Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, QianLin Zhou, Ke Zeng, Xunliang Cai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24235
Pdf link: https://arxiv.org/pdf/2510.24235
Abstract Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at this https URL.

GRAPHIA: Harnessing Social Graph Data to Enhance LLM-Based Social Simulation

Authors: Jiarui Ji, Zehua Zhang, Zhewei Wei, Bin Tong, Guan Wang, Bo Zheng
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2510.24251
Pdf link: https://arxiv.org/pdf/2510.24251
Abstract Large language models (LLMs) have shown promise in simulating human-like social behaviors. Social graphs provide high-quality supervision signals that encode both local interactions and global network structure, yet they remain underutilized for LLM training. To address this gap, we propose Graphia, the first general LLM-based social graph simulation framework that leverages graph data as supervision for LLM post-training via reinforcement learning. With GNN-based structural rewards, Graphia trains specialized agents to predict whom to interact with (destination selection) and how to interact (edge generation), followed by designed graph generation pipelines. We evaluate Graphia under two settings: Transductive Dynamic Graph Generation (TDGG), a micro-level task with our proposed node-wise interaction alignment metrics; and Inductive Dynamic Graph Generation (IDGG), a macro-level task with our proposed metrics for aligning emergent network properties. On three real-world networks, Graphia improves micro-level alignment by 6.1% in the composite destination selection score, 12% in edge classification accuracy, and 27.9% in edge content BERTScore over the strongest baseline. For macro-level alignment, it achieves 41.11% higher structural similarity and 32.98% better replication of social phenomena such as power laws and echo chambers. Graphia also supports counterfactual simulation, generating plausible behavioral shifts under platform incentives. Our results show that social graphs can serve as high-quality supervision signals for LLM post-training, closing the gap between agent behaviors and network dynamics for LLM-based simulation. Code is available at this https URL.
中文摘要 大型语言模型（LLM）在模拟类人社交行为方面显示出前景。社交图谱提供了编码局部交互和全局网络结构的高质量监督信号，但它们在 LLM 训练中仍未得到充分利用。为了弥补这一差距，我们提出了 Graphia，这是第一个基于 LLM 的通用社交图模拟框架，它利用图数据作为监督，通过强化学习进行 LLM 后训练。通过基于 GNN 的结构奖励，Graphia 训练专门的智能体来预测与谁交互（目的地选择）以及如何交互（边生成），随后接入专门设计的图生成管道。我们在两种设置下评估 Graphia：转导动态图生成（TDGG），这是一项微观任务，使用我们提出的按节点交互对齐指标；以及归纳动态图生成（IDGG），这是一项宏观任务，使用我们提出的用于对齐涌现网络属性的指标。在三个真实世界的网络上，与最强基线相比，Graphia 在综合目的地选择分数方面提高了 6.1%，边分类准确度提高了 12%，边内容 BERTScore 提高了 27.9%。对于宏观层面的对齐，它的结构相似性提高了 41.11%，对幂律和回声室等社会现象的复现效果提升了 32.98%。Graphia 还支持反事实模拟，在平台激励下产生合理的行为转变。我们的结果表明，社交图谱可以作为 LLM 后训练的高质量监督信号，缩小基于 LLM 的模拟中智能体行为与网络动力学之间的差距。代码可在此 https URL 中找到。

Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?

LLM 能否将人类指令转化为强化学习智能体的内部涌现符号表示？

Authors: Ziqi Ma, Sao Mai Nguyen, Philippe Xu
Subjects: Computation and Language (cs.CL); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.24259
Pdf link: https://arxiv.org/pdf/2510.24259
Abstract Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly used LLMs (GPT, Claude, Deepseek and Grok) across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs' capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
中文摘要 涌现符号表示对于使发展学习智能体能够跨任务进行规划和泛化至关重要。在这项工作中，我们研究了大型语言模型（LLM）是否可以将人类自然语言指令翻译成分层强化学习过程中出现的内部符号表示。我们应用结构化评估框架来衡量常见的 LLM（GPT、Claude、Deepseek 和 Grok）在 Ant Maze 和 Ant Fall 环境中分层强化学习算法生成的不同内部符号分区上的翻译性能。我们的研究结果表明，尽管 LLM 表现出一定的能力，可以将自然语言翻译成环境动态的符号表示，但它们的性能对分区粒度和任务复杂性高度敏感。结果揭示了当前 LLM 在表征对齐方面的局限性，强调了进一步研究语言和智能体内部表征之间稳健对齐的必要性。

Survey and Tutorial of Reinforcement Learning Methods in Process Systems Engineering

过程系统工程中强化学习方法的调查和教程

Authors: Maximilian Bloor, Max Mowbray, Ehecatl Antonio Del Rio Chanona, Calvin Tsay
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24272
Pdf link: https://arxiv.org/pdf/2510.24272
Abstract Sequential decision making under uncertainty is central to many Process Systems Engineering (PSE) challenges, where traditional methods often face limitations related to controlling and optimizing complex and stochastic systems. Reinforcement Learning (RL) offers a data-driven approach to derive control policies for such challenges. This paper presents a survey and tutorial on RL methods, tailored for the PSE community. We deliver a tutorial on RL, covering fundamental concepts and key algorithmic families including value-based, policy-based and actor-critic methods. Subsequently, we survey existing applications of these RL techniques across various PSE domains, such as fed-batch and continuous process control, process optimization, and supply chains. We conclude with a PSE-focused discussion of specialized techniques and emerging directions. By synthesizing the current state of RL algorithm development and its implications for PSE, this work identifies successes, challenges, and trends, and outlines avenues for future research at the interface of these fields.
中文摘要 不确定性下的顺序决策是许多过程系统工程（PSE）挑战的核心，其中传统方法通常面临与控制和优化复杂和随机系统相关的局限性。强化学习（RL）提供了一种数据驱动的方法来推导出针对此类挑战的控制策略。本文提出了针对 PSE 社区量身定制的 RL 方法的调查和教程。我们提供了一个关于 RL 的教程，涵盖基本概念和关键算法系列，包括基于价值、基于策略和参与者批评方法。随后，我们调查了这些 RL 技术在各个 PSE 领域的现有应用，例如在补料分批和连续过程控制、过程优化和供应链中。最后，我们以 PSE 重点讨论专业技术和新兴方向为重点。通过综合RL算法开发的现状和对PSE的影响，这项工作确定了成功、挑战、趋势，并概述了这些领域界面的未来研究途径。

ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

ViPER：赋能视觉语言模型视觉感知能力的自我进化

Authors: Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.24285
Pdf link: https://arxiv.org/pdf/2510.24285
Abstract The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
中文摘要 细粒度视觉感知的能力有限，这给视觉语言模型（VLM）在实际应用中带来了一个关键瓶颈。由于高质量数据的稀缺和现有方法的局限性，解决这个问题具有挑战性：监督微调（SFT）通常会损害一般功能，而强化微调（RFT）优先考虑文本推理而不是视觉感知。为了弥合这一差距，我们提出了一种新颖的两阶段任务，该任务将视觉感知学习构建为一个从粗到细的渐进过程。基于这个任务表述，我们开发了 ViPER，这是一个自我引导框架，专门设计用于通过自我批评和自我预测实现迭代进化。通过将图像级和实例级重建与两阶段强化学习策略协同融合，ViPER建立了一种闭环训练范式，内部合成的数据直接推动感知能力的增强。ViPER 应用于 Qwen2.5-VL 系列，生产 Qwen-Viper 系列。Qwen-Viper 在跨越各种任务的七个综合基准测试中平均提高了 1.7%，在细粒度感知上提高了 6.0%，在不同的视觉语言场景中始终表现出卓越的性能，同时保持了通用性。除了实现感知能力的自我完善之外，ViPER 还为生成和理解之间的互惠关系提供了具体证据，这是开发更自主、功能更强大的 VLM 的突破。

Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

基于前瞻树的 rollout：增强可验证奖励强化学习中的轨迹级探索

Authors: Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.24302
Pdf link: https://arxiv.org/pdf/2510.24302
Abstract Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at this https URL.
中文摘要 具有可验证奖励（RLVR）的强化学习，特别是使用群体相对策略优化（GRPO）等算法，已被证明在增强大型语言模型的推理能力方面非常有效。然而，当前管道的一个关键瓶颈在于组推出期间采样轨迹的多样性有限。同质的轨迹及其相关的奖励会减少政策更新的回报信号，从而阻碍有效的政策学习。这种多样性的缺乏主要源于标记级随机抽样，其中局部变化可能会崩溃为几乎相同的推理路径。为了解决这一限制，我们提出了基于前瞻树的推出（LATR），这是一种新颖的推出策略，旨在通过强制分支到可能产生不同延续的不同候选代币来明确促进轨迹级别的多样性。具体来说，LATR 分三个阶段迭代运行：（1）在高不确定性生成步骤中进行分支，（2）对每个新分支执行前瞻模拟，以及（3）修剪在模拟过程中表现出长时间相似性的分支。与随机抽样相比，LATR在GRPO和动态抽样策略优化（DAPO）算法上平均加速了131%，在不同推理任务中将最终pass@1性能提高了4.2%。我们的代码和数据可在此 https URL 上公开获取。
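
Illustrative sketch (not from the paper): the three LATR stages described above can be approximated with a toy example. The abstract does not say how uncertainty or similarity is measured, so this sketch assumes token entropy as the uncertainty signal and position-wise agreement as the similarity measure. All function names, thresholds, and the list-of-tokens representation are hypothetical, and stage (2), the lookahead itself, is represented only by precomputed continuations.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def branch_tokens(probs, entropy_threshold=1.0, width=2):
    """Stage 1 (assumed form): branch into the top-`width` tokens only when uncertain."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if token_entropy(probs) >= entropy_threshold:
        return ranked[:width]
    return ranked[:1]


def overlap(a, b):
    """Fraction of aligned positions where two lookahead continuations agree."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def prune_similar(lookaheads, max_overlap=0.8):
    """Stage 3 (assumed form): drop a branch whose lookahead is too close to a kept one."""
    kept = []
    for continuation in lookaheads:
        if all(overlap(continuation, k) < max_overlap for k in kept):
            kept.append(continuation)
    return kept


# Stage 2 would run the policy for a few tokens per branch; here the
# continuations are simply given as lists of tokens.
uncertain = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
confident = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}
lookaheads = [
    ["x", "y", "z", "w", "v"],
    ["x", "y", "z", "w", "u"],  # agrees with the first on 4 of 5 positions
    ["p", "q", "r", "s", "t"],
]
survivors = prune_similar(lookaheads)
```

In this toy setup the second continuation is pruned because it stays too similar to the first, which is the kind of collapse into near-identical reasoning paths the abstract describes.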

MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation

MiniOneRec：用于扩展生成式推荐的开源框架

Authors: Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24431
Pdf link: https://arxiv.org/pdf/2510.24431
Abstract The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
中文摘要 大型语言模型（LLM）最近的成功重新引发了人们对推荐系统是否能够实现类似扩展优势的兴趣。传统的推荐器以海量嵌入表为主，随着嵌入维度的增长而趋于稳定。相比之下，新兴的生成范式用自回归 Transformer 生成的紧凑语义 ID （SID）序列取代了嵌入。然而，大多数工业部署仍然是专有的，留下了两个基本问题：（1）预期的扩展定律是否符合公共基准？（2）实现竞争表现的最低限度的训练后秘诀是什么？据我们所知，我们展示了第一个完全开源的生成式推荐框架 MiniOneRec，它提供了一个端到端的工作流程，涵盖 SID 构建、监督微调和面向推荐的强化学习。我们通过残差量化 VAE 生成 SID，并在 Amazon Review 数据集上生成训练后 Qwen 主干，参数范围从 0.5B 到 7B 不等。我们的实验表明，随着模型规模的增加，训练和评估损失呈持续下降的趋势，验证了生成方法的参数效率。为了进一步提高性能，我们提出了一个轻量级但有效的训练后管道，该管道（1）强制执行全流程 SID 对齐，以及（2）应用具有约束解码和混合奖励的强化学习。这些技术共同显着提高了排名准确性和候选人多样性。
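
Illustrative sketch (not from the paper): the abstract says Semantic IDs are produced by a Residual Quantized VAE. The quantization step alone can be sketched as follows, assuming the item embedding already comes from some encoder and that each level has a small fixed codebook. The function names and the toy codebooks are hypothetical; the VAE encoder, decoder, and training loss are omitted.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, codebooks):
    """Quantize level by level, passing the leftover residual to the next codebook.

    Returns the list of chosen indices (the Semantic ID) and the final residual.
    """
    residual = list(vector)
    semantic_id = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        semantic_id.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return semantic_id, residual


# Two levels with two codes each (toy values chosen for illustration only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.1, -0.1]],
]
sid, leftover = residual_quantize([1.1, 0.9], codebooks)
```

Each level refines what the previous level could not represent, so a short tuple of indices can stand in for a full embedding row.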

Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings

填空：在稀疏奖励设置中通过一些演示加速 Q-Learning

Authors: Seyed Mahdi Basiri Azad, Joschka Boedecker
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.24432
Pdf link: https://arxiv.org/pdf/2510.24432
Abstract Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
中文摘要 由于缺乏信息反馈，稀疏奖励环境中的强化学习（RL）仍然是一个重大挑战。我们提出了一种简单而有效的方法，该方法使用少量成功的演示来初始化 RL 智能体的价值函数。通过预先计算离线演示的价值估计并将其用作早期学习的目标，我们的方法为智能体提供了关于有希望动作的有用先验。然后，智能体通过标准在线交互来完善这些估计值。这种混合离线到在线范式显着减轻了探索负担，并提高了稀疏奖励环境中的样本效率。基准任务的实验表明，即使演示数据很少或次优，我们的方法也能加速收敛并优于标准基线。
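
Illustrative sketch (not from the paper): one simple way to precompute value estimates from demonstrations is to take the discounted return-to-go along each successful trajectory and use it to seed a tabular Q-function. The abstract does not specify the estimator, so the Monte Carlo return, the tabular representation, and every name below are assumptions.

```python
from collections import defaultdict


def demo_value_targets(demos, gamma=0.99):
    """Discounted return-to-go for every (state, action) pair seen in the demos.

    demos: list of trajectories, each a list of (state, action, reward) tuples.
    When a pair appears more than once, the largest return is kept.
    """
    targets = {}
    for trajectory in demos:
        g = 0.0
        for state, action, reward in reversed(trajectory):
            g = reward + gamma * g
            key = (state, action)
            if key not in targets or g > targets[key]:
                targets[key] = g
    return targets


def init_q_table(targets, default=0.0):
    """Q-table that returns `default` for unseen pairs and demo targets otherwise."""
    q = defaultdict(lambda: default)
    q.update(targets)
    return q


# One sparse-reward demonstration: reward arrives only at the final step.
demo = [("s0", "right", 0.0), ("s1", "right", 0.0), ("s2", "right", 1.0)]
q = init_q_table(demo_value_targets([demo], gamma=0.9))
```

With the table seeded this way, ordinary Q-learning updates can continue online, and pairs that never appeared in a demonstration simply start from the default value.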

SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

SPARTA：通过文本自动编码器潜在空间中的黑盒对抗释义评估推理分割鲁棒性

Authors: Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.24446
Pdf link: https://arxiv.org/pdf/2510.24446
Abstract Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, which are crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
中文摘要 多模态大型语言模型（MLLM）在推理分割等视觉语言任务中表现出了令人印象深刻的能力，其中模型根据文本查询生成分割掩码。虽然之前的工作主要集中在扰动图像输入上，但语义上等效的文本释义——在用户以不同方式表达相同意图的实际应用中至关重要——仍然没有得到充分探索。为了解决这一差距，我们引入了一种新颖的对抗释义任务：生成语法正确的释义，保留原始查询含义，同时降低分割性能。为了评估对抗性释义的质量，我们开发了一种经过人体研究验证的综合自动评估协议。此外，我们引入了SPARTA——一种黑匣子、句子级优化方法，在强化学习的指导下，在文本自动编码器的低维语义潜在空间中运行。SPARTA 实现了显着更高的成功率，在 ReasonSeg 和 LLMSeg-40k 数据集上的性能都比以前的方法高出 2 倍。我们使用 SPARTA 和竞争基线来评估高级推理分割模型的鲁棒性。我们发现，即使在严格的语义和语法约束下，它们仍然容易受到对抗性释义的影响。所有代码和数据将在接受后公开发布。

Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks

用于脉冲神经网络中序列强化学习的自适应代理梯度

Authors: Korneel Van den Berghe, Stein Stroobants, Vijay Janapa Reddi, G.C.H.E. de Croon
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.24461
Pdf link: https://arxiv.org/pdf/2510.24461
Abstract Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most -200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
中文摘要 神经形态计算系统将通过实现数量级的效率提升，同时实现原生时间处理，彻底改变能量受限的机器人技术。尖峰神经网络（SNN）代表了这些系统的一种有前途的算法方法，但它们在复杂控制任务中的应用面临着两个关键挑战：（1）尖峰神经元的不可微分性质需要具有不明确优化特性的代理梯度，以及（2）SNN的状态动力学需要对序列进行训练，这在强化学习（RL）中受到早期训练期间有限序列长度的阻碍，阻止网络桥接其预热期。我们通过系统地分析替代梯度坡度设置来应对这些挑战，表明较浅的坡度会增加更深层的梯度幅度，但会降低与真实梯度的对齐。在监督学习中，我们没有发现对固定或预定斜率的明确偏好。这种效果在 RL 设置中更为明显，其中较浅的斜坡或预定的斜率会导致训练和最终部署的性能提高 2.1 倍。接下来，我们提出了一种新颖的训练方法，该方法利用特权引导策略来引导学习过程，同时仍然利用在线环境与峰值策略的交互。将我们的方法与现实世界无人机位置控制任务的自适应坡度计划相结合，我们实现了 400 分的平均回报，大大优于之前的技术，包括行为克隆和 TD3BC，后者在相同条件下最多达到 --200 分。这项工作推进了对 SNN 中代理梯度学习的理论理解以及在现实世界机器人系统中展示的神经形态控制器的实践训练方法。
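
Illustrative sketch (not from the paper): the role of the surrogate slope can be seen on a single neuron. The forward pass below is a hard threshold, and the backward pass uses a fast-sigmoid surrogate derivative, 1 / (1 + slope * |v - threshold|)^2, which is one common choice and is assumed here rather than taken from the abstract. The linear schedule is likewise a hypothetical stand-in for the scheduled and adaptive slopes the abstract mentions.

```python
def spike(v, threshold=1.0):
    """Forward pass: emit a spike (1.0) when the membrane potential reaches threshold."""
    return 1.0 if v >= threshold else 0.0


def surrogate_grad(v, slope, threshold=1.0):
    """Backward pass: fast-sigmoid surrogate for the derivative of the spike function."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2


def linear_slope_schedule(step, total_steps, start=2.0, end=25.0):
    """Hypothetical schedule: move linearly from a shallow slope to a steep one."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Same sub-threshold potential, two slopes: the shallow one passes more gradient.
shallow = surrogate_grad(0.5, slope=2.0)
steep = surrogate_grad(0.5, slope=25.0)
```

Away from the threshold, the shallow slope yields a much larger surrogate derivative than the steep one. That matches the direction of the abstract's observation that shallower slopes increase gradient magnitude, though this single-neuron picture says nothing about alignment with the true gradient.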

Sample-efficient and Scalable Exploration in Continuous-Time RL

连续时间 RL 中样本高效且可扩展的探索

Authors: Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.24482
Pdf link: https://arxiv.org/pdf/2510.24482
Abstract Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
中文摘要 强化学习算法通常专为离散时间动态而设计，即使底层的现实世界控制系统在时间上通常是连续的。在本文中，我们研究了连续时间强化学习的问题，其中未知系统动力学使用非线性常微分方程（ODE）表示。我们利用概率模型，如高斯过程和贝叶斯神经网络，来学习底层 ODE 的不确定性感知模型。我们的算法 COMBRL 贪婪地最大化外在奖励与模型认知不确定性的加权和。这为连续时间基于模型的 RL 提供了一种可扩展且样本高效的方法。我们表明，COMBRL 在奖励驱动的设置中实现了次线性遗憾，并且在无监督的 RL 设置（即没有外在奖励）中，我们提供了样本复杂性边界。在我们的实验中，我们在标准和无监督 RL 设置中评估了 COMBRL，并证明它比以前的方法扩展得更好，样本效率更高，并且在几个深度 RL 任务中优于基线。
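
Illustrative sketch (not from the paper): the objective described above, a weighted sum of extrinsic reward and model epistemic uncertainty, can be written as score = reward + weight * uncertainty. The abstract names Gaussian processes and Bayesian neural networks as the model class. This sketch swaps in the disagreement of a small ensemble as a stand-in uncertainty estimate, and all names and numbers are hypothetical.

```python
from statistics import pstdev


def epistemic_uncertainty(ensemble_predictions):
    """Stand-in uncertainty: population standard deviation across ensemble members."""
    return pstdev(ensemble_predictions)


def combined_score(reward, ensemble_predictions, weight):
    """Extrinsic reward plus weighted epistemic uncertainty."""
    return reward + weight * epistemic_uncertainty(ensemble_predictions)


def greedy_choice(candidates, weight):
    """Pick the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(
            candidates[name]["reward"], candidates[name]["predictions"], weight
        ),
    )


candidates = {
    # High reward, and the model ensemble agrees on the outcome.
    "exploit": {"reward": 1.0, "predictions": [0.5, 0.5, 0.5]},
    # Slightly lower reward, but the ensemble disagrees strongly.
    "explore": {"reward": 0.8, "predictions": [0.0, 1.0, 2.0]},
}
```

With weight 0 the choice is purely reward-driven. With no extrinsic reward at all, the same score reduces to pure uncertainty seeking, which corresponds to the unsupervised setting mentioned in the abstract.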

Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks

双思维世界模型：动态无线网络学习的通用框架

Authors: Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.24546
Pdf link: https://arxiv.org/pdf/2510.24546
Abstract Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL) are data inefficient and short-sighted. Such RL-based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long-term planning requirements. To address these limitations, in this paper, a novel dual-mind world model-based learning framework is proposed with the goal of optimizing completeness-weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component to learn dynamics and logic of the wireless network, and to provide long-term link scheduling over reliable imagined trajectories. Link scheduling is learned through end-to-end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real-world physical channel, ray-tracing, and scene objects with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency and achieves strong generalization and adaptation to unseen environments, compared to the state-of-the-art RL baselines, and the world model approach with only System 1.
中文摘要 尽管强化学习（RL）在无线网络中很受欢迎，但依赖无模型 RL （MFRL）和基于模型的 RL （MBRL）的现有方法数据效率低下且目光短浅。这种基于 RL 的解决方案无法推广到新的网络状态，因为它们仅捕获统计模式，而不是从无线数据中捕获底层物理和逻辑。这些限制在具有高动态和长期规划要求的复杂无线网络中变得特别具有挑战性。为了解决这些局限性，本文提出了一种基于双思维世界模型的新型学习框架，目标是在具有挑战性的毫米波 V2X 场景中优化完整性加权信息年龄（CAoI）。受认知心理学的启发，所提出的双思维世界模型包括模式驱动的系统 1 组件和逻辑驱动的系统 2 组件，以学习无线网络的动态和逻辑，并在可靠的想象轨迹上提供长期链路调度。链路调度是通过端到端可微分的想象轨迹来学习的，在扩展的范围内具有逻辑一致性，而不是依赖于从环境交互中获得的无线数据。此外，通过想象力的展开，所提出的世界模型可以联合推理网络状态并规划链路调度。在没有观测的间隔内，所提出的方法仍然能够做出有效的决策。在基于 Sionna 的逼真模拟器上进行了大量实验，具有真实世界的物理通道、光线追踪和具有材料属性的场景对象。仿真结果表明，与最先进的RL基线和仅使用System 1的世界模型方法相比，所提出的世界模型在数据效率方面取得了显著的提高，并实现了对看不见环境的强泛化和适应性。

Towards Quadrupedal Jumping and Walking for Dynamic Locomotion using Reinforcement Learning

使用强化学习实现动态运动的四足跳跃和行走

Authors: Jørgen Anker Olsen, Lars Rønhaug Pettersen, Kostas Alexis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.24584
Pdf link: https://arxiv.org/pdf/2510.24584
Abstract This paper presents a curriculum-based reinforcement learning framework for training precise and high-performance jumping policies for the robot 'Olympus'. Separate policies are developed for vertical and horizontal jumps, leveraging a simple yet effective strategy. First, we densify the inherently sparse jumping reward using the laws of projectile motion. Next, a reference state initialization scheme is employed to accelerate the exploration of dynamic jumping behaviors without reliance on reference trajectories. We also present a walking policy that, when combined with the jumping policies, unlocks versatile and dynamic locomotion capabilities. Comprehensive testing validates walking on varied terrain surfaces and jumping performance that exceeds previous works, effectively crossing the Sim2Real gap. Experimental validation demonstrates horizontal jumps up to 1.25 m with centimeter accuracy and vertical jumps up to 1.0 m. Additionally, we show that with only minor modifications, the proposed method can be used to learn omnidirectional jumping.
中文摘要 本文提出了一个基于课程的强化学习框架，用于训练机器人“奥林巴斯”的精确和高性能跳跃策略。利用简单而有效的策略，为垂直和水平跳转制定了单独的策略。首先，我们使用弹丸运动定律对固有的稀疏跳跃奖励进行致密化。其次，采用参考状态初始化方案，在不依赖参考轨迹的情况下加速探索动态跳跃行为。我们还提出了一种步行政策，当与跳跃政策相结合时，可以释放多功能和动态运动能力。综合测试验证了在各种地形表面上的行走和超越以往作品的跳跃性能，有效地跨越了 Sim2Real 的差距。实验验证表明，水平跳跃可达 1.25 m，精度为厘米，垂直跳跃可达 1.0 m。此外，我们表明，只需进行微小的修改，所提出的方法就可以用于学习全向跳跃。
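
Illustrative sketch (not from the paper): projectile motion lets a jump be scored at takeoff instead of waiting for the landing. For takeoff velocity (vx, vz) and gravity g, the apex height gain is vz^2 / (2g) and, for a landing at takeoff height, the horizontal range is vx * 2vz / g. The reward shape below, an exponential of the range error, and all names and constants are assumptions for illustration; the abstract does not give the actual reward terms.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def predicted_apex_height(vz):
    """Height gained above the takeoff point for vertical takeoff speed vz (m/s)."""
    return vz * vz / (2.0 * G) if vz > 0 else 0.0


def predicted_range(vx, vz):
    """Horizontal distance covered before returning to takeoff height."""
    if vz <= 0:
        return 0.0
    flight_time = 2.0 * vz / G
    return vx * flight_time


def dense_jump_reward(vx, vz, target_distance, scale=0.25):
    """Hypothetical dense reward: 1.0 for a predicted exact hit, decaying with error."""
    error = abs(predicted_range(vx, vz) - target_distance)
    return math.exp(-error / scale)


# A takeoff with a one-second flight time and 2 m/s forward speed covers 2 m.
reward_on_target = dense_jump_reward(2.0, G / 2.0, target_distance=2.0)
reward_short = dense_jump_reward(1.0, G / 2.0, target_distance=2.0)
```

Because the prediction is available the moment the feet leave the ground, every attempted jump yields a graded signal rather than a sparse success or failure.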

Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning

推进精准农业中特定地点的病虫害管理：从推理驱动的基础模型到基于反馈的自适应学习

Authors: Nitin Rai, Daeun (Dana) Choi, Nathan S. Boyd, Arnold W. Schumann
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24650
Pdf link: https://arxiv.org/pdf/2510.24650
Abstract Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approximately 40 articles on FM applications for SSDM, focusing on large language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates and resources, or to submit papers, code, or datasets, visit this https URL.
中文摘要 通过用于实时计算机视觉的机器学习和深度学习（ML 和 DL），作物的特定地点病害管理（SSDM）迅速发展。研究从手工制作的特征提取发展到大规模自动化特征学习。借助基础模型（FM），作物病害数据集现在以全新的方式进行处理。与传统的神经网络不同，FM 集成了视觉和文本数据，解释文本中的症状，推理症状与管理之间的关系，并支持面向种植者和教育工作者的交互式问答。机器人技术中的自适应和模仿学习进一步实现了基于田间的病害管理。本综述筛选了大约 40 篇关于 FM 在 SSDM 中应用的文章，重点关注大型语言模型（LLM）和视觉语言模型（VLM），并讨论了它们在自适应学习（AL）、强化学习（RL）和用于定向喷洒的数字孪生框架中的作用。主要发现：（a）2023-24 年，随着文献的激增，FM 越来越受到关注；（b）VLM 的增长超过 LLM，出版物增加了 5-10 倍；（c）RL 和 AL 在智能喷洒方面仍处于起步阶段；（d）结合 RL 的数字孪生可以虚拟模拟定向喷洒；（e）解决模拟到现实的差距对于实际部署至关重要；（f）人机协作仍然有限，特别是在由机器人检测早期症状、由人类验证不确定病例的人在回路方法中；（g）具有实时反馈的多模态 FM 将驱动下一代 SSDM。如需获取更新和资源，或提交论文、代码或数据集，请访问此 https URL。

Evolving Diagnostic Agents in a Virtual Clinical Environment

在虚拟临床环境中进化诊断智能体

Authors: Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.24654
Pdf link: https://arxiv.org/pdf/2510.24654
Abstract In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnosis process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
中文摘要 在本文中，我们提出了一个框架，用于通过强化学习将大型语言模型（LLM）训练为诊断代理，使它们能够管理多轮诊断过程，自适应地选择检查，并致力于最终诊断。与在静态案例摘要上训练的指令调整模型不同，我们的方法通过交互式探索和基于结果的反馈来获取诊断策略。我们的贡献是四重的：（i）我们推出 DiagGym，这是一种使用电子健康记录训练的诊断世界模型，可根据患者病史和推荐检查发出检查结果，作为现实诊断培训和评估的虚拟临床环境;（ii）我们通过端到端、多轮强化学习训练 DiagAgent，以学习优化信息产量和诊断准确性的诊断策略;（iii）我们引入了 DiagBench，这是一个诊断基准，包括 750 个带有医生验证的检查建议的病例和 99 个带有 973 个医生编写的诊断过程评分标准的病例;（iv）我们在不同的诊断环境中表现出卓越的性能。DiagAgent 的性能明显优于 10 个最先进的 LLM，包括 DeepSeek-v3 和 GPT-4o，以及两个提示工程代理。在单轮设置中，DiagAgent 的诊断准确率提高了 9.34%，检查推荐命中率提高了 44.03%。在端到端环境中，它的诊断准确性提高了 15.12%，检查建议 F1 分数提高了 23.09%。在基于评分标准的评估中，它在加权评分标准分数方面超过了第二好的模型 Claude-sonnet-4 7.1%。这些发现表明，交互式临床环境中的学习政策赋予了仅通过被动训练无法实现的动态和具有临床意义的诊断管理能力。

Learning to Drive Safely with Hybrid Options

学习使用混合选项安全驾驶

Authors: Bram De Cooman, Johan Suykens
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.24674
Pdf link: https://arxiv.org/pdf/2510.24674
Abstract Out of the many deep reinforcement learning approaches for autonomous driving, only a few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in particular. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
中文摘要 在自动驾驶的众多深度强化学习方法中，只有少数使用选项（或技能）框架。这令人惊讶，因为这个框架自然适用于一般的分层控制应用，特别是自动驾驶任务。因此，在这项工作中，选项框架被应用于高速公路上的自动驾驶任务并进行了定制。更具体地说，我们为纵向和横向操作定义了专用选项，并嵌入了安全性和舒适性约束。这样，先验的领域知识可以融入学习过程，并且可以更容易地约束学到的驾驶行为。我们提出了几种带有选项的分层控制设置，并按照最先进的强化学习技术推导出实用算法。通过分别选择纵向和横向控制的动作，所引入的基于组合选项和混合选项的策略获得了与人类驾驶员相同的表现力和灵活性，同时比基于连续动作的经典策略更容易解释。在所有研究的方法中，这些基于混合选项的灵活策略在不同的交通条件下表现最佳，优于基于动作的基线策略。

SPICE: Self-Play In Corpus Environments Improves Reasoning

SPICE：语料库环境中的自我博弈改善了推理能力

Authors: Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.24684
Pdf link: https://arxiv.org/pdf/2510.24684
Abstract Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
中文摘要 自我改进的系统需要与环境交互才能持续适应。我们提出了 SPICE（Self-Play In Corpus Environments，语料库环境中的自我博弈），这是一个强化学习框架，其中单个模型扮演两个角色：挑战者（Challenger）从大型语料库中挖掘文档以生成多样化的推理任务，推理者（Reasoner）则负责解决这些任务。通过对抗性动态，挑战者在推理者能力的前沿创建了一个自动课程，而以语料库为依据则提供了持续改进所需的丰富、近乎取之不尽的外部信号。与收益较为有限的现有无依据自我博弈方法不同，SPICE 在多个模型系列的数学（+8.9%）和通用推理（+9.8%）基准测试中取得了一致的提升。我们的分析揭示了以文档为依据如何成为 SPICE 的关键要素，使其能够不断生成越来越具有挑战性的目标并实现这些目标，从而实现持续的自我改进。

Greedy Sampling Is Provably Efficient for RLHF

贪婪采样在 RLHF 中可证明是高效的

Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.24700
Pdf link: https://arxiv.org/pdf/2510.24700
Abstract Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
中文摘要 基于人类反馈的强化学习（Reinforcement Learning from Human Feedback，RLHF）已成为大型语言模型后训练的关键技术。尽管在经验上取得了成功，但对 RLHF 的理论理解仍然有限，因为与经典 RL 相比，仅使用偏好反馈学习 KL 正则化目标会带来额外的挑战。现有工作大多研究基于奖励的 Bradley-Terry（BT）偏好模型，并利用乐观或悲观原则扩展经典设计。相反，本工作考虑一般偏好模型（其实际相关性最近已被观察到），并获得了相比现有结果有阶数级重大改进的性能保证。令人惊讶的是，这些结果来自直接使用经验估计（即贪婪采样）的算法，而不是像以往工作那样构建乐观或悲观估计。这一见解深深植根于 KL 正则化目标下最优策略类的独特结构性质，我们进一步将其特化到 BT 模型，凸显了贪婪采样在 RLHF 中令人惊讶的充分性。
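
The abstract does not spell out the structural property it refers to. For orientation only, the sketch below shows the standard closed form of the KL-regularized objective in the reward-based case: maximizing E[r] - beta * KL(pi || pi_ref) over a finite response set gives pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). "Greedy" is read here as plugging an empirical reward estimate straight into that formula with no optimism or pessimism bonus. The function name and the toy numbers are assumptions, and the paper's general preference setting is not covered.

```python
import math
from typing import List, Sequence


def kl_regularized_policy(ref_probs: Sequence[float],
                          reward_estimates: Sequence[float],
                          beta: float) -> List[float]:
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_ref) over a finite set.

    ref_probs must be strictly positive; reward_estimates are used as-is
    (no confidence bonus is added or subtracted).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, reward_estimates)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def objective(policy: Sequence[float], ref_probs: Sequence[float],
              rewards: Sequence[float], beta: float) -> float:
    """Value of E[r] - beta * KL(policy || ref) for a strictly positive policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs))
    return expected_reward - beta * kl
```

For example, `kl_regularized_policy([0.5, 0.5], [0.0, math.log(3)], beta=1.0)` shifts mass toward the second response in a 1:3 ratio, and its `objective` value is at least that of the reference policy.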

Keyword: diffusion policy

Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation

用于鲁棒多任务机器人操作的语言条件表示和专家混合策略

Authors: Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.24055
Pdf link: https://arxiv.org/pdf/2510.24055
Abstract Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
中文摘要 感知模糊性和任务冲突限制了基于模仿学习的多任务机器人操作。我们提出了一个结合语言条件视觉表示（LCVR）模块和语言条件专家混合密度策略（LMoE-DP）的框架。LCVR 通过将视觉特征与语言指令相结合来解决感知模糊性，从而能够区分视觉上相似的任务。为了减轻任务冲突，LMoE-DP 使用稀疏专家架构来专门处理不同的多模态动作分布，并通过梯度调制进行稳定。在真实机器人基准测试中，LCVR 将 Action Chunking with Transformers（ACT）和 Diffusion Policy（DP）的成功率分别提高了 33.75% 和 25%。完整框架取得了 79% 的平均成功率，比先进基线高出 21%。我们的工作表明，将语义锚定与专家专业化相结合可以实现稳健、高效的多任务操作。
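
The abstract says only that LMoE-DP uses a sparse expert architecture. As a generic illustration of sparse routing, not the paper's design, the sketch below keeps the k highest gate scores, renormalizes them with a softmax, and mixes only the selected experts' outputs. The scalar action, the function names, and the toy experts are assumptions; language conditioning and gradient modulation are not modeled.

```python
import math
from typing import Callable, List, Sequence


def top_k_gate(scores: Sequence[float], k: int) -> List[float]:
    """Softmax over the k largest scores; every other expert gets weight 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # sorted() is stable, so ties keep their original expert order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]


def moe_output(x: float,
               experts: Sequence[Callable[[float], float]],
               scores: Sequence[float],
               k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts are skipped."""
    weights = top_k_gate(scores, k)
    out = 0.0
    for w, expert in zip(weights, experts):
        if w > 0.0:  # sparse: experts with zero weight are never evaluated
            out += w * expert(x)
    return out
```

In practice the gate scores would come from a learned router conditioned on the observation and instruction; here they are passed in directly to keep the example self-contained.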