Generated: 2025-11-03 16:31:26 (UTC+8); arXiv release time: 2025-11-03 20:00 EST (2025-11-04 09:00 UTC+8)
26 related papers today
Keyword: reinforcement learning
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
视觉语言模型能达到标准吗?使用 MeasureBench 对视觉测量读数进行基准测试
- Authors: Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.26865
- Pdf link: https://arxiv.org/pdf/2510.26865
- Abstract
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to large numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising results on real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
- 中文摘要
读取测量仪器对人类来说毫不费力,并且需要相对较少的领域专业知识,但正如我们在初步评估中发现的那样,对于当前的视觉语言模型 (VLM) 来说,它仍然具有惊人的挑战性。在这项工作中,我们介绍了 MeasureBench,这是一个视觉测量读数的基准,涵盖各种类型测量的真实世界和合成图像,以及用于数据合成的可扩展管道。我们的管道按程序生成具有可控视觉外观的指定类型的仪表,从而实现指针、刻度、字体、照明和杂乱等关键细节的可扩展变化。对流行的专有和开放权重 VLM 的评估表明,即使是最强大的前沿 VLM 通常也难以读取测量。一种一致的故障模式是指标定位:模型可以读取数字或标签,但错误地识别指针或对齐方式的关键位置,从而导致很大的数字错误,尽管文本推理是合理的。我们还对合成数据进行了强化学习的初步实验,并在域内合成子集上发现了令人鼓舞的结果,但对现实世界的图像不太有希望。我们的分析强调了当前VLM在细粒度空间接地方面的一个基本局限性。我们希望该资源能够帮助未来在视觉基础计算和 VLM 的精确空间感知方面取得进展,弥合识别数字和测量世界之间的差距。
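The indicator-localization failure described in the MeasureBench abstract can be illustrated with a minimal sketch. It assumes a hypothetical dial whose scale is linear in the pointer angle; `read_dial` and all numbers below are illustrative and are not taken from the paper.

```python
def read_dial(angle_deg, angle_min, angle_max, value_min, value_max):
    """Linearly map a pointer angle to a reading on a hypothetical linear dial."""
    fraction = (angle_deg - angle_min) / (angle_max - angle_min)
    return value_min + fraction * (value_max - value_min)


# Assumed dial: 270 degrees of sweep covering 0..100 units.
true_reading = read_dial(135.0, 0.0, 270.0, 0.0, 100.0)
# Misplacing the pointer by 10 degrees shifts the reading by 100 * 10 / 270
# units, even if every digit printed on the scale was read correctly.
wrong_reading = read_dial(145.0, 0.0, 270.0, 0.0, 100.0)
error = wrong_reading - true_reading
```

On this assumed dial, the numeric error depends only on the angular error and the scale span, which is one way to see why correct digit recognition does not guarantee a correct reading.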
A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
方差感知强盗算法的公平评估框架
- Authors: Elise Wolf
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27001
- Pdf link: https://arxiv.org/pdf/2510.27001
- Abstract
Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
- 中文摘要
多臂强盗 (MAB) 问题是更复杂的强化学习算法的基本构建块。然而,由于缺乏标准化条件和可复制性,评估和比较 MAB 算法仍然具有挑战性。这对于经典方法(如 UCB)的方差感知扩展尤其成问题,其性能可能在很大程度上取决于底层环境。在这项研究中,我们讨论了如何可靠地观察强盗算法之间的性能差异,以及在什么条件下方差感知算法优于经典算法。我们提出了一种可重复的评估,旨在系统地比较八种经典和方差感知的MAB算法。该评估框架在我们的 Bandit Playground 代码库中实现,具有明确定义的实验设置、多个绩效指标(奖励、遗憾、奖励分配、风险价值和行动最优性)以及支持一致和透明分析的交互式评估界面。我们表明,方差感知算法可以在高不确定性的环境中提供优势,在这些环境中,困难源于手臂奖励之间的细微差异。相比之下,经典算法通常在更可分离的场景中或经过广泛微调的情况下表现得同样好或更好。我们的贡献是双重的:(1)用于系统评估MAB算法的框架,以及(2)深入了解方差感知方法优于经典方法的条件。
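To make the contrast between classical and variance-aware indices in the bandit abstract concrete, the sketch below compares the UCB1 index with one commonly cited form of the UCB-V index. The abstract does not list which eight algorithms are compared, so the choice of these two, and the defaults for `reward_range`, `zeta`, and `c`, are assumptions.

```python
import math


def ucb1_index(mean, pulls, t):
    """Classical UCB1 index: empirical mean plus a variance-agnostic bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def ucbv_index(mean, variance, pulls, t, reward_range=1.0, zeta=1.0, c=1.0):
    """UCB-V style index: the bonus scales with the arm's empirical variance."""
    explore = zeta * math.log(t)
    variance_term = math.sqrt(2.0 * variance * explore / pulls)
    range_term = 3.0 * c * reward_range * explore / pulls
    return mean + variance_term + range_term


# An arm with low empirical variance gets a tighter bonus under UCB-V,
# while a high-variance arm (0.25 is the maximum for rewards in [0, 1]) does not.
low_var = ucbv_index(0.5, 0.01, pulls=100, t=1000)
high_var = ucbv_index(0.5, 0.25, pulls=100, t=1000)
classical = ucb1_index(0.5, pulls=100, t=1000)
```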
Algorithmic Predation: Equilibrium Analysis in Dynamic Oligopolies with Smooth Market Sharing
算法掠夺:动态寡头垄断中平滑市场共享的均衡分析
- Authors: Fabian Raoul Pieroth, Ole Petersen, Martin Bichler
- Subjects: Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
- Arxiv link: https://arxiv.org/abs/2510.27008
- Pdf link: https://arxiv.org/pdf/2510.27008
- Abstract
Predatory pricing -- where a firm strategically lowers prices to undermine competitors -- is a contentious topic in dynamic oligopoly theory, with scholars debating practical relevance and the existence of predatory equilibria. Although finite-horizon dynamic models have long been proposed to capture the strategic intertemporal incentives of oligopolists, the existence and form of equilibrium strategies in settings that allow for firm exit (drop-outs following loss-making periods) have remained an open question. We focus on the seminal dynamic oligopoly model by Selten (1965) that introduces the subgame perfect equilibrium and analyzes smooth market sharing. Equilibrium can be derived analytically in models that do not allow for dropouts, but not in models that can lead to predatory pricing. In this paper, we leverage recent advances in deep reinforcement learning to compute and verify equilibria in finite-horizon dynamic oligopoly games. Our experiments reveal two key findings: first, state-of-the-art deep reinforcement learning algorithms reliably converge to equilibrium in both perfect- and imperfect-information oligopoly models; second, when firms face asymmetric cost structures, the resulting equilibria exhibit predatory pricing behavior. These results demonstrate that predatory pricing can emerge as a rational equilibrium strategy across a broad variety of model settings. By providing equilibrium analysis of finite-horizon dynamic oligopoly models with drop-outs, our study answers a decade-old question and offers new insights for competition authorities and regulators.
- 中文摘要
掠夺性定价——公司战略性地降低价格以削弱竞争对手——是动态寡头垄断理论中一个有争议的话题,学者们争论实际相关性和掠夺性均衡的存在。尽管长期以来一直提出有限视界动态模型来捕捉寡头垄断者的战略跨期激励,但在允许公司退出(亏损期后退出)的环境中均衡策略的存在和形式仍然是一个悬而未决的问题。我们重点关注 Selten (1965) 的开创性动态寡头垄断模型,该模型引入了子博弈完美均衡并分析了平滑的市场共享。均衡可以在不允许退出的模型中分析得出,但在可能导致掠夺性定价的模型中则不能得出。在本文中,我们利用深度强化学习的最新进展来计算和验证有限视界动态寡头垄断博弈中的均衡。我们的实验揭示了两个关键发现:首先,最先进的深度强化学习算法在完美和不完美信息寡头垄断模型中可靠地收敛到平衡;其次,当企业面临不对称的成本结构时,由此产生的均衡表现出掠夺性定价行为。这些结果表明,掠夺性定价可以作为一种跨各种模型设置的理性均衡策略出现。通过对具有退出的有限视界动态寡头垄断模型进行均衡分析,我们的研究回答了一个十年前的问题,并为竞争主管部门和监管机构提供了新的见解。
e1: Learning Adaptive Control of Reasoning Effort
e1:学习推理努力的自适应控制
- Authors: Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.27042
- Pdf link: https://arxiv.org/pdf/2510.27042
- Abstract
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
- 中文摘要
增加人工智能模型的思维预算可以显着提高准确性,但并非所有问题都需要同等数量的推理。用户可能更愿意分配不同数量的推理工作,具体取决于他们如何评估输出质量与延迟和成本。为了有效地利用这种权衡,用户需要对用于特定查询的思维量进行细粒度控制,但很少有方法能够实现这种控制。现有方法要求用户指定所需令牌的绝对数量,但这需要事先了解问题的难度,以便适当地设置查询的令牌预算。为了解决这些问题,我们提出了自适应努力控制,这是一种自适应强化学习方法,它训练模型使用用户指定的标记分数,相对于每个查询的当前平均思维链长度。与标准方法相比,这种方法消除了特定于数据集和阶段的调整,同时产生了更好的成本精度权衡曲线。用户可以通过在推理时指定的连续努力参数动态调整成本与准确性的权衡。我们观察到,该模型会自动学习根据任务难度按比例分配资源,并且在从 1.5B 到 32B 参数的模型规模上,我们的方法可以将思维链长度减少约 3 倍,同时保持或提高相对于用于 RL 训练的基本模型的性能。
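The e1 abstract describes a token budget expressed as a user-specified fraction of the current average chain-of-thought length. A minimal sketch of that idea follows; the reward shape, the `weight` parameter, and both function names are hypothetical, since the abstract does not specify how the budget enters the training objective.

```python
def target_length(effort, avg_cot_tokens):
    """Token target as a user-specified fraction of the current average length."""
    return effort * avg_cot_tokens


def length_penalized_reward(correct, used_tokens, effort, avg_cot_tokens, weight=0.5):
    """Hypothetical reward: correctness minus a penalty for exceeding the target."""
    target = target_length(effort, avg_cot_tokens)
    overshoot = max(0.0, used_tokens - target) / avg_cot_tokens
    return (1.0 if correct else 0.0) - weight * overshoot
```

Because the target is relative to the running average for a query, the user sets only `effort` and never needs to know an absolute token count in advance, which is the property the abstract emphasizes.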
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
RLVR 中泛化的局限性:数学推理的两个案例研究
- Authors: Md Tanvirul Alam, Nidhi Rastogi
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.27044
- Pdf link: https://arxiv.org/pdf/2510.27044
- Abstract
Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at this https URL.
- 中文摘要
数学推理是大型语言模型 (LLM) 面临的核心挑战,不仅需要正确的答案,还需要忠实的推理过程。具有可验证奖励的强化学习 (RLVR) 已成为增强此类能力的一种有前途的方法;然而,它促进真正推理的能力仍不清楚。我们使用具有唯一最优值的精心策划的数据集,研究了两个具有完全可验证解决方案的组合问题的 RLVR:活动调度和最长递增子序列。在多种奖励设计中,我们发现 RLVR 改进了评估指标,但通常是通过强化肤浅的启发式方法而不是获得新的推理策略。这些发现凸显了 RLVR 泛化的局限性,强调了将真正的数学推理与捷径利用分开并提供忠实的进步衡量标准的基准的重要性。代码可在此 https URL 中找到。
SpikeATac: A Multimodal Tactile Finger with Taxelized Dynamic Sensing for Dexterous Manipulation
SpikeATac:具有触元化动态传感的多模态触觉手指,可实现灵巧操作
- Authors: Eric T. Chang, Peter Ballentine, Zhanpeng He, Do-Gon Kim, Kai Jiang, Hua-Hsuan Liang, Joaquin Palacios, William Wang, Pedro Piacenza, Ioannis Kymissis, Matei Ciocarlie
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.27048
- Pdf link: https://arxiv.org/pdf/2510.27048
- Abstract
In this work, we introduce SpikeATac, a multimodal tactile finger combining a taxelized and highly sensitive dynamic response (PVDF) with a static transduction method (capacitive) for multimodal touch sensing. Named for its 'spiky' response, SpikeATac's 16-taxel PVDF film sampled at 4 kHz provides fast, sensitive dynamic signals to the very onset and breaking of contact. We characterize the sensitivity of the different modalities, and show that SpikeATac provides the ability to stop quickly and delicately when grasping fragile, deformable objects. Beyond parallel grasping, we show that SpikeATac can be used in a learning-based framework to achieve new capabilities on a dexterous multifingered robot hand. We use a learning recipe that combines reinforcement learning from human feedback with tactile-based rewards to fine-tune the behavior of a policy to modulate force. Our hardware platform and learning pipeline together enable a difficult dexterous and contact-rich task that has not previously been achieved: in-hand manipulation of fragile objects. Videos are available at this https URL.
- 中文摘要
在这项工作中,我们介绍了 SpikeATac,这是一种多模态触觉手指,将触元化和高灵敏度动态响应 (PVDF) 与静态转导方法(电容式)相结合,用于多模态触摸传感。SpikeATac 的 16 触元 PVDF 薄膜以其“尖峰”响应而得名,以 4 kHz 的频率采样,可在接触开始和断开时提供快速、灵敏的动态信号。我们表征了不同模态的灵敏度,并表明 SpikeATac 在抓取易碎、可变形的物体时能够快速、精细地停止。除了平行抓取之外,我们还表明 SpikeATac 可以在基于学习的框架中使用,以在灵巧的多指机械手上实现新功能。我们使用一种学习配方,将人类反馈的强化学习与基于触觉的奖励相结合,以微调策略的行为以调节力量。我们的硬件平台和学习管道共同实现了以前从未实现过的困难、灵巧且接触丰富的任务:手内操作易碎物体。视频可在此 https URL 获得。
Adaptive Human-Computer Interaction Strategies Through Reinforcement Learning in Complex
复合体中通过强化学习的自适应人机交互策略
- Authors: Rui Liu, Yifan Zhuang, Runsheng Zhang
- Subjects: Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/2510.27058
- Pdf link: https://arxiv.org/pdf/2510.27058
- Abstract
This study addresses the challenges of dynamics and complexity in intelligent human-computer interaction and proposes a reinforcement learning-based optimization framework to improve long-term returns and overall experience. Human-computer interaction is modeled as a Markov decision process, with state space, action space, reward function, and discount factor defined to capture the dynamics of user input, system feedback, and interaction environment. The method combines policy function, value function, and advantage function, updates parameters through policy gradient, and continuously adjusts during interaction to balance immediate feedback and long-term benefits. To validate the framework, multimodal dialog and scene-aware datasets are used as the experimental platform, with multiple sensitivity experiments conducted on key factors such as discount factor, exploration rate decay, environmental noise, and data imbalance. Evaluation is carried out using cumulative reward, average episode reward, convergence speed, and task success rate. Results show that the proposed method outperforms existing approaches across several metrics, achieving higher task completion while maintaining strategy stability. Comparative experiments further confirm its advantages in interaction efficiency and long-term return, demonstrating the significant value of reinforcement learning in optimizing human-computer interaction.
- 中文摘要
本研究针对智能人机交互中的动态性和复杂性挑战,提出了一种基于强化学习的优化框架,以提高长期回报和整体体验。人机交互建模为马尔可夫决策过程,定义状态空间、动作空间、奖励函数和贴现因子,以捕捉用户输入、系统反馈和交互环境的动态。该方法结合了政策功能、价值功能和优势功能,通过策略梯度更新参数,并在交互过程中不断调整,以平衡即时反馈和长期收益。为验证该框架,以多模态对话和场景感知数据集为实验平台,对贴现因子、勘探率衰减、环境噪声、数据不平衡等关键因素进行了多重灵敏度实验。使用累积奖励、平均集奖励、收敛速度和任务成功率进行评估。结果表明,所提方法在多个指标上优于现有方法,在保持策略稳定性的同时实现了更高的任务完成率。对比实验进一步证实了其在交互效率和长期回报方面的优势,证明了强化学习在优化人机交互方面的重要价值。
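The human-computer interaction abstract above balances immediate feedback against long-term benefit through a discount factor and an advantage function. The sketch below shows the standard textbook computation of discounted returns and baseline-subtracted advantages; it is a generic illustration, not the paper's implementation, and the function names are assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimate: observed return minus the value baseline."""
    return [g - v for g, v in zip(returns, values)]
```

A smaller `gamma` weights immediate feedback more heavily, which is why the abstract lists the discount factor among its sensitivity experiments.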
Towards Understanding Self-play for LLM Reasoning
理解 LLM 推理的自我游戏
- Authors: Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.27072
- Pdf link: https://arxiv.org/pdf/2510.27072
- Abstract
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
- 中文摘要
以具有可验证奖励的强化学习 (RLVR) 为首的大型语言模型 (LLM) 推理的最新进展激发了训练后的自我游戏,其中模型通过生成和解决自己的问题来改进。虽然自我游戏在域内和域外都显示出强大的收益,但这些改进背后的机制仍然知之甚少。在这项工作中,我们通过绝对零推理器的视角分析了自我游戏的训练动态,并将其与 RLVR 和监督微调 (SFT) 进行了比较。我们的研究检查了参数更新稀疏性、词元 (token) 分布的熵动力学和替代提议者奖励函数。我们进一步使用 pass@k 评估将这些动态与推理性能联系起来。总之,我们的研究结果阐明了自我游戏与其他训练后策略的不同之处,强调了其固有的局限性,并指出了通过自我游戏改进 LLM 数学推理的未来方向。
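The self-play abstract above evaluates reasoning performance with pass@k. A minimal sketch of the widely used unbiased estimator is shown here: with `n` samples per problem of which `c` are correct, pass@k is one minus the probability that `k` samples drawn without replacement are all incorrect. Whether the paper uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```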
AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys
AURA:用于人工智能驱动的自适应对话式调查的强化学习框架
- Authors: Jinwen Tang, Yi Shang
- Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.27126
- Pdf link: https://arxiv.org/pdf/2510.27126
- Abstract
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
- 中文摘要
传统的在线调查提供的个性化有限,通常会导致参与度低和反应肤浅。尽管人工智能调查聊天机器人提高了便利性,但大多数仍然是被动的:它们依赖于固定的对话树或静态提示模板,因此无法在会话中进行调整以适应单个用户,这导致跟进笼统,响应质量较差。我们通过 AURA(通过强化学习进行评估的自适应理解)解决了这些限制,AURA 是一种用于人工智能驱动的自适应对话调查的强化学习框架。AURA 使用四维 LSDE 指标(长度、自我披露、情绪和特异性)量化响应质量,并通过 epsilon-greedy 策略选择后续问题类型,该策略更新每个会话中的预期质量增益。该系统使用从 96 次先前的校园气候对话(总共 467 次聊天机器人与用户交换)中提取的先验进行初始化,在 10-15 次对话交流中平衡探索和利用,实时动态适应单个参与者。在对照评估中,AURA 在响应质量方面实现了 +0.12 的平均增益,并且比非适应性基线有统计学意义的改善 (p=0.044,d=0.66),这得益于规范提示减少 63% 和验证行为增加 10 倍。这些结果表明,强化学习可以提高调查聊天机器人的适应性,将静态问卷转变为交互式、自我改进的评估系统。
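The AURA abstract describes an epsilon-greedy policy that starts from priors and updates the expected quality gain of each follow-up question type within a session. The sketch below shows that selection-and-update loop under stated assumptions: the class name, the incremental-mean update, and treating each prior as one pseudo-observation are illustrative choices, not the authors' implementation.

```python
import random


class EpsilonGreedySelector:
    """Pick follow-up question types by estimated quality gain (illustrative)."""

    def __init__(self, priors, epsilon=0.1, seed=None):
        self.values = dict(priors)
        # Assumption: each prior counts as a single earlier observation.
        self.counts = {name: 1 for name in priors}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.values))
        return max(self.values, key=self.values.get)

    def update(self, name, quality_gain):
        # Incremental mean of the observed quality gains for this question type.
        self.counts[name] += 1
        self.values[name] += (quality_gain - self.values[name]) / self.counts[name]
```

With hypothetical priors such as `{"specification": 0.2, "validation": 0.1}`, a poor observed gain after a specification prompt lowers its estimate, so later turns in the same session shift toward validation.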
Disrupting Networks: Amplifying Social Dissensus via Opinion Perturbation and Large Language Models
颠覆网络:通过舆论扰动和大型语言模型放大社会分歧
- Authors: Erica Coppolillo, Giuseppe Manco
- Subjects: Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2510.27152
- Pdf link: https://arxiv.org/pdf/2510.27152
- Abstract
We study how targeted content injection can strategically disrupt social networks. Using the Friedkin-Johnsen (FJ) model, we utilize a measure of social dissensus and show that (i) simple FJ variants cannot significantly perturb the network, (ii) extending the model enables valid graph structures where disruption at equilibrium exceeds the initial state, and (iii) altering an individual's inherent opinion can maximize disruption. Building on these insights, we design a reinforcement learning framework to fine-tune a Large Language Model (LLM) for generating disruption-oriented text. Experiments on synthetic and real-world data confirm that tuned LLMs can approach theoretical disruption limits. Our findings raise important considerations for content moderation, adversarial information campaigns, and generative model regulation.
- 中文摘要
我们研究有针对性的内容注入如何战略性地破坏社交网络。使用 Friedkin-Johnsen (FJ) 模型,我们利用社会分歧的衡量标准,表明 (i) 简单的 FJ 变体不会显着扰动网络,(ii) 扩展模型可以实现平衡时的破坏超过初始状态的有效图结构,以及 (iii) 改变个人的固有观点可以最大限度地破坏。基于这些见解,我们设计了一个强化学习框架来微调大型语言模型 (LLM),以生成面向颠覆的文本。对合成和真实世界数据的实验证实,经过调整的 LLM 可以接近理论上的颠覆极限。我们的研究结果提出了对内容审核、对抗性信息活动和生成模型监管的重要考虑。
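For readers unfamiliar with the Friedkin-Johnsen model used in the abstract above, the following sketch iterates the standard FJ update with unit stubbornness until it settles, then evaluates a simple edge-wise disagreement score. The disagreement function is an assumed stand-in, because the abstract does not define its dissensus measure, and `weights` is assumed symmetric with a zero diagonal.

```python
def fj_equilibrium(weights, innate, iterations=200):
    """Iterate z_i <- (s_i + sum_j w_ij z_j) / (1 + sum_j w_ij) synchronously."""
    n = len(innate)
    z = list(innate)
    for _ in range(iterations):
        z = [
            (innate[i] + sum(weights[i][j] * z[j] for j in range(n)))
            / (1.0 + sum(weights[i]))
            for i in range(n)
        ]
    return z


def disagreement(weights, opinions):
    """Assumed measure: weighted squared opinion gap summed over edges."""
    n = len(opinions)
    return sum(
        weights[i][j] * (opinions[i] - opinions[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Two connected users with opposite innate opinions.
w = [[0.0, 1.0], [1.0, 0.0]]
s = [0.0, 1.0]
z = fj_equilibrium(w, s)
```

In this two-node toy example the expressed opinions settle at 1/3 and 2/3, so disagreement at equilibrium is smaller than among the innate opinions.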
ShapleyPipe: Hierarchical Shapley Search for Data Preparation Pipeline Construction
ShapleyPipe:用于数据准备管道构建的分层 Shapley 搜索
- Authors: Jing Chang, Chang Liu, Jinbin Huang, Shuyuan Zheng, Rui Mao, Jianbin Qin
- Subjects: Databases (cs.DB)
- Arxiv link: https://arxiv.org/abs/2510.27168
- Pdf link: https://arxiv.org/pdf/2510.27168
- Abstract
Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying individual operator contributions, and they struggle with the combinatorial explosion of the search space ($N^M$ configurations for $N$ operators and pipeline length $M$). We introduce ShapleyPipe, a principled framework that leverages game-theoretic Shapley values to systematically quantify each operator's marginal contribution while maintaining full interpretability. Our key innovation is a hierarchical decomposition that separates category-level structure search from operator-level refinement, reducing the search complexity from exponential to polynomial. To make Shapley computation tractable, we develop: (1) a Multi-Armed Bandit mechanism for intelligent category evaluation with provable convergence guarantees, and (2) Permutation Shapley values to correctly capture position-dependent operator interactions. Extensive evaluation on 18 diverse datasets demonstrates that ShapleyPipe achieves 98.1% of high-budget baseline performance while using 24% fewer evaluations, and outperforms the state-of-the-art reinforcement learning method by 3.6%. Beyond performance gains, ShapleyPipe provides interpretable operator valuations ($\rho = 0.933$ correlation with empirical performance) that enable data-driven pipeline analysis and systematic operator library refinement.
- 中文摘要
自动化数据准备管道构建对于机器学习的成功至关重要,但现有方法存在两个基本限制:它们将管道构建视为黑盒优化,而不量化单个运算符的贡献,并且它们与搜索空间的组合爆炸作斗争(N 个运算符和管道长度 M 的 $N^M$ 配置)。我们介绍了 ShapleyPipe,这是一个有原则的框架,它利用博弈论 Shapley 值来系统地量化每个算子的边际贡献,同时保持完全的可解释性。我们的主要创新是分层分解,它将类别级结构搜索与运算符级细化分开,将搜索复杂性从指数降低到多项式。为了使 Shapley 计算易于处理,我们开发了:(1) 一种多臂强盗机制,用于具有可证明收敛保证的智能类别评估,以及 (2) 排列 Shapley 值以正确捕获与位置相关的算子交互。对 18 个不同数据集的广泛评估表明,ShapleyPipe 实现了 98.1% 的高预算基线性能,同时使用的评估次数减少了 24%,并且比最先进的强化学习方法高出 3.6%。除了性能提升之外,ShapleyPipe 还提供可解释的运算符估值($\rho$=0.933 与经验性能相关),从而实现数据驱动的流水线分析和系统的运算符库细化。
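The ShapleyPipe abstract relies on permutation-based Shapley values that respect operator order. The sketch below computes exact values by enumerating every ordering and averaging each operator's marginal contribution to an order-dependent score. The operator names, `toy_score`, and the example sizes are hypothetical, and exhaustive enumeration is exactly the cost the paper's hierarchical search is designed to avoid.

```python
from itertools import permutations


def permutation_shapley(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        prefix = ()
        for p in order:
            totals[p] += value(prefix + (p,)) - value(prefix)
            prefix += (p,)
    return {p: totals[p] / len(orders) for p in players}


def toy_score(pipeline):
    """Hypothetical pipeline score with a bonus when imputation precedes scaling."""
    score = 0.0
    if "impute" in pipeline:
        score += 0.1
    if "scale" in pipeline:
        score += 0.2
    if "impute" in pipeline and "scale" in pipeline:
        if pipeline.index("impute") < pipeline.index("scale"):
            score += 0.3
    return score


phi = permutation_shapley(("impute", "scale"), toy_score)
# Size of the flat search space for an assumed N = 20 operators and length M = 5.
search_space = 20 ** 5
```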
GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation
GUI-Rise:用于 GUI 导航的结构化推理和历史摘要
- Authors: Tao Liu, Chongyu Wang, Rongjie Li, Yingchen Yu, Xuming He, Bai Song
- Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.27210
- Pdf link: https://arxiv.org/pdf/2510.27210
- Abstract
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at this https URL.
- 中文摘要
虽然多模态大型语言模型(MLLM)推动了GUI导航代理的发展,但当前的方法在跨域泛化和有效历史利用方面面临局限性。我们提出了一个推理增强框架,系统地集成了结构化推理、行动预测和历史总结。结构化推理组件生成结合进度估计和决策推理的连贯思维链分析,为未来步骤的即时行动预测和紧凑的历史摘要提供信息。基于这个框架,我们通过对伪标记轨迹的监督微调和使用组相对策略优化(GRPO)的强化学习来训练一个GUI代理 GUI-Rise。该框架采用专门的奖励,包括历史感知目标,将摘要质量与后续行动绩效直接联系起来。对标准基准的综合评估展示了在相同训练数据条件下的最先进的结果,在域外场景中性能特别强。这些发现验证了我们的框架在不同的 GUI 导航任务中保持稳健推理和泛化的能力。代码可在此 https URL 中找到。
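GUI-Rise is trained with Group Relative Policy Optimization (GRPO). The core idea of GRPO's advantage estimate is to score each sampled response relative to the other responses for the same prompt. The sketch below shows that normalization only; the use of the population standard deviation and the `eps` constant are assumptions, and the paper's specialized reward terms are not modeled.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all responses in a group receive the same reward, every advantage is zero, so that group contributes no learning signal.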
MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
MedCalc-Eval 和 MedCalc-Env:推进大型语言模型的医学计算能力
- Authors: Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27267
- Pdf link: https://arxiv.org/pdf/2510.27267
- Abstract
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at this https URL.
- 中文摘要
随着大型语言模型 (LLM) 进入医学领域,大多数基准测试都通过问答或描述性推理来评估它们,而忽略了对临床决策至关重要的定量推理。MedCalc-Bench 等现有数据集涵盖的计算任务很少,无法反映现实世界的计算场景。我们介绍了 MedCalc-Eval,这是评估 LLM 医学计算能力的最大基准,包括两种类型的 700+ 任务:基于方程的计算(例如 Cockcroft-Gault、BMI、BSA)和基于规则的评分系统(例如 Apgar、Glasgow Coma Scale)。这些任务涵盖内科、外科、儿科和心脏病学等不同专业,提供了更广泛、更具挑战性的评估环境。为了提高性能,我们进一步开发了 MedCalc-Env,这是一个基于 InternBootcamp 框架构建的强化学习环境,可实现多步骤临床推理和规划。在这种环境中微调Qwen2.5-32B模型可以在MedCalc-Eval上获得最先进的结果,在数值灵敏度、公式选择和推理鲁棒性方面取得了显着的提升。剩下的挑战包括单位转换、多条件逻辑和上下文理解。代码和数据集可在此 https URL 中找到。
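Two of the equation-based tasks named in the MedCalc-Eval abstract, BMI and Cockcroft-Gault, can be written out as a minimal sketch using their standard textbook forms. The function names and the unit-conversion helper are illustrative assumptions and say nothing about how the benchmark phrases or scores these tasks.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def creatinine_umol_to_mg_dl(umol_per_l):
    """Convert serum creatinine from micromol/L to mg/dL (1 mg/dL = 88.4 micromol/L)."""
    return umol_per_l / 88.4


def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault equation)."""
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance
```

The conversion helper reflects the unit-conversion challenge the abstract mentions: passing a creatinine value in micromol/L directly to `cockcroft_gault` would give an estimate that is off by a factor of 88.4.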
Inferring trust in recommendation systems from brain, behavioural, and physiological data
从大脑、行为和生理数据推断对推荐系统的信任
- Authors: Vincent K.M. Cheung, Pei-Cheng Shih, Masato Hirano, Masataka Goto, Shinichi Furuya
- Subjects: Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/2510.27272
- Pdf link: https://arxiv.org/pdf/2510.27272
- Abstract
As people nowadays increasingly rely on artificial intelligence (AI) to curate information and make decisions, assigning the appropriate amount of trust in automated intelligent systems has become ever more important. However, current measurements of trust in automation still largely rely on self-reports that are subjective and disruptive to the user. Here, we take music recommendation as a model to investigate the neural and cognitive processes underlying trust in automation. We observed that system accuracy was directly related to users' trust and modulated the influence of recommendation cues on music preference. Modelling users' reward encoding process with a reinforcement learning model further revealed that system accuracy, expected reward, and prediction error were related to oscillatory neural activity recorded via EEG and changes in pupil diameter. Our results provide a neurally grounded account of calibrating trust in automation and highlight the promises of a multimodal approach towards developing trustable AI systems.
- 中文摘要
随着当今人们越来越依赖人工智能 (AI) 来管理信息和做出决策,对自动化智能系统分配适当的信任变得越来越重要。然而,目前对自动化信任度的衡量仍然在很大程度上依赖于自我报告,这些自我报告对用户来说是主观的和破坏性的。在这里,我们以音乐推荐为模型,研究对自动化信任背后的神经和认知过程。我们观察到,系统准确性与用户的信任度直接相关,并调节推荐线索对音乐偏好的影响。使用强化学习模型对用户的奖励编码过程进行建模进一步表明,系统准确性、预期奖励和预测误差与脑电图记录的振荡神经活动和瞳孔直径的变化有关。我们的研究结果提供了校准自动化信任的神经基础解释,并强调了开发可信人工智能系统的多模态方法的前景。
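The trust abstract above relates expected reward and prediction error to neural and pupil signals. A common way to obtain such trial-by-trial quantities is a Rescorla-Wagner style delta rule, sketched below. The abstract does not say which reinforcement learning model was fitted, so this specific rule and the `learning_rate` default are assumptions.

```python
def prediction_errors(rewards, learning_rate=0.1, initial_value=0.0):
    """Return per-trial (expected_reward, prediction_error) pairs and the final value."""
    value = initial_value
    trace = []
    for reward in rewards:
        delta = reward - value          # prediction error on this trial
        trace.append((value, delta))
        value += learning_rate * delta  # move the expectation toward the outcome
    return trace, value
```

The per-trial pairs in `trace` are the kind of regressors that could then be compared against EEG or pupil measurements.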
A Digital Twin-based Multi-Agent Reinforcement Learning Framework for Vehicle-to-Grid Coordination
基于数字孪生的车网协调多智能体强化学习框架
- Authors: Zhengchang Hua, Panagiotis Oikonomou, Karim Djemame, Nikos Tziritas, Georgios Theodoropoulos
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2510.27289
- Pdf link: https://arxiv.org/pdf/2510.27289
- Abstract
The coordination of large-scale, decentralised systems, such as a fleet of Electric Vehicles (EVs) in a Vehicle-to-Grid (V2G) network, presents a significant challenge for modern control systems. While collaborative Digital Twins have been proposed as a solution to manage such systems without compromising the privacy of individual agents, deriving globally optimal control policies from the high-level information they share remains an open problem. This paper introduces the Digital Twin Assisted Multi-Agent Deep Deterministic Policy Gradient (DT-MADDPG) algorithm, a novel hybrid architecture that integrates a multi-agent reinforcement learning framework with a collaborative DT network. Our core contribution is a simulation-assisted learning algorithm where the centralised critic is enhanced by a predictive global model that is collaboratively built from the privacy-preserving data shared by individual DTs. This approach removes the need for collecting sensitive raw data at a centralised entity, a requirement of traditional multi-agent learning algorithms. Experimental results in a simulated V2G environment demonstrate that DT-MADDPG can achieve coordination performance comparable to the standard MADDPG algorithm while offering significant advantages in terms of data privacy and architectural decentralisation. This work presents a practical and robust framework for deploying intelligent, learning-based coordination in complex, real-world cyber-physical systems.
- 中文摘要
大规模分散系统的协调,例如车辆到电网 (V2G) 网络中的电动汽车 (EV) 车队,对现代控制系统提出了重大挑战。虽然协作数字孪生已被提议作为一种在不损害单个代理隐私的情况下管理此类系统的解决方案,但从它们共享的高级信息中得出全局最佳控制策略仍然是一个悬而未决的问题。本文介绍了数字孪生辅助多智能体深度确定性策略梯度(DT-MADDPG)算法,这是一种将多智能体强化学习框架与协作DT网络相结合的新型混合架构。我们的核心贡献是一种模拟辅助学习算法,其中集中批评者通过预测全局模型得到增强,该模型是根据各个 DT 共享的隐私保护数据协作构建的。这种方法消除了在集中式实体中收集敏感原始数据的需要,这是传统多代理学习算法的要求。在模拟V2G环境中的实验结果表明,DT-MADDPG可以实现与标准MADDPG算法相当的协调性能,同时在数据隐私和架构去中心化方面具有显著优势。这项工作为在复杂的现实世界网络物理系统中部署基于学习的智能协调提供了一个实用且强大的框架。
Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines
针对远视野无序任务的强化学习:从布尔到耦合奖励机
- Authors: Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27329
- Pdf link: https://arxiv.org/pdf/2510.27329
- Abstract
Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
- 中文摘要
奖励机 (RM) 向强化学习智能体提供环境的奖励结构信息。这对于复杂的非马尔可夫任务特别有利,因为能够访问 RM 的智能体可以用更少的样本更高效地学习。然而,使用 RM 进行学习并不适合长时程问题,在这些问题中,一组子任务可以按任何顺序执行。在这种情况下,要学习的信息量随着无序子任务的数量呈指数级增长。在这项工作中,我们通过引入 RM 的三种推广来解决这一限制:(1) 数值 RM 允许用户以紧凑的形式表达复杂的任务。(2) 在议程 RM 中,状态与跟踪剩余待完成子任务的议程相关联。(3) 耦合 RM 具有与议程中每个子任务相关联的耦合状态。此外,我们引入了一种利用耦合 RM 的新组合学习算法:基于耦合 RM 的 Q 学习 (CoRM)。我们的实验表明,对于具有无序子任务的长时程问题,CoRM 比最先进的 RM 算法具有更好的可扩展性。
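
The scaling argument in this abstract can be illustrated with a small sketch. It is not the paper's formalism: the helper names `boolean_rm_states` and `agenda_step` are hypothetical, and the reward rule (1.0 when the last subtask is completed) is an assumption made only for illustration. The first function shows why a machine that needs one state per set of completed subtasks grows as 2^n; the second shows an agenda that simply tracks the remaining subtasks.

```python
from itertools import combinations


def boolean_rm_states(subtasks):
    """Enumerate one machine state per subset of completed subtasks (2**n states)."""
    states = []
    for k in range(len(subtasks) + 1):
        states.extend(frozenset(c) for c in combinations(subtasks, k))
    return states


def agenda_step(agenda, event):
    """Drop a completed subtask from the agenda.

    Assumed reward rule: 1.0 on the step that empties a non-empty agenda, else 0.0.
    """
    remaining = agenda - {event}
    reward = 1.0 if agenda and not remaining else 0.0
    return remaining, reward


if __name__ == "__main__":
    for n in (3, 10, 20):
        # For large n, the count is computed directly rather than enumerated.
        print(n, "unordered subtasks ->", 2 ** n, "subset states")
    agenda = frozenset({"get_key", "get_coin"})
    agenda, r = agenda_step(agenda, "get_coin")
    print(sorted(agenda), r)
```

The agenda itself is still a set, so this sketch only shows the bookkeeping; how the paper's coupled RMs avoid learning over every subset is not specified in the abstract.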
Reasoning Models Sometimes Output Illegible Chains of Thought
推理模型有时会输出难以辨认的思维链
- Authors: Arun Jose
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.27338
- Pdf link: https://arxiv.org/pdf/2510.27338
- Abstract
Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling, suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
- 中文摘要
通过基于结果的强化学习 (RL) 训练的语言模型使用思维链 (CoT) 进行推理,表现出卓越的性能。监控此类模型的 CoT 可能使我们能够了解其意图并检测潜在的恶意行为。然而,为了有效,这要求 CoT 清晰且忠实。我们研究了 14 个推理模型的 CoT 易读性,发现 RL 通常会导致推理对人类和 AI 监视器都变得难以辨认,推理模型(Claude 除外)生成难以辨认的 CoT,同时返回完全可读的最终答案。我们表明,模型使用难以辨认的推理来得出正确答案(当被迫仅使用清晰的部分时,准确率下降了 53%),但在重采样时发现易读性和性能之间没有相关性,这表明这种关系更加微妙。我们还发现,在较困难的问题上,易读性会下降。我们讨论了这些结果的潜在假设,包括隐写术、训练伪影和遗迹标记。这些结果表明,如果没有对易读性进行显式优化,基于结果的 RL 自然会产生推理过程越来越不透明的模型,从而可能破坏监控方法。
Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints
使用具有人类感知运动约束的多智能体 RL 进行真实的行人-驾驶员交互建模
- Authors: Yueyang Wang, Mehmet Dogar, Gustav Markkula
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27383
- Pdf link: https://arxiv.org/pdf/2510.27383
- Abstract
Modelling pedestrian-driver interactions is critical for understanding human road user behaviour and developing safe autonomous vehicle systems. Existing approaches often rely on rule-based logic, game-theoretic models, or 'black-box' machine learning methods. However, these models typically lack flexibility or overlook the underlying mechanisms, such as sensory and motor constraints, which shape how pedestrians and drivers perceive and act in interactive scenarios. In this study, we propose a multi-agent reinforcement learning (RL) framework that integrates both visual and motor constraints of pedestrian and driver agents. Using a real-world dataset from an unsignalised pedestrian crossing, we evaluate four model variants, one without constraints, two with either motor or visual constraints, and one with both, across behavioural metrics of interaction realism. Results show that the combined model with both visual and motor constraints performs best. Motor constraints lead to smoother movements that resemble human speed adjustments during crossing interactions. The addition of visual constraints introduces perceptual uncertainty and field-of-view limitations, leading the agents to exhibit more cautious and variable behaviour, such as less abrupt deceleration. In this data-limited setting, our model outperforms a supervised behavioural cloning model, demonstrating that our approach can be effective without large training datasets. Finally, our framework accounts for individual differences by modelling parameters controlling the human constraints as population-level distributions, a perspective that has not been explored in previous work on pedestrian-vehicle interaction modelling. Overall, our work demonstrates that multi-agent RL with human constraints is a promising modelling approach for simulating realistic road user interactions.
- 中文摘要
对行人与驾驶员的互动进行建模对于了解人类道路使用者行为和开发安全的自动驾驶汽车系统至关重要。现有方法通常依赖于基于规则的逻辑、博弈论模型或“黑盒”机器学习方法。然而,这些模型通常缺乏灵活性或忽视潜在机制,例如感觉和运动限制,这些机制决定了行人和驾驶员在交互场景中的感知和行为方式。在这项研究中,我们提出了一种多智能体强化学习(RL)框架,该框架集成了行人和驾驶员智能体的视觉和运动约束。使用来自无信号人行横道的真实世界数据集,我们评估了四种模型变体(一种没有约束,两种分别具有运动约束或视觉约束,一种同时具有两种约束),并以衡量交互真实性的行为指标进行比较。结果表明,同时具有视觉约束和运动约束的组合模型表现最佳。运动约束带来更平稳的运动,类似于过街交互期间人类的速度调整。视觉约束的加入引入了感知不确定性和视野限制,导致智能体表现出更加谨慎和多变的行为,例如不那么突然的减速。在这种数据有限的环境中,我们的模型优于监督行为克隆模型,这表明我们的方法在没有大型训练数据集的情况下也可以有效。最后,我们的框架通过将控制人类约束的参数建模为群体水平分布来解释个体差异,这一视角在之前的行人-车辆交互建模工作中尚未被探索。总的来说,我们的工作表明,具有人类约束的多智能体 RL 是一种很有前途的建模方法,用于模拟真实的道路使用者交互。
DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
DeepCompress:动态探索和压缩推理链的双重奖励策略
- Authors: Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.27419
- Pdf link: https://arxiv.org/pdf/2510.27419
- Abstract
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
- 中文摘要
大型推理模型 (LRM) 已表现出令人印象深刻的能力,但存在认知效率低下的问题,例如“过度思考”简单问题和“思考不足”复杂问题。虽然使用监督微调 (SFT) 或带有 token 长度奖励的强化学习 (RL) 的现有方法可以提高效率,但它们往往以牺牲准确性为代价。本文介绍了 DeepCompress,这是一个新颖的框架,可以同时提高 LRM 的准确性和效率。我们挑战了始终偏好较短推理路径的流行方法,表明较长的响应可以包含针对困难问题的更广泛的正确解决方案。DeepCompress 采用自适应长度奖励机制,根据模型不断发展的能力,实时将问题动态分类为“简单”或“困难”。它鼓励对“简单”问题进行更短、更高效的推理,同时促进对“困难”问题进行更长、更具探索性的思维链。这种双重奖励策略使模型能够自主调整其思维链 (CoT) 长度,压缩对掌握良好的问题的推理,并对它认为具有挑战性的问题延长推理。在具有挑战性的数学基准测试上的实验结果表明,DeepCompress 始终优于基线方法,实现更高的准确性,同时显著提高 token 效率。
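
The dual-reward idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepCompress implementation: the function name `length_rewards`, the `weight` parameter, the z-score normalisation, and the rule "a problem is Simple when its pass rate among sampled responses is at least the current batch-level pass rate" are all hypothetical choices, since the abstract does not give the exact formulas.

```python
from statistics import mean, pstdev


def length_rewards(lengths, correct, batch_pass_rate, weight=0.2):
    """Correctness reward plus a length term whose sign depends on difficulty.

    lengths:         token counts of the responses sampled for one problem
    correct:         booleans, one per response
    batch_pass_rate: assumed proxy for the model's current capability
    """
    pass_rate = mean([1.0 if c else 0.0 for c in correct])
    simple = pass_rate >= batch_pass_rate  # assumed Simple/Hard rule
    mu, sigma = mean(lengths), pstdev(lengths)
    rewards = []
    for n, c in zip(lengths, correct):
        z = 0.0 if sigma == 0 else (n - mu) / sigma
        # Simple: shorter than average is rewarded. Hard: longer is rewarded.
        bonus = -z if simple else z
        rewards.append((1.0 if c else 0.0) + weight * bonus)
    return rewards


if __name__ == "__main__":
    print(length_rewards([100, 200, 300], [True, True, True], batch_pass_rate=0.5))
    print(length_rewards([100, 200, 300], [False, False, True], batch_pass_rate=0.5))
```

Because the threshold is the batch-level pass rate, the same problem can move from "Hard" to "Simple" as training progresses, which is the behaviour the abstract attributes to the adaptive mechanism.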
Learning Soft Robotic Dynamics with Active Exploration
通过主动探索学习软机器人动力学
- Authors: Hehui Zheng, Bhavya Sukhija, Chenhao Li, Klemens Iten, Andreas Krause, Robert K. Katzschmann
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27428
- Pdf link: https://arxiv.org/pdf/2510.27428
- Abstract
Soft robots offer unmatched adaptability and safety in unstructured environments, yet their compliant, high-dimensional, and nonlinear dynamics make modeling for control notoriously difficult. Existing data-driven approaches often fail to generalize, constrained by narrowly focused task demonstrations or inefficient random exploration. We introduce SoftAE, an uncertainty-aware active exploration framework that autonomously learns task-agnostic and generalizable dynamics models of soft robotic systems. SoftAE employs probabilistic ensemble models to estimate epistemic uncertainty and actively guides exploration toward underrepresented regions of the state-action space, achieving efficient coverage of diverse behaviors without task-specific supervision. We evaluate SoftAE on three simulated soft robotic platforms -- a continuum arm, an articulated fish in fluid, and a musculoskeletal leg with hybrid actuation -- and on a pneumatically actuated continuum soft arm in the real world. Compared with random exploration and task-specific model-based reinforcement learning, SoftAE produces more accurate dynamics models, enables superior zero-shot control on unseen tasks, and maintains robustness under sensing noise, actuation delays, and nonlinear material effects. These results demonstrate that uncertainty-driven active exploration can yield scalable, reusable dynamics models across diverse soft robotic morphologies, representing a step toward more autonomous, adaptable, and data-efficient control in compliant robots.
- 中文摘要
软体机器人在非结构化环境中提供无与伦比的适应性和安全性,但其柔顺、高维和非线性的动力学使得面向控制的建模异常困难。现有的数据驱动方法通常无法泛化,受到范围狭窄的任务演示或低效的随机探索的限制。我们介绍了 SoftAE,这是一个不确定性感知的主动探索框架,可以自主学习软机器人系统的与任务无关且可泛化的动力学模型。SoftAE 采用概率集成模型来估计认知不确定性,并主动引导探索状态-动作空间中代表性不足的区域,实现对不同行为的有效覆盖,而无需特定任务的监督。我们在三个模拟的软机器人平台(连续体臂、流体中的铰接鱼和具有混合驱动的肌肉骨骼腿)以及现实世界中的气动驱动连续体软臂上评估了 SoftAE。与随机探索和特定任务的基于模型的强化学习相比,SoftAE 生成了更准确的动力学模型,能够对未见过的任务进行更优的零样本控制,并在传感噪声、驱动延迟和非线性材料效应下保持鲁棒性。这些结果表明,不确定性驱动的主动探索可以在不同的软机器人形态中产生可扩展、可重用的动力学模型,代表着柔顺机器人朝着更加自主、适应性更强和数据高效的控制迈出了一步。
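
Uncertainty-driven exploration of the kind this abstract describes is often approximated by ensemble disagreement. The sketch below assumes that approximation: `disagreement` and `pick_exploratory_action` are hypothetical names, each ensemble member is a plain callable returning a predicted next state as a list of floats, and candidate actions are scored by the summed per-dimension variance of those predictions. SoftAE's actual models and exploration objective are not given in the abstract.

```python
from statistics import pvariance


def disagreement(ensemble, state, action):
    """Sum over state dimensions of the variance across ensemble predictions."""
    preds = [model(state, action) for model in ensemble]  # one list of floats per member
    return sum(pvariance(dim) for dim in zip(*preds))


def pick_exploratory_action(ensemble, state, candidates):
    """Choose the candidate action on which the ensemble disagrees most."""
    return max(candidates, key=lambda a: disagreement(ensemble, state, a))


if __name__ == "__main__":
    # Toy 1-D dynamics models that agree near action 0 and diverge for large actions.
    ensemble = [
        lambda s, a: [s + 1.0 * a],
        lambda s, a: [s + 2.0 * a],
        lambda s, a: [s + 3.0 * a],
    ]
    print(pick_exploratory_action(ensemble, state=1.0, candidates=[0.0, 0.5, 1.0]))
```

Data gathered at the selected action would then be added to the training set and the ensemble refit, so that disagreement shrinks where the dynamics have already been covered.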
VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision
VCORE:基于方差控制优化的思维链监督重加权
- Authors: Xuan Gong, Senmiao Wang, Hanbo Huang, Ruoyu Sun, Shiyu Liang
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27462
- Pdf link: https://arxiv.org/pdf/2510.27462
- Abstract
Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at this https URL.
- 中文摘要
长思维链 (CoT) 轨迹上的监督微调 (SFT) 已成为增强大型语言模型 (LLM) 推理能力的关键技术。然而,标准交叉熵损失平等对待所有 token,忽略了它们在推理轨迹中的异构贡献。这种统一的处理会导致监督分配错误和泛化能力薄弱,特别是在复杂的长篇推理任务中。为了解决这个问题,我们引入了 Variance-Controlled Optimization-based REweighting (VCORE),这是一个有原则的框架,将 CoT 监督重新表述为约束优化问题。通过采用优化理论视角,VCORE 实现了跨 token 的有原则且自适应的监督分配,从而使训练目标与鲁棒推理泛化的目标更加紧密地保持一致。实证评估表明,VCORE 的表现始终优于现有的 token 重加权方法。在域内和域外设置中,VCORE 使用 Qwen3 系列(4B、8B、32B)和 LLaMA-3.1-8B-Instruct 的模型,在数学和编码基准测试上实现了显著的性能提升。此外,我们表明 VCORE 可以作为后续强化学习的更有效初始化,为提高 LLM 的推理能力奠定更坚实的基础。代码将在此 https URL 上发布。
Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval
Interact-RAG:超越黑盒检索的推理和与语料库交互
- Authors: Yulong Hui, Chao Chen, Zhihang Fu, Yihao Liu, Jieping Ye, Huanchen Zhang
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.27566
- Pdf link: https://arxiv.org/pdf/2510.27566
- Abstract
Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents' actions to query issuing, hindering their ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.
- 中文摘要
检索增强生成 (RAG) 通过整合外部信息显著增强了 LLM。然而,主流的智能体式 RAG 方法受到一个关键局限的制约:它们将检索过程视为黑盒查询操作。这使智能体的动作仅限于发出查询,阻碍了其处理复杂信息搜索任务的能力。为了解决这个问题,我们引入了 Interact-RAG,这是一种新范式,它将 LLM 智能体从被动的查询发出者提升为检索过程的主动操纵者。我们用语料库交互引擎拆解了黑盒,为智能体配备了一组动作原语,用于对信息检索进行细粒度控制。为了进一步增强智能体在整个 RAG 管道上的能力,我们首先开发了一个推理增强的工作流程,它既支持零样本执行,也支持交互轨迹的合成。然后,我们利用这些合成数据通过监督微调 (SFT) 训练完全自主的端到端智能体,然后通过强化学习 (RL) 进行细化。跨六个基准的广泛实验表明,Interact-RAG 的性能明显优于其他先进方法,验证了我们的推理交互策略的有效性。
MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval
MARAG-R1:通过强化学习的多工具代理检索超越单个检索器
- Authors: Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.27569
- Pdf link: https://arxiv.org/pdf/2510.27569
- Abstract
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; however, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
- 中文摘要
大型语言模型(LLM)擅长推理和生成,但本质上受到静态预训练数据的限制,导致事实不准确和对新信息的适应性较弱。检索增强生成 (RAG) 通过将 LLM 建立在外部知识的基础上来解决这个问题;然而,RAG 的有效性在很大程度上取决于模型是否能够充分访问相关信息。现有的 RAG 系统依赖于具有固定 top-k 选择的单个检索器,将访问限制在语料库的一个狭窄且静态的子集上。因此,这种单检索器范式已成为全面外部信息获取的主要瓶颈,特别是在需要语料库级推理的任务中。为了克服这一限制,我们提出了 MARAG-R1,这是一种强化学习的多工具 RAG 框架,使 LLM 能够动态协调多种检索机制,以实现更广泛、更精确的信息访问。MARAG-R1 为模型配备了四种检索工具(语义搜索、关键字搜索、过滤和聚合),并通过两阶段训练过程学习如何以及何时使用它们:先进行监督微调,再进行强化学习。这种设计允许模型将推理和检索交错,逐步收集足够的证据进行语料库级综合。GlobalQA、HotpotQA 和 2WikiMultiHopQA 上的实验表明,MARAG-R1 的性能大大优于强基线,并在语料库级推理任务中取得了新的最先进的结果。
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Spatial-SSRL:通过自监督强化学习增强空间理解
- Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27606
- Pdf link: https://arxiv.org/pdf/2510.27606
- Abstract
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
- 中文摘要
空间理解仍然是大型视觉语言模型 (LVLM) 的弱点。现有的监督微调 (SFT) 和最近的具有可验证奖励的强化学习 (RLVR) 管道依赖于昂贵的监督、专用工具或限制规模的受限环境。我们引入了 Spatial-SSRL,这是一种自监督 RL 范式,可直接从普通 RGB 或 RGB-D 图像中获取可验证信号。Spatial-SSRL自动制定5个捕捉2D和3D空间结构的借口任务:打乱的补丁重排序、翻转的补丁识别、裁剪的补丁修复、区域深度排序和相对3D位置预测。这些任务提供易于验证且不需要人工或 LVLM 注释的真实答案。对我们的任务进行训练可以大大提高空间推理能力,同时保留一般视觉能力。在图像和视频设置的七个空间理解基准测试中,Spatial-SSRL 比 Qwen2.5-VL 基线提供了 4.63% (3B) 和 3.89% (7B) 的平均精度提升。我们的结果表明,简单的内在监督能够大规模地实现 RLVR,并为 LVLM 中更强的空间智能提供了一条实用途径。
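
One of the five pretext tasks named in this abstract, shuffled patch reordering, lends itself to a short sketch of how a verifiable label can come from the image alone. The code below is an illustration under assumptions, not the Spatial-SSRL pipeline: the helper names, the 2x2 grid, the answer encoding, and the exact-match reward are hypothetical, and a nested list stands in for an image array.

```python
import random


def split_patches(image, grid):
    """Split a 2D image (list of rows) into grid x grid patches in row-major order."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]


def make_reorder_task(image, grid=2, seed=0):
    """Shuffle the patches and return (shuffled_patches, answer).

    answer[k] is the index in the shuffled list that holds original patch k,
    so the label is derived from the image with no human annotation.
    """
    patches = split_patches(image, grid)
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    answer = [order.index(k) for k in range(len(patches))]
    return shuffled, answer


def reward(prediction, answer):
    """Assumed verifiable reward: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if list(prediction) == answer else 0.0


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    shuffled, answer = make_reorder_task(image)
    print(answer, reward(answer, answer))
```

In an RL loop, the shuffled patches would be shown to the model and its predicted ordering scored with `reward`, which is cheap to check and needs no external tool.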
Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
开放智能体系统中多智能体强化学习的信用分配挑战
- Authors: Alireza Saleh Abadi, Leen-Kiat Soh
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2510.27659
- Pdf link: https://arxiv.org/pdf/2510.27659
- Abstract
In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynamic nature of agent populations, tasks, and agent types within a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabilities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, introducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
- 中文摘要
在快速发展的多智能体强化学习(MARL)领域,了解开放系统的动态至关重要。MARL 中的开放性是指系统中智能体群体、任务和智能体类型的动态性质。具体来说,有三种类型的开放性,如 (Eck et al. 2023) [2] 中报道的那样:智能体开放性,智能体可以随时进入或离开系统;任务开放性,新任务出现,现有任务演变或消失;以及类型开放性,其中智能体的能力和行为会随着时间的推移而变化。本报告提供了概念和实证回顾,重点关注开放性与信用分配问题 (CAP) 之间的相互作用。CAP 涉及确定单个智能体对整体系统性能的贡献,这项任务在开放环境中变得越来越复杂。传统的信用分配 (CA) 方法通常假设静态的智能体群体、固定和预定义的任务以及平稳的类型,因此它们不适合开放系统。我们首先进行概念分析,引入新的开放性子类别,以详细说明智能体更替或任务取消等事件如何打破支撑现有 CAP 方法的环境平稳性和固定团队组成的假设。然后,我们在开放环境中使用具有代表性的时间和结构算法进行实证研究。结果表明,开放性直接导致信用错误归因,损失函数不稳定和性能显著下降就证明了这一点。
Keyword: diffusion policy
EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
EBT-Policy:能量解锁涌现的物理推理能力
- Authors: Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.27545
- Pdf link: https://arxiv.org/pdf/2510.27545
- Abstract
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
- 中文摘要
由生成模型参数化的隐式策略,如 Diffusion Policy,已成为机器人技术中策略学习和视觉-语言-动作(VLA)模型的标准。然而,这些方法往往存在计算成本高、暴露偏差和推理动力学不稳定等问题,导致在分布偏移下发散。基于能量的模型 (EBM) 通过端到端学习能量景观和对平衡动力学进行建模来解决这些问题,从而提高稳健性并减少暴露偏差。然而,由 EBM 参数化的策略历来难以有效扩展。最近关于基于能量的 Transformer(EBT)的工作证明了 EBM 在高维空间的可扩展性,但它们在解决物理具身模型中的核心挑战方面的潜力仍未得到充分探索。我们引入了一种新的基于能量的架构 EBT-Policy,它解决了机器人和现实世界场景中的核心问题。在模拟和真实任务中,EBT-Policy 始终优于基于扩散的策略,同时需要更少的训练和推理计算。值得注意的是,在某些任务中,它仅需两个推理步骤即可收敛,与 Diffusion Policy 的 100 步相比减少了 50 倍。此外,EBT-Policy 表现出以前模型中没有的涌现能力,例如仅使用行为克隆而无需显式重试训练即可从失败的动作序列中零样本恢复。通过利用其标量能量进行不确定性感知推理和动态计算分配,EBT-Policy 为在分布偏移下实现稳健、可泛化的机器人行为提供了一条有前途的道路。