Arxiv Papers of Today

Generated: 2026-01-26 16:35:47 (UTC+8); Arxiv release time: 2026-01-26 20:00 EST (2026-01-27 09:00 UTC+8)

16 relevant papers today

Keyword: reinforcement learning

A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning

Authors: Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.16399
Pdf link: https://arxiv.org/pdf/2601.16399
Abstract We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).
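To illustrate why an attenuating entropy regularization can become asymptotically unbiased, here is a minimal sketch for a one-state MDP (a bandit). It is not the paper's actor-critic algorithm: the reward vector and the decay schedule `tau_k = 1 / sqrt(k + 1)` are hypothetical. It relies only on the standard fact that the entropy-regularized optimal policy over a simplex is a softmax of the rewards divided by the temperature.

```python
import numpy as np

def soft_optimal_policy(rewards, tau):
    """Maximizer of <pi, r> + tau * H(pi) over the simplex: softmax(r / tau)."""
    z = (rewards - rewards.max()) / tau  # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def regularization_gap(rewards, tau):
    """Unregularized value lost by acting with the entropy-regularized policy."""
    pi = soft_optimal_policy(rewards, tau)
    return float(rewards.max() - pi @ rewards)

rewards = np.array([1.0, 0.5, 0.0])  # hypothetical one-state MDP (bandit)
for k in (0, 10, 100, 1000):
    tau_k = 1.0 / np.sqrt(k + 1)  # hypothetical attenuation schedule
    print(k, regularization_gap(rewards, tau_k))
```

The printed gap shrinks as the temperature decays, which is the intuition behind letting the regularization weight vanish over iterations instead of fixing it.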

Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

Authors: Zongwan Cao, Bingbing Wen, Lucy Lu Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.16400
Pdf link: https://arxiv.org/pdf/2601.16400
Abstract Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY with a set of ambiguous VQA questions and the contrast set that is non-ambiguous. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
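The control flow described in the abstract (decide whether to ask, ask one focused question, then answer with the reply) can be sketched as below. This is only an illustration of the ask-or-answer loop: the four callables are hypothetical stand-ins for model components, and nothing here reflects the actual CoA or GRPO-CR implementation.

```python
def clarify_or_answer(image, question, needs_clarification, ask, answer, get_reply):
    """Ask at most one clarifying question, then answer.

    All four callables are hypothetical placeholders:
      needs_clarification(image, question) -> bool
      ask(image, question) -> str            (the clarifying question)
      get_reply(clarifying_question) -> str  (e.g. from the user)
      answer(image, question, context) -> str
    """
    if needs_clarification(image, question):
        clarifying_question = ask(image, question)
        context = get_reply(clarifying_question)
        return answer(image, question, context)
    # The pair is judged fully specified: answer directly, with no extra context.
    return answer(image, question, None)
```

Keeping the ask-or-answer decision separate from question generation mirrors the abstract's point that the two are modeled separately.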

Towards a Theoretical Understanding to the Generalization of RLHF

Authors: Zhaochun Li (1,2), Mingyang Yi (3), Yue Wang (2), Shisheng Cui (1), Yong Liu (3) ((1) Beijing Institute of Technology, (2) Zhongguancun Academy, (3) Renmin University of China)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.16403
Pdf link: https://arxiv.org/pdf/2601.16403
Abstract Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build the generalization theory on RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to the existing works built upon the consistency of maximum likelihood estimations on reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of policy model have a generalization bound of order $\mathcal{O}(n^{-\frac{1}{2}})$. Moreover, the results can be extrapolated to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.

Reinforcement Learning-Based Energy-Aware Coverage Path Planning for Precision Agriculture

Authors: Beining Wu, Zihao Ding, Leo Ostigaard, Jun Huang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.16405
Pdf link: https://arxiv.org/pdf/2601.16405
Abstract Coverage Path Planning (CPP) is a fundamental capability for agricultural robots; however, existing solutions often overlook energy constraints, resulting in incomplete operations in large-scale or resource-limited environments. This paper proposes an energy-aware CPP framework grounded in Soft Actor-Critic (SAC) reinforcement learning, designed for grid-based environments with obstacles and charging stations. To enable robust and adaptive decision-making under energy limitations, the framework integrates Convolutional Neural Networks (CNNs) for spatial feature extraction and Long Short-Term Memory (LSTM) networks for temporal dynamics. A dedicated reward function is designed to jointly optimize coverage efficiency, energy consumption, and return-to-base constraints. Experimental results demonstrate that the proposed approach consistently achieves over 90% coverage while ensuring energy safety, outperforming traditional heuristic algorithms such as Rapidly-exploring Random Tree (RRT), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) baselines by 13.4-19.5% in coverage and reducing constraint violations by 59.9-88.3%. These findings validate the proposed SAC-based framework as an effective and scalable solution for energy-constrained CPP in agricultural robotics.
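The abstract describes a reward that jointly scores coverage, energy use, and return-to-base feasibility but gives no formula. A minimal sketch of how such a per-step reward could be composed follows; the `StepInfo` fields, the weights, and the penalty rule are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    newly_covered: bool    # did this move cover a previously unvisited cell?
    energy_used: float     # energy spent on this move
    battery: float         # energy remaining after the move
    energy_to_base: float  # estimated energy needed to reach a charging station

def step_reward(info, w_cover=1.0, w_energy=0.1, w_unsafe=5.0):
    """Hypothetical shaped reward: coverage bonus minus energy cost minus a safety penalty."""
    reward = w_cover * float(info.newly_covered)
    reward -= w_energy * info.energy_used
    # Penalize states from which the robot could no longer reach a charger.
    if info.battery < info.energy_to_base:
        reward -= w_unsafe
    return reward
```

For example, `step_reward(StepInfo(True, 1.0, 10.0, 3.0))` gives a coverage bonus of 1.0 minus an energy cost of 0.1, while the same move with only 2.0 units of battery left also incurs the safety penalty.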

Endless Terminals: Scaling RL Environments for Terminal Agents

Authors: Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.16443
Pdf link: https://arxiv.org/pdf/2601.16443
Abstract Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
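A binary episode-level reward driven by completion tests can be sketched as follows. This is an assumption-laden illustration: the abstract does not say how tests are executed (presumably inside the containerized environment), so the tests here are plain Python callables, and treating an empty test list or a crashing test as failure is a design choice made for this sketch only.

```python
from typing import Callable, Iterable

def episode_reward(completion_tests: Iterable[Callable[[], bool]]) -> float:
    """Return 1.0 only if every completion test passes, else 0.0."""
    tests = list(completion_tests)
    if not tests:
        return 0.0  # nothing to verify: do not reward (sketch-specific choice)
    for test in tests:
        try:
            if not test():
                return 0.0
        except Exception:
            return 0.0  # a crashing test counts as a failed episode
    return 1.0
```

Because the reward is assigned once per episode, the agent receives no partial credit for intermediate steps, which matches the "binary episode-level" wording in the abstract.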

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, Kai Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.16447
Pdf link: https://arxiv.org/pdf/2601.16447
Abstract Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: this https URL.

Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16486
Pdf link: https://arxiv.org/pdf/2601.16486
Abstract As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
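The idea of budgeting by wall-clock time instead of generation length can be sketched with a simple agent loop. This is a minimal illustration, not Timely-RL: the `step` callable, which performs one reasoning or tool-call step and returns an answer or `None`, is hypothetical, and the clock is injectable so that tool latency can be simulated.

```python
import time

def run_with_time_budget(step, budget_s, clock=time.monotonic):
    """Call step(remaining_s) until it returns an answer or the budget expires.

    Returns (answer_or_None, number_of_steps_taken).
    """
    deadline = clock() + budget_s
    steps = 0
    answer = None
    while answer is None:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # wall-clock budget exhausted
        # The step sees the remaining time, so it can adapt its strategy.
        answer = step(remaining)
        steps += 1
    return answer, steps
```

Under the same budget, a step with lower latency gets more interactions before the deadline, which is the trade-off the abstract explores when it varies tool latency.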

UAV-Assisted Joint Data Collection and Wireless Power Transfer for Batteryless Sensor Networks

Authors: Wen Zhang, Aimin Wang, Geng Sun, Jiahui Li, Jiacheng Wang, Changyuan Zhao, Dusit Niyato
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.16533
Pdf link: https://arxiv.org/pdf/2601.16533
Abstract The development of wireless power transfer (WPT) and Internet of Things (IoT) offers significant potential but faces challenges such as limited energy supply, dynamic environmental changes, and unstable transmission links. This paper presents an unmanned aerial vehicle (UAV)-assisted data collection and WPT scheme to support batteryless sensor (BLS) networks in remote areas. In this system, BLSs harvest energy from the UAV and utilize the harvested energy to transmit the collected data back to the UAV. The goal is to maximize the collected data volume and fairness index while minimizing the UAV energy consumption. To achieve these objectives, an optimization problem is formulated to jointly optimize the transmit power and UAV trajectory. Due to the non-convexity and dynamic nature of the problem, a deep reinforcement learning (DRL)-based algorithm is proposed to solve the problem. Specifically, this algorithm integrates prioritized experience replay and the performer module to enhance system stability and accelerate convergence. Simulation results demonstrate that the proposed approach consistently outperforms benchmark schemes in terms of collected data volume, fairness, and UAV energy consumption.
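The abstract does not say which fairness index is maximized. As an assumption, the sketch below uses Jain's fairness index, a common choice in wireless resource allocation, computed over the data volume collected from each sensor; the paper may use a different definition.

```python
import numpy as np

def jain_fairness(collected):
    """Jain's index: (sum x)^2 / (n * sum x^2), ranging from 1/n to 1."""
    x = np.asarray(collected, dtype=float)
    total_sq = np.sum(x ** 2)
    if total_sq == 0.0:
        return 1.0  # all-zero allocation: treated as (trivially) equal in this sketch
    return float(np.sum(x) ** 2 / (len(x) * total_sq))
```

An equal allocation such as `[1, 1, 1, 1]` scores 1.0, while serving only one of four sensors, as in `[4, 0, 0, 0]`, scores 0.25.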

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Authors: Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.16578
Pdf link: https://arxiv.org/pdf/2601.16578
Abstract We present a reproducible benchmark for evaluating sim-to-real transfer of Multi-Agent Reinforcement Learning (MARL) policies for Connected and Automated Vehicles (CAVs). The platform, based on the Cyber-Physical Mobility Lab (CPM Lab) [1], integrates simulation, a high-fidelity digital twin, and a physical testbed, enabling structured zero-shot evaluation of MARL motion-planning policies. We demonstrate its use by deploying a SigmaRL-trained policy [2] across all three domains, revealing two complementary sources of performance degradation: architectural differences between simulation and hardware control stacks, and the sim-to-real gap induced by increasing environmental realism. The open-source setup enables systematic analysis of sim-to-real challenges in MARL under realistic, reproducible conditions.

A Cognitive Framework for Autonomous Agents: Toward Human-Inspired Design

Authors: Francesco Guidi, Jingfeng Shan, Mehrdad Saeidi, Enrico Testi, Elia Favarelli, Andrea Giorgetti, Davide Dardari, Alberto Zanella, Giorgio Li Pira, Francesca Starita, Anna Guerra
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.16648
Pdf link: https://arxiv.org/pdf/2601.16648
Abstract This work introduces a human-inspired reinforcement learning (RL) architecture that integrates Pavlovian and instrumental processes to enhance decision-making in autonomous systems. While existing engineering solutions rely almost exclusively on instrumental learning, neuroscience shows that humans use Pavlovian associations to leverage predictive cues to bias behavior before outcomes occur. We translate this dual-system mechanism into a cue-guided RL framework in which radio-frequency (RF) stimuli act as conditioned (Pavlovian) cues that modulate action selection. The proposed architecture combines Pavlovian values with instrumental policy optimization, improving navigation efficiency and cooperative behavior in unknown, partially observable environments. Simulation results demonstrate that cue-driven agents adapt faster, achieving superior performance compared to traditional instrumental-solo agents. This work highlights the potential of human learning principles to reshape digital agents' intelligence.

Sim-to-Real Transfer via a Style-Identified Cycle Consistent Generative Adversarial Network: Zero-Shot Deployment on Robotic Manipulators through Visual Domain Adaptation

Authors: Lucía Güitta-López, Lionel Güitta-López, Jaime Boal, Álvaro Jesús López-López
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16677
Pdf link: https://arxiv.org/pdf/2601.16677
Abstract The sample efficiency challenge in Deep Reinforcement Learning (DRL) compromises its industrial adoption due to the high cost and time demands of real-world training. Virtual environments offer a cost-effective alternative for training DRL agents, but the transfer of learned policies to real setups is hindered by the sim-to-real gap. Achieving zero-shot transfer, where agents perform directly in real environments without additional tuning, is particularly desirable for its efficiency and practical value. This work proposes a novel domain adaptation approach relying on a Style-Identified Cycle Consistent Generative Adversarial Network (StyleID-CycleGAN or SICGAN), an original Cycle Consistent Generative Adversarial Network (CycleGAN) based model. SICGAN translates raw virtual observations into real-synthetic images, creating a hybrid domain for training DRL agents that combines virtual dynamics with real-like visual inputs. Following virtual training, the agent can be directly deployed, bypassing the need for real-world training. The pipeline is validated with two distinct industrial robots in the approaching phase of a pick-and-place operation. In virtual environments agents achieve success rates of 90 to 100%, and real-world deployment confirms robust zero-shot transfer (i.e., without additional training in the physical environment) with accuracies above 95% for most workspace regions. We use augmented reality targets to improve the evaluation process efficiency, and experimentally demonstrate that the agent successfully generalizes to real objects of varying colors and shapes, including LEGO® cubes and a mug. These results establish the proposed pipeline as an efficient, scalable solution to the sim-to-real problem.

Adaptive Reinforcement and Model Predictive Control Switching for Safe Human-Robot Cooperative Navigation

Authors: Ning Liu, Sen Shen, Zheng Li, Matthew D'Souza, Jen Jen Chung, Thomas Braunl
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.16686
Pdf link: https://arxiv.org/pdf/2601.16686
Abstract This paper addresses the challenge of human-guided navigation for mobile collaborative robots under simultaneous proximity regulation and safety constraints. We introduce Adaptive Reinforcement and Model Predictive Control Switching (ARMS), a hybrid learning-control framework that integrates a reinforcement learning follower trained with Proximal Policy Optimization (PPO) and an analytical one-step Model Predictive Control (MPC) formulated as a quadratic program safety filter. To enable robust perception under partial observability and non-stationary human motion, ARMS employs a decoupled sensing architecture with a Long Short-Term Memory (LSTM) temporal encoder for the human-robot relative state and a spatial encoder for 360-degree LiDAR scans. The core contribution is a learned adaptive neural switcher that performs context-aware soft action fusion between the two controllers, favoring conservative, constraint-aware QP-based control in low-risk regions while progressively shifting control authority to the learned follower in highly cluttered or constrained scenarios where maneuverability is critical, and reverting to the follower action when the QP becomes infeasible. Extensive evaluations against Pure Pursuit, Dynamic Window Approach (DWA), and an RL-only baseline demonstrate that ARMS achieves an 82.5 percent success rate in highly cluttered environments, outperforming DWA and RL-only approaches by 7.1 percent and 3.1 percent, respectively, while reducing average computational latency by 33 percent to 5.2 milliseconds compared to a multi-step MPC baseline. Additional simulation transfer in Gazebo and initial real-world deployment results further indicate the practicality and robustness of ARMS for safe and efficient human-robot collaboration. Source code and a demonstration video are available at this https URL.
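The soft action fusion with an infeasibility fallback can be sketched as a convex combination of the two controllers' actions. This is an illustration under stated assumptions: in the paper the blending weight comes from a learned neural switcher, whereas here `w_rl` is simply passed in, and the function name and signature are hypothetical.

```python
import numpy as np

def fuse_actions(a_rl, a_qp, w_rl, qp_feasible=True):
    """Blend the learned follower's action with the QP safety filter's action.

    w_rl in [0, 1] is the control authority given to the learned follower;
    if the QP is infeasible (or returned no action), fall back to the follower.
    """
    a_rl = np.asarray(a_rl, dtype=float)
    if not qp_feasible or a_qp is None:
        return a_rl
    w = float(np.clip(w_rl, 0.0, 1.0))
    return w * a_rl + (1.0 - w) * np.asarray(a_qp, dtype=float)
```

A small `w_rl` keeps the output close to the constraint-aware QP action, and raising it shifts authority toward the learned follower, as the abstract describes for cluttered scenes.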

LongCat-Flash-Thinking-2601 Technical Report

Authors: Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, Chenhui Yang, Chuyu Zhang, Cong Chen, Cunguang Wang, Daoru Pan, Defei Bu, Dengchang Zhao, Di Xiu, Dishan Liu, Dongyu Ru, Dunwei Tu, Fan Wu, Fengcheng Yuan, Fengcun Li, Gang Xu, Guanyu Wu, Guoyuan Lin, Haibin Wang, Hansi Yang, Hao Yang, Haonan Yan, Haoxiang Ma, Haoxing Wen, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiacheng Zhang, Jiahong Zhou, Jiahuan Li, Jiaming Wang, Jian Yang, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiapeng Zhu, Jiaqi Sun, Jiarong Shi, Jiarui Zhao, Jingang Wang, Jinluan Yang, Jinrui Ding, Jinwei Xiao, Jiyuan He, Juncan Xu, Kefeng Zhang, Keheng Wang, Li Wei, Lianhui Ma, Lin Qiu, Lingbing Kong, Lingchuan Liu, Linsen Guo, Mengshen Zhu, Mengxia Shen, Mingyang Zhu, Peiguang Li, Peng Pei, Pengcheng Jia, Pengtao Zhang, Peng Zhao, Qi Gu, Qiong Huang, Qiyuan Duan, Quanchi Weng, Rongxiang Weng, Rongzhi Zhang, Rumei Li, Shanglin Lei, Shengnan An, Shijun Dai, Shuaikang Liu, Shuang Zhou, Shuo Wang, Songyuan Zhao, Tao Liang, Tianhao Hu, Tianze Chen, Wei Liu, Wei Shi, Wei Wang, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Wentao Chen, Wentao Shi, Xi Su, Xiangcheng Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16725
Pdf link: https://arxiv.org/pdf/2601.16725
Abstract We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model's strong generalization capability in complex tool use is driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.

Reasoning Promotes Robustness in Theory of Mind Tasks

Authors: Ian B. de Haan, Peter van der Putten, Max van Duijn
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.16853
Pdf link: https://arxiv.org/pdf/2601.16853
Abstract Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.

Boosting Deep Reinforcement Learning with Semantic Knowledge for Robotic Manipulators

Authors: Lucía Güitta-López, Vincenzo Suriani, Jaime Boal, Álvaro J. López-López, Daniele Nardi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16866
Pdf link: https://arxiv.org/pdf/2601.16866
Abstract Deep Reinforcement Learning (DRL) is a powerful framework for solving complex sequential decision-making problems, particularly in robotic control. However, its practical deployment is often hindered by the substantial amount of experience required for learning, which results in high computational and time costs. In this work, we propose a novel integration of DRL with semantic knowledge in the form of Knowledge Graph Embeddings (KGEs), aiming to enhance learning efficiency by providing contextual information to the agent. Our architecture combines KGEs with visual observations, enabling the agent to exploit environmental knowledge during training. Experimental validation with robotic manipulators in environments featuring both fixed and randomized target attributes demonstrates that our method achieves up to 60% reduction in learning time and improves task accuracy by approximately 15 percentage points, without increasing training time or computational complexity. These results highlight the potential of semantic knowledge to reduce sample complexity and improve the effectiveness of DRL in robotic applications.

The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning

Authors: Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2601.16906
Pdf link: https://arxiv.org/pdf/2601.16906
Abstract The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
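The abstract does not define TAC or Soft-TAC. To make the idea of a reward function's induced preferences matching an expert's concrete, the sketch below assumes a simple pairwise agreement score over trajectory pairs and a tanh-smoothed variant that is differentiable in the returns. Both functions, their names, and the `temperature` parameter are illustrative assumptions and may differ from the paper's definitions.

```python
import numpy as np

def pairwise_alignment(returns_a, returns_b, human_prefs):
    """Agreement in [-1, 1] between reward-induced and expert preferences.

    returns_a[i], returns_b[i]: returns of the two trajectories in pair i
    under the candidate reward; human_prefs[i] is +1 if the expert prefers
    trajectory a and -1 if the expert prefers trajectory b.
    """
    diff = np.asarray(returns_a, dtype=float) - np.asarray(returns_b, dtype=float)
    prefs = np.asarray(human_prefs, dtype=float)
    return float(np.mean(np.sign(diff) * prefs))

def soft_pairwise_alignment(returns_a, returns_b, human_prefs, temperature=1.0):
    """Smooth surrogate: sign() replaced by tanh(diff / temperature)."""
    diff = np.asarray(returns_a, dtype=float) - np.asarray(returns_b, dtype=float)
    prefs = np.asarray(human_prefs, dtype=float)
    return float(np.mean(np.tanh(diff / temperature) * prefs))
```

As the temperature shrinks, the smooth surrogate approaches the hard agreement score; in an automatic-differentiation framework its negative could serve as a training loss for a reward model, which is the role the abstract assigns to Soft-TAC.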

Keyword: diffusion policy

There is no result