Arxiv Papers of Today

Generated: 2026-01-16 16:34:08 (UTC+8); arXiv release time: 2026-01-16 20:00 EST (2026-01-17 09:00 UTC+8)

26 relevant papers today

Keyword: reinforcement learning

StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

StatLLaMA：一个多阶段训练框架，用于构建领域优化的统计语言模型

Authors: Jing-Yi Zeng, Guan-Hua Huang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.09718
Pdf link: https://arxiv.org/pdf/2601.09718
Abstract This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at this https URL.
中文摘要 本研究探讨如何高效地利用轻量级LLaMA-3.2-3B家族作为基础模型（FM）构建领域专用大型语言模型（LLM）。我们系统地比较了三条多阶段训练流程，分别是无指令跟随能力的基础FM、带有事后指令调优的基础FM，以及具有强大一般推理能力的指令调优FM，涵盖持续的预训练、监督微调（SFT）、人类反馈强化学习（RLHF）偏好对齐和后续任务适应。结果显示，以基础FM为基础的流水线，即使经过大量指令调优、SFT或RLHF对齐，也无法发展出有意义的统计推理能力。相比之下，从LLaMA-3.2-3B-Instruct开始，能够实现有效的领域专业化。对SFT变体的全面评估显示，领域专长与一般推理能力之间存在明显权衡。我们还进一步证明了直接偏好优化能提供稳定且有效的RLHF偏好比对。最后，我们表明，为了避免高度优化模型中的灾难性遗忘，必须以极低强度进行下游微调。最终模型StatLLaMA在数学推理、常识推理和统计专业知识的基准测试上表现出色且均衡，为开发资源高效的统计大型语言模型提供了实用蓝图。代码可在该 https URL 访问。
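
The StatLLaMA abstract above reports that direct preference optimization (DPO) gave stable preference alignment. As a rough illustration of what that objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The function name, the argument names, and the default `beta=0.1` are assumptions made for this example; they are not taken from the paper or its code.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities: the first two under the policy
    being trained, the last two under the frozen reference model.
    beta is an illustrative default, not a value from the paper.
    """
    # How much more strongly the policy prefers the chosen response over the
    # rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy identical to the reference has zero margin, so the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

The loss falls below log(2) once the policy favours the chosen response more than the reference does, and rises above it in the opposite case.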

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

GUI-Eyes：工具增强感知，用于图形界面代理的视觉基础

Authors: Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.09770
Pdf link: https://arxiv.org/pdf/2601.09770
Abstract Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.
中文摘要 视觉语言模型（VLM）和强化学习（RL）的最新进展推动了GUI自动化的发展。然而，大多数现有方法依赖静态的一次性视觉输入和被动感知，缺乏自适应判断何时、是否以及如何观察界面的能力。我们介绍了GUI-Eyes，一种用于图形用户界面任务中主动视觉感知的强化学习框架。为了获得更有信息量的观察，代理学会在两个阶段的推理过程中，决定是否以及如何调用视觉工具，如裁剪或缩放。为支持这种行为，我们引入了一种渐进感知策略，将决策分解为粗略探索和细粒度基础，并由两级策略协调。此外，我们设计了一个空间连续的奖励函数，针对工具使用量量身定制，整合了位置接近度和区域重叠，提供密集监督，缓解图形用户界面环境中常见的奖励稀疏性。在ScreenSpot-Pro基准测试中，GUI-Eyes-3B仅用3000个标记样本即可实现44.8%的接地准确率，显著优于监督和基于强化学习的基线。这些结果凸显了通过分阶段策略推理和细粒度奖励反馈实现的工具感知，对于构建稳健且数据高效的图形用户界面代理至关重要。
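
The GUI-Eyes abstract above describes a spatially continuous reward that integrates location proximity and region overlap. The abstract does not give the formula, so the sketch below is only one plausible shape for such a reward: an exponential decay on centre distance blended with intersection-over-union. The function names, the `scale` and `w` parameters, and the blending rule are all assumptions for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Centre point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def tool_reward(pred_box, target_box, scale=100.0, w=0.5):
    """Hypothetical dense reward: w * proximity + (1 - w) * overlap.

    scale (pixels) and w are illustrative knobs, not values from the paper.
    """
    (px, py), (tx, ty) = center(pred_box), center(target_box)
    # Proximity stays positive even when the boxes do not overlap at all,
    # which is what keeps the signal dense.
    proximity = math.exp(-math.hypot(px - tx, py - ty) / scale)
    return w * proximity + (1.0 - w) * iou(pred_box, target_box)
```

Under this form a crop that misses the target entirely still earns more reward the closer it lands, whereas a pure overlap reward would be zero in both cases.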

Eluder dimension: localise it!

Eluder维度：定位它！

Authors: Alireza Bakhtiari, Alex Ayoub, Samuel Robertson, David Janz, Csaba Szepesvári
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.09825
Pdf link: https://arxiv.org/pdf/2601.09825
Abstract We establish a lower bound on the eluder dimension of generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.
中文摘要 我们建立了广义线性模型类的 eluder 维数下界，表明标准的 eluder 维分析无法导出一阶遗憾界限。为此，我们引入了距离维度的局部化方法;我们的分析立即恢复并改进了伯努利强化算法的经典结果，并首次实现有限视界强化学习任务中具有有界累积收益的真正一阶界限。

OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

OUTLINEFORGE：科学写作中的层级强化学习与显性状态

Authors: Yilin Bao, Ziyao He, Zayden Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.09858
Pdf link: https://arxiv.org/pdf/2601.09858
Abstract Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edits to evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.
中文摘要 科学论文生成需要文档层面的规划和事实基础，但当前大型语言模型尽管具有较强的局部流利度，却常常在整体结构、输入覆盖率和引用一致性方面存在不足。我们提出了一个强化学习框架，将科学大纲构建视为一个长期规划问题，而非层级文档结构。我们的方法模型通过结构化动作编辑不断变化的大纲，使系统能够逐步构建完整的学术论文。为支持有效且稳定的学习，我们引入了两阶段优化流程，包括：（i）从部分计划进行向向大纲重建以强化全局结构一致性，以及（ii）以价值为导向的前向强化学习，奖励明确建模科学正确性、话语连贯性和引用准确性。此外，我们还提出了科学论文生成的基准，评估文档规划、输入利用、参考文献忠实性、大纲组织和内容层面的事实准确性。我们的结果显示，在长期结构一致性和引用可靠性方面，相较于强神经和大型语言模型基线，持续有改善。

PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

PaperScout：一个具备流程感知序列级策略优化的学术论文搜索自主代理

Authors: Tingyue Pan, Jie Ouyang, Mingyue Cheng, Qingchuan Li, Zirui Liu, Mingfan Pan, Shuo Yu, Qi Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.10029
Pdf link: https://arxiv.org/pdf/2601.10029
Abstract Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision-making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single-turn tasks, suffer from a granularity mismatch when applied to multi-turn agentic tasks, where token-level optimization diverges from the granularity of sequence-level interactions, leading to noisy credit assignment. We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction. Comprehensive experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.
中文摘要 学术论文检索是科学研究中的一项基本任务，但大多数现有方法依赖于僵化、预定义的工作流程，这些工作流在处理复杂的条件查询时感到困难。为解决这一局限，我们提出了PaperScout，一个自主代理，将纸质搜索重新表述为顺序决策过程。与静态工作流程不同，PaperScout 根据累积的检索上下文动态决定是否、何时以及如何调用搜索和扩展工具。然而，训练此类代理存在根本挑战：标准强化学习方法通常为单回合任务设计，在多回合智能体任务中存在粒度不匹配，令牌级优化与序列级交互的粒度差异，导致信用分配噪声较大。我们介绍了近端序列策略优化（PSPO），这是一种过程感知、序列级策略优化方法，将优化与代理环境交互对齐。综合基准和现实世界基准测试显示，PaperScout 在召回性和相关性方面远超强力工作流程驱动和强化学习基线，验证了自适应代理框架和优化策略的有效性。

Event-Driven Deep RL Dispatcher for Post-Storm Distribution System Restoration

事件驱动深度强化调度器，用于风暴后分配系统恢复

Authors: Farshad Amani, Faezeh Ardali, Amin Kargarian
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.10044
Pdf link: https://arxiv.org/pdf/2601.10044
Abstract Natural hazards such as hurricanes and floods damage power grid equipment, forcing operators to replan restoration repeatedly as new information becomes available. This paper develops a deep reinforcement learning (DRL) dispatcher that serves as a real-time decision engine for crew-to-repair assignments. We model restoration as a sequential, information-revealing process and learn an actor-critic policy over compact features such as component status, travel/repair times, crew availability, and marginal restoration value. A feasibility mask blocks unsafe or inoperable actions, such as those that would violate power flow limits, switching rules, or crew-time constraints, before they are applied. To provide realistic runtime inputs without relying on heavy solvers, we use lightweight surrogates for wind and flood intensities, fragility-based failure, spatial clustering of damage, access impairments, and progressive ticket arrivals. In simulated hurricane and flood events, the learned policy updates crew decisions in real time as new field reports arrive. Because the runtime logic is lightweight, it improves online performance (energy-not-supplied, critical-load restoration time, and travel distance) compared with mixed-integer programs and standard heuristics. The proposed approach is tested on the IEEE 13- and 123-bus feeders with mixed hurricane/flood scenarios.
中文摘要 飓风和洪水等自然灾害会损坏电网设备，迫使运营商在获得新信息时反复重新规划恢复工作。本文开发了一种深度强化学习（DRL）调度器，作为维修人员到维修任务的实时决策引擎。我们将修复建模为一个顺序性、信息揭示的过程，并学习对组件状态、旅行/维修时间、船员可用性和边际修复价值等紧凑特征的行为者-批评者政策。可行性掩膜在应用前阻止不安全或不可作的作，如功率流量限制、切换规则和乘员时间限制。为了提供真实的运行时输入而不依赖繁重求解器，我们使用轻量级替代算法来分析风力和洪水强度、基于脆弱性的失效、损害空间聚集、访问障碍和渐进进式票据到达。在模拟的飓风和洪水事件中，学到的政策会实时更新工作人员决策，以应对新的现场报告。由于运行时逻辑轻量级，它提升了在线性能（未供电、关键负载恢复时间和行程距离），相较于混合整数程序和标准启发式方法。该方法在 IEEE 13 和 123 母线馈线上进行了测试，伴随混合飓风/洪水情景。
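
The dispatcher abstract above relies on a feasibility mask that blocks unsafe or inoperable actions before they are applied. A common way to realise this idea is to zero out infeasible actions before normalising the policy distribution. The sketch below shows that generic pattern; the function name and the boolean-list mask format are assumptions, and the real system's checks (power flow, switching rules, crew time) are represented only by precomputed flags.

```python
import math


def masked_policy(logits, feasible):
    """Softmax over actor logits restricted to feasible actions.

    logits:   one raw actor score per candidate crew-to-repair assignment.
    feasible: booleans of the same length; False marks an action that some
              constraint check (assumed to run elsewhere) has ruled out.
    """
    allowed = [l for l, ok in zip(logits, feasible) if ok]
    if not allowed:
        raise ValueError("no feasible action available")
    m = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, feasible)]
    total = sum(exps)
    return [e / total for e in exps]


# The highest-scoring action (index 1) is infeasible and gets probability 0.
print(masked_policy([2.0, 5.0, 1.0], [True, False, True]))
```

Because the mask is applied before sampling, the agent can never select a blocked action, regardless of how high the actor scores it.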

Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

稀疏强化学习：通过稳定稀疏展开打破大型语言模型强化学习中的内存壁垒

Authors: Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10079
Pdf link: https://arxiv.org/pdf/2601.10079
Abstract Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which empowers stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
中文摘要 强化学习（RL）已成为激发大型语言模型（LLM）复杂推理能力的关键。然而，在长视野部署期间存储键值（KV）缓存的巨大内存开销成为关键瓶颈，常常阻碍在有限硬件上高效训练。虽然现有的KV压缩技术为推理提供了解决方案，但直接应用于强化学习训练会导致严重的策略错配，导致性能灾难性崩溃。为此，我们引入了稀疏强化学习，在稀疏推广下实现稳定的强化学习。我们证明不稳定性源于稠密旧策略、稀疏采样策略和学习者策略之间的根本性政策不匹配。为缓解此问题，稀疏强化学习引入了稀疏感知拒绝抽样和基于重要性的重加权，纠正压缩引起的信息丢失带来的偏离政策偏差。实验结果显示，稀疏RL相比密集基线降低了滚动开销，同时保持了性能。此外，稀疏强化学习本身实现了稀疏感知训练，显著增强了稀疏推理部署中的模型鲁棒性。
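
The Sparse-RL abstract above names two corrections for the mismatch between the dense policy and the sparse sampler: rejection sampling and importance-based reweighting. The abstract gives no formulas, so the following is only a generic off-policy correction in that spirit, not the paper's algorithm. The function name and both thresholds are hypothetical.

```python
import math


def correct_rollout(logp_dense, logp_sparse, reject_below=0.5, clip_above=2.0):
    """Return an importance weight for one sampled sequence, or None if rejected.

    logp_dense:  log-probability of the sequence under the dense (full KV) policy.
    logp_sparse: log-probability under the sparse (compressed KV) sampler.
    Both thresholds are illustrative assumptions.
    """
    # Importance ratio between the policy we want and the one that sampled.
    ratio = math.exp(logp_dense - logp_sparse)
    if ratio < reject_below:
        # The dense policy would rarely have produced this sample: drop it.
        return None
    # Cap the weight so a single sample cannot dominate the update.
    return min(ratio, clip_above)
```

Samples that pass the check contribute to the policy gradient scaled by the returned weight, while rejected samples are skipped.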

History Is Not Enough: An Adaptive Dataflow System for Financial Time-Series Synthesis

历史不够：用于金融时间序列综合的自适应数据流系统

Authors: Haochong Xia, Yao Long Teng, Regan Tan, Molei Qin, Xinrun Wang, Bo An
Subjects: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2601.10143
Pdf link: https://arxiv.org/pdf/2601.10143
Abstract In quantitative finance, the gap between training and real-world performance, driven by concept drift and distributional non-stationarity, remains a critical obstacle for building reliable data-driven systems. Models trained on static historical data often overfit, resulting in poor generalization in dynamic markets. The mantra "History Is Not Enough" underscores the need for adaptive data generation that learns to evolve with the market rather than relying solely on past observations. We present a drift-aware dataflow system that integrates machine learning-based adaptive control into the data curation process. The system couples a parameterized data manipulation module, comprising single-stock transformations, multi-stock mix-ups, and curation operations, with an adaptive planner-scheduler that employs gradient-based bi-level optimization to control the system. This design unifies data augmentation, curriculum learning, and data workflow management under a single differentiable framework, enabling provenance-aware replay and continuous data quality monitoring. Extensive experiments on forecasting and reinforcement learning trading tasks demonstrate that our framework enhances model robustness and improves risk-adjusted returns. The system provides a generalizable approach to adaptive data management and learning-guided workflow automation for financial data.
中文摘要 在定量金融领域，培训与实际绩效之间的差距——由概念漂移和分布式非平稳性驱动——仍然是构建可靠数据驱动系统的关键障碍。基于静态历史数据训练的模型常常过拟合，导致动态市场中的泛化性较差。“历史不够”这句口号强调了适应性数据生成的必要性，这种数据需要学会随着市场演变，而非仅依赖过去的观察数据。我们提出了一个基于机器学习的自适应控制系统，将数据管理过程整合进去，实现了漂移感知。该系统将参数化的数据作模块结合，包含单股票转换、多股票混合和策展作，与采用梯度的双级优化的自适应规划调度器来控制系统。该设计将数据增强、课程学习和数据工作流程管理统一在一个可微分框架下，实现来源感知重放和持续的数据质量监控。关于预测和强化学习交易任务的广泛实验表明，我们的框架增强了模型的稳健性并改善了风险调整后的回报。该系统为金融数据提供了一种通用的自适应数据管理和学习引导的工作流自动化方法。

DecisionLLM: Large Language Models for Long Sequence Decision Exploration

DecisionLLM：用于长序列决策探索的大型语言模型

Authors: Xiaowei Lv, Zhilin Zhang, Yijun Li, Yusen Huo, Siyuan Ju, Xuyan Li, Chunxiang Hong, Tianyu Wang, Yongcai Wang, Peng Sun, Chuan Yu, Jian Xu, Bo Zheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.10148
Pdf link: https://arxiv.org/pdf/2601.10148
Abstract Long-sequence decision-making, which is usually addressed through reinforcement learning (RL), is a critical component for optimizing strategic operations in dynamic environments, such as real-time bidding in computational advertising. The Decision Transformer (DT) introduced a powerful paradigm by framing RL as an autoregressive sequence modeling problem. Concurrently, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning and planning tasks. This prompts us to ask whether LLMs, which share the same Transformer foundation but operate at a much larger scale, can unlock new levels of performance in long-horizon sequential decision-making problems. This work investigates the application of LLMs to offline decision-making tasks. A fundamental challenge in this domain is the LLMs' inherent inability to interpret continuous values, as they lack a native understanding of numerical magnitude and order when values are represented as text strings. To address this, we propose treating trajectories as a distinct modality. By learning to align trajectory data with natural language task descriptions, our model can autoregressively predict future decisions within a cohesive framework we term DecisionLLM. We establish a set of scaling laws governing this paradigm, demonstrating that performance hinges on three factors: model scale, data volume, and data quality. In offline experimental benchmarks and bidding scenarios, DecisionLLM achieves strong performance. Specifically, DecisionLLM-3B outperforms the traditional Decision Transformer (DT) by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet. It extends the AIGB paradigm and points to promising directions for future exploration in online bidding.
中文摘要 长序列决策通常通过强化学习（RL）来实现，是优化动态环境中战略作的关键组成部分，例如计算广告中的实时竞价。决策变换器（DT）通过将强化学习框架为自回归序列建模问题，引入了一个强大的范式。与此同时，大型语言模型（LLMs）在复杂的推理和规划任务中表现出显著成功。这激励我们思考，拥有相同Transformer基础但运行规模更大、能够在长期顺序决策问题中解锁新的性能层次。本研究探讨了大型语言模型在离线决策任务中的应用。该领域的一个根本挑战是大型语言模型（LLM）本身无法解释连续值，因为它们缺乏对数值大小和顺序的原生理解，尤其是在以文本字符串表示值时。为此，我们提出将轨迹视为一种独特的模态。通过学习将轨迹数据与自然语言任务描述对齐，我们的模型可以在一个我们称为DecisionLLM的连贯框架内自回归预测未来决策。我们建立了一套规范这一范式的缩放定律，展示了性能依赖于三个因素：模型规模、数据量和数据质量。在离线实验基准和竞价场景中，DecisionLLM表现出色。具体来说，DecisionLLM-3B在Maze2D umaze-v1上的性能比传统决策变换器（DT）高69.4，在AuctionNet上高出0.085。它扩展了AIGB范式，并为未来在线竞价探索指明了有前景的方向。

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

ToolSafe：通过主动的步级防护和反馈，增强基于LLM代理的工具调用安全性

Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10156
Pdf link: https://arxiv.org/pdf/2601.10156
Abstract While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
中文摘要 虽然基于LLM的代理可以通过调用外部工具与环境交互，但其扩展的能力也加剧了安全风险。实时监控步骤级工具调用行为并在不安全执行前主动干预对代理部署至关重要，但目前仍未被充分探索。在本研究中，我们首先构建了TS-Bench，这是LLM代理中步级工具调用安全检测的新基准测试。随后，我们开发了一个护栏模型TS-Guard，利用多任务强化学习。该模型通过推理交互历史，主动检测执行前不安全的工具调用动作。它评估请求的危害性和行为-攻击的相关性，产生可解释且可推广的安全判断和反馈。此外，我们引入了TS-Flow，这是一个基于护栏反馈驱动的LLM代理推理框架，平均减少了ReAct风格代理的有害工具调用65%，并在快速注入攻击下改善良性任务完成率约10%。

Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand

强化学习：发现泰国东北季风指数以预测月度降雨

Authors: Kiattikun Chobtham
Subjects: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP)
Arxiv link: https://arxiv.org/abs/2601.10181
Pdf link: https://arxiv.org/pdf/2601.10181
Abstract Climate prediction is a challenge due to the intricate spatiotemporal patterns within Earth systems. Global climate indices, such as the El Niño Southern Oscillation, are standard input features for long-term rainfall prediction. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel NorthEast monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.
中文摘要 由于地球系统内部复杂的时空格局，气候预测是一项挑战。全球气候指标，如厄尔尼诺南方涛动，是长期降雨预测的标准输入特征。然而，在泰国特定地区，能够提升预测准确性的地方尺度指数仍存在显著差距。本文介绍了一种新的东北季风气候指数，该指数基于海面温度计算，反映了北方冬季季风的气候学。为了优化该指数所用的计算面积，深度Q网络强化学习代理会根据与季节降雨的相关性探索并选择最有效的矩形。降雨站被划分为12个不同的集群，以区分泰国南部和上部的降雨模式。实验结果表明，将优化后的指数纳入长短期记忆模型，显著提升了大多数集群区域的长期月度降雨预测能力。这种方法有效减少了12个月预测的均方根误差。

HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

HOMURA：通过强化学习驯服沙漏，实现时间限制的大型语言模型翻译

Authors: Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.10187
Pdf link: https://arxiv.org/pdf/2601.10187
Abstract Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
中文摘要 大型语言模型（LLMs）在多语言翻译方面取得了显著进展，但由于系统性的跨语言冗长偏见，不适合字幕和配音等严格的时间限制任务。当前提示工程方法难以解决语义忠实度与严格时间可行性的冲突。为了弥合这一差距，我们首先引入了Sand-Glass，这是一个专门用于在音节级时长约束下评估翻译的基准测试。此外，我们提出了HOMURA，一种强化学习框架，明确优化语义保持与时间遵循之间的权衡。通过采用KL正则化的目标和新颖的动态音节比奖励，HOMURA有效地“驯服”了输出长度。实验结果表明，我们的方法显著优于强LLM基线，实现了尊重语言密度层级且不影响语义充分性的精确长度控制。

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

PRL：过程奖励学习提升LLMs的推理能力并拓宽推理边界

Authors: Jiarui Yao, Ruida Wang, Tong Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10201
Pdf link: https://arxiv.org/pdf/2601.10201
Abstract Improving the reasoning abilities of Large Language Models (LLMs) has been a continuous topic recently. But most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Other existing training frameworks that try to combine process signals together to optimize LLMs also rely heavily on tedious additional steps like MCTS, training a separate reward model, etc., doing harm to the training efficiency. Moreover, the intuition behind the process signals design lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that could be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL could turn the outcome reward into process supervision signals, which helps better guide the exploration during RL optimization. From our experiment results, we demonstrate that PRL not only improves the average performance for LLMs' reasoning ability measured by average @ n, but also broadens the reasoning boundary by improving the pass @ n metric. Extensive experiments show the effectiveness of PRL could be verified and generalized.
中文摘要 提升大型语言模型（LLMs）的推理能力一直是近期持续关注的话题。但大多数相关工作基于轨迹层面的结果奖励，缺乏推理过程中的细致监督。其他试图将流程信号组合以优化LLM的现有训练框架，也大量依赖繁琐的额外步骤，如MCTS、训练独立的奖励模型等，这对训练效率造成了损害。此外，工艺信号设计背后的直觉缺乏严谨的理论支持，使得对优化机制的理解变得模糊不清。本文提出了过程奖励学习（Process Reward Learning，PRL），将熵正则化强化学习目标分解为中间步骤，并赋予模型严格的过程奖励。从理论动机出发，我们推导出PRL的表述，本质上等价于奖励最大化的目标加上政策模型与参考模型之间的KL发散惩罚项。然而，PRL可以将结果奖励转化为过程监督信号，有助于更好地指导强化学习优化中的探索。通过我们的实验结果，我们证明了PRL不仅提升了以平均@ n衡量的LLM推理能力的平均表现，还通过提升通过@n指标，拓宽了推理边界。大量实验表明PRL的有效性可以被验证并推广。
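
The PRL abstract above reports results with two metrics, average @ n and pass @ n. A minimal sketch of how these are commonly computed from per-sample correctness flags follows. The input layout (one list of booleans per problem, n samples each) and the function names are assumptions for this example; the paper's evaluation code may differ.

```python
def average_at_n(results):
    """Mean fraction of correct samples per problem, averaged over problems.

    results: list of problems, each a non-empty list of n booleans
             (True if that sampled answer was correct).
    """
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_n(results):
    """Fraction of problems for which at least one of the n samples is correct."""
    return sum(1.0 for r in results if any(r)) / len(results)


# Two problems, four samples each: one solved once, one never solved.
results = [[True, False, False, False], [False, False, False, False]]
print(average_at_n(results), pass_at_n(results))
```

average @ n tracks typical single-sample quality, while pass @ n measures whether a problem is reachable at all, which is why the abstract ties the latter to the "reasoning boundary".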

The impact of tactile sensor configurations on grasp learning efficiency -- a comparative evaluation in simulation

触觉传感器配置对掌握学习效率的影响——模拟中的比较评估

Authors: Eszter Birtalan, Miklós Koller
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.10268
Pdf link: https://arxiv.org/pdf/2601.10268
Abstract Tactile sensors are breaking into the field of robotics to provide direct information related to contact surfaces, including contact events, slip events and even texture identification. These events are especially important for robotic hand designs, including prosthetics, as they can greatly improve grasp stability. Most presently published robotic hand designs, however, implement them in vastly different densities and layouts on the hand surface, often reserving the majority of the available space. We used simulations to evaluate 6 different tactile sensor configurations with different densities and layouts, based on their impact on reinforcement learning. Our two-setup system allows for robust results that are not dependent on the use of a given physics simulator, robotic hand model or machine learning algorithm. Our results show setup-specific, as well as generalized effects across the 6 sensorized simulations, and we identify one configuration as consistently yielding the best performance across both setups. These results could help future research aimed at robotic hand designs, including prostheses.
中文摘要 触觉传感器正在进入机器人领域，提供与接触面相关的直接信息，包括接触事件、滑动事件甚至纹理识别。这些活动对机器人手部设计尤为重要，包括义肢，因为它们能极大提升握持稳定性。然而，目前大多数已发表的机器人手设计在手面上的密度和布局差异极大，往往占据了大部分可用空间。我们利用模拟评估了6种不同密度和布局的触觉传感器配置，基于它们对强化学习的影响。我们的双重系统实现了不依赖于物理模拟器、机器人手模型或机器学习算法的稳健结果。我们的结果显示，6个感测模拟中既有特定配置效应，也有泛化效应，我们确定其中一种配置在两种配置中始终表现最佳。这些结果有望助力未来针对机器人手部设计的研究，包括假肢。

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

证据增强策略优化与奖励共进化，用于长上下文推理

Authors: Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10306
Pdf link: https://arxiv.org/pdf/2601.10306
Abstract While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
中文摘要 虽然强化学习（RL）拥有先进的大型语言模型推理能力，但由于结果奖励稀疏，应用到长期情境场景中仍受阻。这一限制未能惩罚无根据的“幸运猜测”，使得大海捞针般的关键证据检索过程大多无人监管。为此，我们提出了EAPO（证据增强策略优化）。我们首先建立了证据增强推理范式，通过树结构证据抽样验证了精确证据提取是长上下文推理的决定性瓶颈。基于这一见解，EAPO引入了一种专门的强化学习算法，其中奖励模型计算出群体相对证据奖励，提供密集的过程监督，明确提升证据质量。为了在整个培训过程中保持准确的监督，我们还进一步引入了自适应奖励-政策共进机制。该机制通过结果一致的推广迭代优化奖励模型，提升其判别能力，确保流程指导的精确。八个基准的综合评估表明，与SOTA基线相比，EAPO显著提升了长上下文推理表现。

Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

边界感知NL2SQL：通过混合奖励与数据综合整合可靠性

Authors: Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang, Yong Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10318
Pdf link: https://arxiv.org/pdf/2601.10318
Abstract In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: this https URL.
中文摘要 本文介绍了BAR-SQL（边界感知可靠NL2SQL），这是一个统一的训练框架，将可靠性和边界意识直接嵌入生成过程。我们引入了一种种子突变数据综合范式，构建了一个具有代表性的企业语料库，明确涵盖多步分析查询以及边界案例，包括歧义性和模式限制。为确保可解释性，我们采用知识基础推理综合，生成明确锚定于模式元数据和业务规则的思维链痕迹。该模型通过两阶段训练：监督式微调（SFT），随后是通过群体相对策略优化进行强化学习。我们设计了一种任务条件混合奖励机制，同时优化SQL执行准确性——利用抽象语法树分析以及隐忘反应中密集的结果匹配和语义精度。为了在生成准确性之外评估可靠性，我们构建并发布了Ent-SQL-Bench，该软件共同评估了针对模糊且无法回答查询的SQL精度和边界感知性保留。该基准测试的实验结果显示，BAR-SQL在SQL生成质量和边界感知隐匿能力方面均达到91.48%的平均准确率，优于包括Claude 4.5 Sonnet和GPT-5在内的领先专有模型。源代码和基准测试可在匿名访问：https URL。
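
The BAR-SQL abstract above trains with Group Relative Policy Optimization (GRPO). The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, which avoids a learned value function. Here is a minimal sketch of that normalisation step. The function name and `eps` are assumptions, and the use of the population standard deviation is a choice made for this example, since implementations differ on that detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise the rewards of one prompt's sampled group to advantages.

    rewards: scalar rewards for the responses sampled for a single prompt.
    eps:     small constant (assumed) guarding against a zero standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Correct SQL (reward 1) is pushed up, incorrect SQL (reward 0) is pushed down.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.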

SuS: Strategy-aware Surprise for Intrinsic Exploration

SuS：内在探索的战略感知惊喜

Authors: Mark Kashirskiy, Ilya Makarov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2601.10349
Pdf link: https://arxiv.org/pdf/2601.10349
Abstract We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.
中文摘要 我们提出了策略感知惊喜（Strategy-aware Surprise，简称SuS），这是一种新颖的内在动机框架，利用前后预测不匹配作为强化学习探索的新颖信号。与仅依赖状态预测误差的传统好奇心驱动方法不同，SuS引入了两个互补组成部分：战略稳定性（SS）和策略惊喜（SuS）。SS衡量行为策略在时间步骤中的一致性，而SuS则捕捉相对于代理当前策略表征的意外结果。我们的综合奖励表述通过学习的加权系数，利用了这两种信号。我们利用大型语言模型评估了数学推理任务中的SuS，展示了准确性和解多样性的显著提升。消融研究证实，去除任一成分至少会降低10%的性能，验证了我们方法的协同效应。与基线方法相比，SuS在Pass@1方面实现了17.4%的提升，Pass@5提升了26.4%，同时在整个训练过程中保持了更高的策略多样性。

FastStair: Learning to Run Up Stairs with Humanoid Robots

快梯：用类人机器人学习爬楼梯

Authors: Yan Liu, Tao Yu, Haolin Song, Hongbo Zhu, Nianzong Hu, Yuzhi Hao, Xiuyong Yao, Xizhe Zang, Hua Chen, Jie Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.10365
Pdf link: https://arxiv.org/pdf/2601.10365
Abstract Running up stairs is effortless for humans but remains extremely challenging for humanoid robots due to the simultaneous requirements of high agility and strict stability. Model-free reinforcement learning (RL) can generate dynamic locomotion, yet implicit stability rewards and heavy reliance on task-specific reward shaping tend to result in unsafe behaviors, especially on stairs; conversely, model-based foothold planners encode contact feasibility and stability structure, but enforcing their hard constraints often induces conservative motion that limits speed. We present FastStair, a planner-guided, multi-stage learning framework that reconciles these complementary strengths to achieve fast and stable stair ascent. FastStair integrates a parallel model-based foothold planner into the RL training loop to bias exploration toward dynamically feasible contacts and to pretrain a safety-focused base policy. To mitigate planner-induced conservatism and the discrepancy between low- and high-speed action distributions, the base policy is fine-tuned into speed-specialized experts, which are then integrated via Low-Rank Adaptation (LoRA) to enable smooth operation across the full commanded-speed range. We deploy the resulting controller on the Oli humanoid robot, achieving stable stair ascent at commanded speeds up to 1.65 m/s and traversing a 33-step spiral staircase (17 cm rise per step) in 12 s, demonstrating robust high-speed performance on long staircases. Notably, the proposed approach served as the champion solution in the Canton Tower Robot Run Up Competition.
中文摘要 人类跑楼梯轻松自如，但对类人机器人来说却极具挑战性，因为同时需要高度的敏捷性和严格的稳定性。无模型强化学习（RL）可以生成动态移动，但隐性稳定性奖励和对任务特定奖励形态的高度依赖往往导致不安全的行为，尤其是在楼梯上;相反，基于模型的立足点规划器编码了接触可行性和稳定性结构，但强制执行其硬约束往往会导致保守运动，限制速度。我们介绍FastStair，一个由规划者引导的多阶段学习框架，融合了这些互补优势，实现快速且稳定的楼梯上升。FastStair将基于模型的并行立足点规划器整合进强化学习训练循环，以动态可行的接触进行探索，并预训练以安全为中心的基础策略。为缓解计划者引发的保守主义以及低速与高速动作分布之间的差异，基础政策被细化为速度专家，并通过低阶适应（LoRA）整合，实现在整个指令速度范围内的平稳运行。我们将最终的控制器部署到Oli人形机器人上，实现了在指令速度下稳定的楼梯上升，最高可达1.65米/秒，并在12秒内完成33级螺旋楼梯（每级17厘米上升），展示了在长楼梯上的高速性能。值得注意的是，该方案在广州塔机器人助跑竞赛中成为冠军解决方案。

CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning

CS-GBA：基于临界样本的梯度引导后门攻击，用于离线强化学习

Authors: Yuanjie Zhao, Junnan Qiu, Yue Ding, Jie Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.10407
Pdf link: https://arxiv.org/pdf/2601.10407
Abstract Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. Existing attack strategies typically struggle against safety-constrained algorithms (e.g., CQL) due to inefficient random poisoning and the use of easily detectable Out-of-Distribution (OOD) triggers. In this paper, we propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget. Leveraging the theoretical insight that samples with high Temporal Difference (TD) errors are pivotal for value function convergence, we introduce an adaptive Critical Sample Selection strategy that concentrates the attack budget on the most influential transitions. To evade OOD detection, we propose a Correlation-Breaking Trigger mechanism that exploits the physical mutual exclusivity of state features (e.g., 95th percentile boundaries) to remain statistically concealed. Furthermore, we replace the conventional label inversion with a Gradient-Guided Action Generation mechanism, which searches for worst-case actions within the data manifold using the victim Q-network's gradient. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines, achieving high attack success rates against representative safety-constrained algorithms with a minimal 5% poisoning budget, while maintaining the agent's performance in clean environments.
中文摘要 离线强化学习（RL）能够从静态数据集中优化策略，但本质上容易受到后门攻击。现有攻击策略通常难以抵御安全限制的算法（如CQL），原因是随机中毒效率低下，且使用易于检测的Out-of-Distribution（输出范围）触发器。本文提出了CS-GBA（基于关键样本的梯度引导后门攻击），这是一种新颖框架，旨在在严格预算下实现高隐匿性和破坏性。基于理论洞见，高时间差分（TD）误差的样本对价值函数收敛至关重要，我们引入了一种自适应的关键样本选择策略，将攻击预算集中于最具影响力的转移。为规避OOD检测，我们提出了一种相关破坏触发机制，利用状态特征（如第95百分位边界）的物理互排性，保持统计隐蔽。此外，我们用梯度引导动作生成机制替代了传统的标签反演，利用受害Q网络的梯度在数据流形中搜索最坏情况动作。D4RL基准测试的实证结果表明，我们的方法远超最先进的基线，在具代表性安全约束算法时，以极低5%的毒化预算实现了高攻击成功率，同时在清洁环境中保持智能体的性能。

Reinforcement Learning with Multi-Step Lookahead Information Via Adaptive Batching

通过自适应批处理实现多步前瞻信息的强化学习

Authors: Nadav Merlis
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.10418
Pdf link: https://arxiv.org/pdf/2601.10418
Abstract We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While it has been shown that such information can drastically boost the value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes ('fixed batching policies'), and model predictive control. We first illustrate the problems with these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that enables learning the optimal ABP when interacting with unknown environments. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be considered a small constant.
中文摘要 我们研究具有多步前瞻信息的表格强化学习问题。在行动前，学习者观察未来转变和奖励实现的$\ell$步骤：即代理人将达到的具体状态以及在任何可能行动方案下将获得的奖励。虽然研究表明此类信息能大幅提升价值，但找到最优策略是NP难的，通常采用两种可作的启发式方法之一：将预设大小的前瞻处理块（“固定批处理策略”），以及建模预测控制。我们首先说明这两种方法的问题，并提出在自适应（状态相关）批处理中使用前瞻;我们称此类策略为自适应批处理策略（ABP）。我们推导出这些策略的最优贝尔曼方程，并设计出一个乐观的后悔最小化算法，使得在未知环境交互时学习最优ABP。我们的后悔界限在前瞻视界 $\ell$ 的潜在因子下是阶最优的，通常可视为一个小常数。

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

城市社会语义细分与视觉语言推理

Authors: Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2601.10477
Pdf link: https://arxiv.org/pdf/2601.10477
Abstract As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in this https URL.
中文摘要 作为人类活动的枢纽，城市表面由丰富的语义实体组成。将这些不同实体从卫星图像中分割，对于多种下游应用至关重要。当前先进的细分模型能够可靠地对物理属性（如建筑物、水体）定义的实体进行细分，但在处理社会定义的类别（如学校、公园）时仍然难以做到。在这项工作中，我们通过视觉语言模型推理实现了社会语义分割。为此，我们引入了名为SocioSeg的城市社会语义分割数据集，这是一个由卫星影像、数字地图和以层级结构组织的社会语义实体像素级标签组成的新资源。此外，我们还提出了一种名为SocioReasoner的新型视觉-语言推理框架，通过跨模态识别和多阶段推理模拟人类识别和注释社会语义实体的过程。我们采用强化学习来优化这一不可微分过程，并激发视觉语言模型的推理能力。实验展示了我们方法相较于最先进模型和强有力的零射推广能力的优势。我们的数据集和代码都在这个 https URL 中。

PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

PERM：基于心理学的大型语言模型的同理心奖励建模

Authors: Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng, Yi Xie, Wenjie Wang, Fuli Feng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10532
Pdf link: https://arxiv.org/pdf/2601.10532
Abstract Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at this https URL.
中文摘要 大型语言模型（LLMs）越来越多地应用于以人为中心的应用中，但它们往往无法提供实质性的情感支持。虽然强化学习（RL）已被用于增强LLMs的同理心，但现有的奖励模型通常从单一视角评估同理心，忽视了支持者与寻求者之间同理心本质上的双向互动，这与同理心循环理论相符。为解决这一局限，我们提出了基于心理学的同理心奖励建模（PERM）。PERM通过双向分解实现同理心评估：1）支持者视角，评估内部共鸣和沟通表达;2）寻求者视角，评估情感接收。此外，它还加入了旁观者的视角，以监控整体互动质量。广泛使用的情商基准和工业日常对话数据集的实验表明，PERM的表现比最先进的基线高出10%以上。此外，一项盲测用户研究显示，70%的人更倾向于我们的方法，强调了其在激发更多同理心反应方面的有效性。我们的代码、数据集和模型均可在此 https URL 访问。

Combinatorial Optimization Augmented Machine Learning

组合优化增强机器学习

Authors: Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.10583
Pdf link: https://arxiv.org/pdf/2601.10583
Abstract Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
中文摘要 组合优化增强机器学习（COAML）最近成为将预测模型与组合决策整合的强大范式。通过将组合优化预言机嵌入学习流程，COAML能够构建既基于数据又保持可行性的策略，连接了机器学习、运筹学和随机优化的传统。本文全面概述了COAML的最新技术水平。我们介绍了COAML管道的统一框架，描述其方法论构建模块，并形式化其与经验成本最小化的联系。然后，我们基于不确定性和决策结构，制定了问题设置的分类法。利用该分类法，我们回顾了静态和动态问题的算法方法，综述调度、车辆路由、随机规划和强化学习等领域的应用，并综合了经验成本最小化、模仿学习和强化学习等方法论贡献。最后，我们确定了关键的研究前沿。本次调查旨在作为该领域的教程介绍，以及未来组合优化与机器学习交叉领域的研究路线图。

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

成为你自己的红队成员：通过自我游戏和反思性体验回放实现安全调整

Authors: Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.10589
Pdf link: https://arxiv.org/pdf/2601.10589
Abstract Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial "jailbreak" attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
中文摘要 大型语言模型（LLMs）已实现了卓越的能力，但仍易受到旨在绕过安全防护的对抗性“越狱”攻击。当前的安全对齐方法高度依赖静态外部红队，使用固定防御提示或预先收集的对抗数据集。这导致了僵化的防御，过度拟合已知模式，无法推广到新颖复杂的威胁。为解决这一关键局限，我们提议赋予该模型作为独立红队成员的能力，能够实现自主且不断演变的对抗性攻击。具体来说，我们引入了安全自玩（SSP），这是一种利用单一大型语言模型，在统一的强化学习（RL）循环中同时扮演攻击者（越狱）和防御者（拒绝有害请求）的系统，动态演进攻击策略以发现漏洞，同时强化防御机制。为了确保防御者在自我游戏过程中有效解决关键安全问题，我们引入了先进的反思体验回放机制，利用整个过程中积累的经验池。该机制采用上置信度界限（UCB）抽样策略，聚焦于奖励较低的失败案例，帮助模型从过去的困难错误中学习，同时平衡探索与利用。大量实验表明，我们的SSP方法自主演进了强健的防御能力，远超静态对抗数据集训练的基线，树立了主动安全对齐的新基准。
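
Digest note: the abstract above describes UCB sampling over an experience pool that favors low-reward failure cases while still exploring. The following is a minimal sketch of that general idea, not the authors' implementation. The `Experience` class, the `[0, 1]` reward range, the "1 - reward" exploitation term, and the constant `c` are all assumptions made for illustration; the abstract does not specify the scoring rule.

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored self-play episode (hypothetical structure)."""
    prompt: str
    reward: float            # assumed to lie in [0, 1]; low means the defense failed
    times_replayed: int = 0  # how often this entry has been sampled so far


def ucb_score(exp: Experience, total_replays: int, c: float = 1.0) -> float:
    """Standard UCB1-style score: exploitation term plus exploration bonus."""
    if exp.times_replayed == 0:
        return math.inf  # never-replayed entries are tried first
    exploit = 1.0 - exp.reward  # assumed: lower reward means higher priority
    explore = c * math.sqrt(math.log(total_replays) / exp.times_replayed)
    return exploit + explore


def sample_for_replay(pool: list[Experience], c: float = 1.0) -> Experience:
    """Pick the entry with the highest UCB score and record the replay."""
    total = max(sum(e.times_replayed for e in pool), 1)
    best = max(pool, key=lambda e: ucb_score(e, total, c))
    best.times_replayed += 1
    return best
```

With this scoring, every entry is replayed once before any is repeated; after that, entries with lower reward are preferred, and the exploration bonus shrinks for entries that have already been replayed many times.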

Institutional AI: A Governance Framework for Distributional AGI Safety

机构人工智能：分布式AGI安全的治理框架

Authors: Federico Pierucci, Marcello Galisai, Marcantonio Syrnikov Bracale, Matteo Prandi, Piercosma Bisconti, Francesco Giarrusso, Olga Sorokoletova, Vincenzo Suriani, Daniele Nardi
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2601.10599
Pdf link: https://arxiv.org/pdf/2601.10599
Abstract As LLM-based systems increasingly operate as agents embedded within human social and technical systems, alignment can no longer be treated as a property of an isolated model, but must be understood in relation to the environments in which these agents act. Even the most sophisticated methods of alignment, such as Reinforcement Learning through Human Feedback (RLHF) or through AI Feedback (RLAIF), cannot ensure control once internal goal structures diverge from developer intent. We identify three structural problems that emerge from core properties of AI models: (1) behavioral goal-independence, where models develop internal objectives and misgeneralize goals; (2) instrumental override of natural-language constraints, where models regard safety principles as non-binding while pursuing latent objectives, leveraging deception and manipulation; and (3) agentic alignment drift, where individually aligned agents converge to collusive equilibria through interaction dynamics invisible to single-agent audits. The solution this paper advances is Institutional AI: a system-level approach that treats alignment as a question of effective governance of AI agent collectives. We argue for a governance-graph that details how to constrain agents via runtime monitoring, incentive shaping through prizes and sanctions, explicit norms and enforcement roles. This institutional turn reframes safety from software engineering to a mechanism design problem, where the primary goal of alignment is shifting the payoff landscape of AI agent collectives.
中文摘要 随着基于LLM的系统越来越多地作为嵌入人类社会和技术系统的智能体运作，对齐性不再能被视为孤立模型的属性，而必须结合这些智能体所处的环境来理解。即使是最复杂的对齐方法，如通过人类反馈进行强化学习（RLHF）或通过人工智能反馈（RLAIF），一旦内部目标结构与开发者意图相悖，也无法确保控制。我们识别出三个结构性问题，这些问题源自人工智能模型的核心属性：（1）行为目标独立性，模型发展内部目标并错误概括目标;（2）工具性覆盖自然语言约束，模型在追求潜在目标时将安全原则视为无约束力，利用欺骗和操纵;以及（3）代理对齐漂移，即个体对齐的代理通过单一代理审计看不到的互动动态趋同于共谋均衡。本文提出的解决方案是制度性人工智能：一种系统级方法，将对齐视为有效治理人工智能代理集体的问题。我们主张建立一个治理图，详细说明如何通过运行时监控、通过奖励和制裁塑造激励、明确规范和执行角色来约束代理。这一制度性转变将安全从软件工程重新定位为机制设计问题，其中对齐的主要目标是改变人工智能代理集体的收益格局。

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

MatchTIR：通过二分匹配实现工具整合推理的细粒度监督

Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.10712
Pdf link: https://arxiv.org/pdf/2601.10712
Abstract Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at this https URL.
中文摘要 工具集成推理（TIR）通过将推理步骤与外部工具交互交错，使大型语言模型（LLM）能够处理复杂任务。然而，现有的强化学习方法通常依赖于结果或轨迹级奖励，为轨迹中的所有步骤分配统一的优势。这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的工具调用，尤其是在长视野多回合情景中。为此，我们提出了MatchTIR，这一框架通过两部分匹配基于回合级奖励分配和双层优势估计引入细粒度监督。具体来说，我们将信用分配表述为预测痕迹与真实痕迹之间的二分匹配问题，利用两种分配策略推导高密度的回合级奖励。此外，为了平衡局部步进精度与全局任务成功率，我们引入了一种双层优势估计方案，整合了转弯级和轨迹级信号，为单个交互转弯赋予不同的优势值。对三个基准测试的广泛实验证明了MatchTIR的优越性。值得注意的是，我们的4B模型在长视野和多回合任务中超过了大多数8B竞争对手。我们的代码可在此 https URL 获取。
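
Digest note: the abstract above frames turn-level credit assignment as a bipartite matching between predicted and ground-truth tool calls. The sketch below illustrates that framing only; it is not the MatchTIR code. The call format (`tool` and `args` keys), the `call_similarity` function, and the brute-force search over assignments are assumptions chosen to keep the example dependency-free. A practical version would likely use a polynomial-time assignment solver instead.

```python
from itertools import permutations


def call_similarity(pred: dict, gold: dict) -> float:
    """Hypothetical similarity in [0, 1] between two tool calls."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    keys = set(pred["args"]) | set(gold["args"])
    if not keys:
        return 1.0
    same = sum(1 for k in keys if pred["args"].get(k) == gold["args"].get(k))
    # Half credit for the right tool, the rest for matching arguments.
    return 0.5 + 0.5 * same / len(keys)


def turn_rewards(preds: list[dict], golds: list[dict]) -> list[float]:
    """One-to-one matching that maximizes total similarity.

    Each predicted call receives the similarity of its matched
    ground-truth call, or 0.0 if it is left unmatched.
    Brute force: only suitable for short traces.
    """
    n, m = len(preds), len(golds)
    rewards = [0.0] * n
    if n == 0 or m == 0:
        return rewards
    sim = [[call_similarity(p, g) for g in golds] for p in preds]
    if n <= m:
        # Assign every prediction to a distinct ground-truth call.
        candidates = (list(zip(range(n), cols)) for cols in permutations(range(m), n))
    else:
        # More predictions than references: some predictions stay unmatched.
        candidates = (list(zip(rows, range(m))) for rows in permutations(range(n), m))
    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    for i, j in best_pairs:
        rewards[i] = sim[i][j]
    return rewards
```

Under these assumptions, a trace that repeats a correct call leaves the duplicate unmatched, so the redundant turn gets zero reward while the first occurrence keeps full credit. This is the kind of distinction that a single trajectory-level reward cannot express.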

Keyword: diffusion policy

No results found.