Arxiv Papers of Today

Generated: 2025-11-24 16:31:35 (UTC+8); Arxiv release time: 2025-11-24 20:00 EST (2025-11-25 09:00 UTC+8)

18 relevant papers today

Keyword: reinforcement learning

Improving Latent Reasoning in LLMs via Soft Concept Mixing

Authors: Kang Wang, Xiangyu Duan, Tianyi Du
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.16885
Pdf link: https://arxiv.org/pdf/2511.16885
Abstract Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs' latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft-concept-aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model's hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs while maintaining stable training dynamics.
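
The two SCM steps named in the abstract (a probability-weighted average of embeddings, then mixing into the hidden state) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy embedding table, and the convex-combination mixing rule with weight `alpha` are all assumptions, since the abstract does not say how the mixing is performed.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_concept_vector(probs, embeddings):
    """Probability-weighted average of token embeddings."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]


def mix_into_hidden(hidden, concept, alpha=0.5):
    """Hypothetical mixing rule (an assumption): convex combination."""
    return [(1.0 - alpha) * h + alpha * c for h, c in zip(hidden, concept)]


# Toy example: a two-token vocabulary with 2-d embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
probs = softmax([0.0, 0.0])  # uniform over the two tokens
concept = soft_concept_vector(probs, embeddings)
mixed = mix_into_hidden([1.0, 1.0], concept, alpha=0.5)
```

With a one-hot distribution the soft concept vector reduces to an ordinary token embedding, which is the discrete-token case the abstract contrasts with.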

When Motion Learns to Listen: Diffusion-Prior Lyapunov Actor-Critic Framework with LLM Guidance for Stable and Robust AUV Control in Underwater Tasks

Authors: Jingzehua Xu, Weiyi Liu, Weihang Zhang, Zhuofan Xi, Guanwen Xie, Shuai Zhang, Yi Li
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.16900
Pdf link: https://arxiv.org/pdf/2511.16900
Abstract Autonomous Underwater Vehicles (AUVs) are indispensable for marine exploration; yet, their control is hindered by nonlinear hydrodynamics, time-varying disturbances, and localization uncertainty. Traditional controllers provide only limited adaptability, while Reinforcement Learning (RL), though promising, suffers from sample inefficiency, weak long-term planning, and a lack of stability guarantees, leading to unreliable behavior. To address these challenges, we propose a diffusion-prior Lyapunov actor-critic framework that unifies exploration, stability, and semantic adaptability. Specifically, a diffusion model generates smooth, multimodal, and disturbance-resilient candidate actions; a Lyapunov critic further imposes dual constraints that ensure stability; and a Large Language Model (LLM)-driven outer loop adaptively selects and refines Lyapunov functions based on task semantics and training feedback. This "generation-filtering-optimization" mechanism not only enhances sample efficiency and planning capability but also aligns stability guarantees with diverse mission requirements in the multi-objective optimization task. Extensive simulations under complex ocean dynamics demonstrate that the proposed framework achieves more accurate trajectory tracking, higher task completion rates, improved energy efficiency, faster convergence, and improved robustness compared with conventional RL and diffusion-augmented baselines.
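
The "filtering" stage of the generation-filtering-optimization loop can be illustrated with a simple Lyapunov decrease check over candidate actions. This is a sketch under strong assumptions: a scalar state, a known one-step model `step`, a quadratic candidate function `V`, and a decrease rate `alpha` are all hypothetical, and neither the paper's dual constraints nor its diffusion model is reproduced here.

```python
def lyapunov_filter(state, candidates, step, V, alpha=0.1):
    """Keep candidate actions whose predicted next state satisfies
    V(next) - V(state) <= -alpha * V(state)."""
    current = V(state)
    kept = []
    for action in candidates:
        if V(step(state, action)) - current <= -alpha * current:
            kept.append(action)
    return kept


# Toy scalar system (hypothetical): x' = x + a, with V(x) = x^2.
def step(x, a):
    return x + a


def V(x):
    return x * x


# Candidates would come from the generative model; here they are hard-coded.
safe_actions = lyapunov_filter(2.0, [-1.0, 0.0, 1.0], step, V)
```

Only the action that moves the state toward the origin passes the check; the other two leave `V` unchanged or increase it.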

R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios

Authors: Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.16901
Pdf link: https://arxiv.org/pdf/2511.16901
Abstract Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. To construct it, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.

Predicting Talent Breakout Rate using Twitter and TV data

Authors: Bilguun Batsaikhan, Hiroyuki Fukuda
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16905
Pdf link: https://arxiv.org/pdf/2511.16905
Abstract Early detection of rising talents is of paramount importance in the field of advertising. In this paper, we define a concept of talent breakout and propose a method to detect Japanese talents before their rise to stardom. The main focus of the study is to determine the effectiveness of combining Twitter and TV data on predicting time-dependent changes in social data. Although traditional time-series models are known to be robust in many applications, the success of neural network models in various fields (e.g., Natural Language Processing, Computer Vision, Reinforcement Learning) continues to spark an interest in the time-series community to apply new techniques in practice. Therefore, in order to find the best modeling approach, we have experimented with traditional, neural network and ensemble learning methods. We observe that ensemble learning methods outperform traditional and neural network models based on standard regression metrics. However, by utilizing the concept of talent breakout, we are able to assess the true forecasting ability of the models, where neural networks outperform traditional and ensemble learning methods in terms of precision and recall.

Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16916
Pdf link: https://arxiv.org/pdf/2511.16916
Abstract In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TRD) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
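
The two HDR components can be sketched in a few lines. The first term uses the classic potential-based shaping form gamma * Phi(s') - Phi(s), which is assumed here because the abstract mentions a global potential function and optimal policy invariance. The second term estimates the marginal utility of an action by a central finite difference. The weights, the toy potential, the toy utility, and all function names are hypothetical.

```python
def temporal_difference_reward(phi, s, s_next, gamma=0.99):
    """Potential-based term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi(s_next) - phi(s)


def action_gradient_reward(utility, s, a, eps=1e-4):
    """Central finite-difference estimate of d utility / d action."""
    return (utility(s, a + eps) - utility(s, a - eps)) / (2.0 * eps)


def hybrid_reward(phi, utility, s, a, s_next, gamma=0.99, w_td=1.0, w_ag=1.0):
    """Hypothetical weighted sum of the two components."""
    return (w_td * temporal_difference_reward(phi, s, s_next, gamma)
            + w_ag * action_gradient_reward(utility, s, a))


# Toy setup (assumed): the potential is the negative distance to a goal at
# x = 10, and the utility is concave in the action with its peak at a = 1.
def phi(s):
    return -abs(s - 10.0)


def utility(s, a):
    return -(a - 1.0) ** 2


td = temporal_difference_reward(phi, 4.0, 5.0, gamma=1.0)
ag = action_gradient_reward(utility, 4.0, 0.0)
```

Both terms stay informative when consecutive states are nearly identical, which is the vanishing-reward-difference situation the abstract describes.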

CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection

Authors: Rui Xue, Dan He, Fengmei Jin, Chen Zhang, Xiaofang Zhou
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2511.16929
Pdf link: https://arxiv.org/pdf/2511.16929
Abstract Detecting trajectory anomalies is a vital task in modern Intelligent Transportation Systems (ITS), enabling the identification of unsafe, inefficient, or irregular travel behaviours. While deep learning has emerged as the dominant approach, several key challenges remain unresolved. First, sub-trajectory anomaly detection, capable of pinpointing the precise segments where anomalies occur, remains underexplored compared to whole-trajectory analysis. Second, many existing methods depend on carefully tuned thresholds, limiting their adaptability in real-world applications. Moreover, the irregular sampling of trajectory data and the presence of noise in training sets further degrade model performance, making it difficult to learn reliable representations of normal routes. To address these challenges, we propose a contrastive reinforcement learning framework for online trajectory anomaly detection, CroTad. Our method is threshold-free and robust to noisy, irregularly sampled data. By incorporating contrastive learning, CroTad learns to extract diverse normal travel patterns for different itineraries and effectively distinguish anomalous behaviours at both sub-trajectory and point levels. The detection module leverages deep reinforcement learning to perform online, real-time anomaly scoring, enabling timely and fine-grained identification of abnormal segments. Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of our framework across various evaluation scenarios.

RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion

Authors: Bhanu Pratap Paregi, Vaibhav Kumar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.17054
Pdf link: https://arxiv.org/pdf/2511.17054
Abstract Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
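
Chamfer Distance, which the abstract names as a refinement signal when ground truth is available, can be computed as below. Conventions vary between papers; this sketch assumes the sum of the two mean nearest-neighbour squared distances. The `select_better` helper is a hypothetical stand-in for the idea of retaining the better reconstruction: the paper's PointNN selector scores geometric consistency without ground truth, which is not reproduced here.

```python
def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets
    (sum of the two mean nearest-neighbour squared distances)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(src, dst):
        return sum(min(sq_dist(a, b) for b in dst) for a in src) / len(src)

    return one_way(p, q) + one_way(q, p)


def select_better(original, refined, reference):
    """Keep whichever completion is closer to the reference point set."""
    if chamfer_distance(refined, reference) < chamfer_distance(original, reference):
        return refined
    return original


reference = [(0.0, 0.0, 0.0)]
original = [(1.0, 0.0, 0.0)]
refined = [(0.5, 0.0, 0.0)]
kept = select_better(original, refined, reference)
```

The brute-force nearest-neighbour search is quadratic in the number of points, so a spatial index would be needed for clouds of realistic size.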

MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward

Authors: Kesheng Chen, Wenjian Luo, Bang Zhang, Zeping Yin, Zipeng Ye
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17165
Pdf link: https://arxiv.org/pdf/2511.17165
Abstract Episodic rewards present a significant challenge in reinforcement learning. While intrinsic reward methods have demonstrated effectiveness in single-agent reinforcement learning scenarios, their application to multi-agent reinforcement learning (MARL) remains problematic. The primary difficulties stem from two factors: (1) the exponential sparsity of joint action trajectories that lead to rewards as the exploration space expands, and (2) existing methods often fail to account for joint actions that can influence team states. To address these challenges, this paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for MARL with extremely sparse rewards like episodic rewards. MIR incentivizes individual agents to explore actions that affect their teammates, and when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. For comprehensive experimental validation, we extend the representative single-agent MiniGrid environment to create MiniGrid-MA, a series of MARL environments with sparse rewards. Our evaluation compares the proposed method against state-of-the-art approaches in the MiniGrid-MA setting, with experimental results demonstrating superior performance.

FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

Authors: Mario Markov (1), Stefan Maria Ailuro (1), Luc Van Gool (1), Konrad Schindler (2), Danda Pani Paudel (1 and 2) ((1) INSAIT, Sofia University, (2) ETH Zurich)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17171
Pdf link: https://arxiv.org/pdf/2511.17171
Abstract Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis

Authors: Haijiang Liu, Jinguang Gu, Xun Wu, Daniel Hershcovich, Qiaoling Xiao
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.17256
Pdf link: https://arxiv.org/pdf/2511.17256
Abstract As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges (fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality) alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation...

MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

Authors: Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.17300
Pdf link: https://arxiv.org/pdf/2511.17300
Abstract Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks (specifically chemical bond classification and atom localization) contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight's relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model's performance on stereochemical structures. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
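
The core idea behind GRPO, as commonly described, is to score each sampled output relative to the other outputs in its group instead of using a learned value baseline. The sketch below computes such group-relative advantages. It assumes the population standard deviation and a small `eps` for numerical safety, and the reward values are made up for illustration; how MolSight defines its rewards is not stated in the abstract.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled outputs of the same input image,
# e.g. 1.0 when the predicted structure matches and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.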

Convergence and stability of Q-learning in Hierarchical Reinforcement Learning

Authors: Massimiliano Manenti, Andrea Iannelli
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.17351
Pdf link: https://arxiv.org/pdf/2511.17351
Abstract Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
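
To make "coupled updates" concrete, the sketch below applies the standard tabular Q-learning update to two tables, one for a manager that chooses subgoals and one for a worker that chooses actions given a subgoal. The two-level layout, the table keys, and the worker's subgoal bonus are assumptions for illustration; the abstract does not specify the exact Feudal Q-learning scheme or its conditions.

```python
from collections import defaultdict


def q_update(table, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update on one (state, action) entry."""
    best_next = max((table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    table[(state, action)] += alpha * (target - table[(state, action)])
    return table[(state, action)]


manager_q = defaultdict(float)  # keyed by (state, subgoal)
worker_q = defaultdict(float)   # keyed by ((state, subgoal), action)

# One coupled step on a toy transition (all values hypothetical).
state, subgoal, action, next_state = 0, "g1", "right", 1
env_reward = 1.0
worker_reward = 1.0 if next_state == 1 else 0.0  # assumed subgoal bonus

q_update(worker_q, (state, subgoal), action, worker_reward,
         (next_state, subgoal), ["left", "right"])
q_update(manager_q, state, subgoal, env_reward, next_state, ["g1", "g2"])
```

The coupling arises because the worker's behaviour determines the transitions the manager observes, while the manager's subgoal shapes the worker's reward.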

R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Authors: Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17367
Pdf link: https://arxiv.org/pdf/2511.17367
Abstract Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
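
One simple way to maintain a belief about the evader's possible positions on a graph is a set-based update: expand the belief along edges, then discard nodes the pursuers can currently see. The sketch below assumes this set representation, an adjacency-list graph, and an evader that may stay in place or move one edge per step. The paper's belief preservation mechanism may differ in all of these respects.

```python
def update_belief(belief, graph, observed, sighting=None):
    """Return the set of nodes where the evader could be after one step.

    belief   -- set of nodes the evader might currently occupy
    graph    -- dict mapping each node to a list of neighbours
    observed -- set of nodes the pursuers can see and found empty
    sighting -- node where the evader was actually seen, if any
    """
    if sighting is not None:
        return {sighting}
    reachable = set()
    for node in belief:
        reachable.add(node)            # the evader may stay put
        reachable.update(graph[node])  # or move to a neighbour
    return reachable - observed


# Path graph 0 - 1 - 2 - 3 with the evader last believed to be at node 1.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
belief = update_belief({1}, graph, observed={0})
```

A worst-case pursuer would then plan against every node remaining in the belief set instead of a single estimated position.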

Human Imitated Bipedal Locomotion with Frequency Based Gait Generator Network

Authors: Yusuf Baran Ates, Omer Morgul
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.17387
Pdf link: https://arxiv.org/pdf/2511.17387
Abstract Learning human-like, robust bipedal walking remains difficult due to hybrid dynamics and terrain variability. We propose a lightweight framework that combines a gait generator network learned from human motion with a Proximal Policy Optimization (PPO) controller for torque control. Despite being trained only on flat or mildly sloped ground, the learned policies generalize to steeper ramps and rough surfaces. Results suggest that pairing spectral motion priors with Deep Reinforcement Learning (DRL) offers a practical path toward natural and robust bipedal locomotion with modest training cost.
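
A frequency-based gait reference can be illustrated with a truncated Fourier series that maps time to a periodic joint angle. This parameterization is an assumption: the abstract mentions a frequency-based generator and spectral motion priors but does not give the functional form, and the coefficients below are invented and not derived from human motion data.

```python
import math


def gait_reference(t, base_freq_hz, offset, cos_coeffs, sin_coeffs):
    """Evaluate a truncated Fourier series for one joint angle at time t."""
    omega = 2.0 * math.pi * base_freq_hz
    angle = offset
    for k, (a_k, b_k) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        angle += a_k * math.cos(k * omega * t) + b_k * math.sin(k * omega * t)
    return angle


# Hypothetical hip-joint reference with two harmonics at a 1 Hz stride rate.
hip_angle = gait_reference(0.25, 1.0, 0.1, [0.3, 0.05], [0.1, 0.0])
```

In a setup like the one described, a learned controller would track such a reference, with the network predicting the coefficients instead of hard-coding them.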

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Authors: Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.17392
Pdf link: https://arxiv.org/pdf/2511.17392
Abstract Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
中文摘要 可变形图像配准（DIR）仍然是医学图像分析中一个基础且具有挑战性的问题，主要原因在于稠密位移场的形变空间维度极高，且体素级监督稀缺。现有的强化学习框架通常将该空间投影为粗糙的低维表示，限制了其捕捉空间变化形变的能力。我们提出了MorphSeek，一种细粒度的表示级策略优化范式，将DIR重新表述为潜在特征空间中空间连续的优化过程。MorphSeek在编码器之上引入了随机高斯策略头，用于建模潜在特征上的分布，从而便于高效探索和由粗到细的细化。该框架通过组相对策略优化（Group Relative Policy Optimization）将无监督预热与弱监督微调相结合，其中多轨迹采样可稳定训练并提升标签效率。在三个三维配准基准（OASIS脑部MRI、LiTS肝脏CT和腹部MR-CT）上，MorphSeek相较有竞争力的基线取得了一致的Dice提升，同时保持较高的标签效率，参数开销极小，单步延迟开销较低。除优化器的具体细节之外，MorphSeek还推进了一种表示级策略学习范式，可实现空间一致且数据高效的形变优化，为高维场景下可扩展的视觉对齐提供了一种有原则、与骨干网络无关且与优化器无关的解决方案。
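
The group-relative step that gives Group Relative Policy Optimization its name can be sketched as follows. This is a generic illustration, not MorphSeek's code: the function name, the choice of a per-sample Dice score as the reward, and the example scores are all assumptions made for illustration. Each sampled trajectory is scored against the other samples drawn for the same input, so no learned value network is needed.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own sampling group.

    `rewards` holds one scalar per trajectory sampled for the same input.
    Returns (r_i - group mean) / (group std + eps) for every sample.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards: Dice scores of four latent samples for one image pair.
dice_scores = [0.70, 0.74, 0.78, 0.82]
advantages = group_relative_advantages(dice_scores)

# Samples above the group mean get a positive advantage, samples below it a
# negative one; a group with identical rewards yields all-zero advantages.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

In a full training loop these advantages would weight the log-probabilities of the sampled latent features under the Gaussian policy head; that part is omitted here because the abstract does not specify it.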

Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

多智能体指针Transformer：面向多车辆动态取送货问题的序列到序列强化学习

Authors: Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17435
Pdf link: https://arxiv.org/pdf/2511.17435
Abstract This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) Independent decoding across multiple vehicles fails to model joint action distributions; 2) The feature extraction network struggles to capture inter-entity relationships; 3) The joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an AutoRegressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model's decision-making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.
中文摘要 本文研究带随机请求的协作式多车辆动态取送货问题（MVDPDPSR），并提出了一种基于序列到序列的端到端集中式决策框架，称为多智能体指针Transformer（MAPT）。MVDPDPSR是车辆路径问题的扩展，也是一个时空系统优化问题，广泛应用于即时配送等场景。经典运筹学方法在处理大规模动态问题时，在计算复杂度和时间效率方面面临瓶颈。尽管现有强化学习方法取得了一定进展，但仍面临若干挑战：1）多车辆各自独立解码，无法建模联合动作分布；2）特征提取网络难以捕捉实体间的关系；3）联合动作空间呈指数级规模。为解决这些问题，我们设计了MAPT框架，该框架采用Transformer编码器提取实体表示，将Transformer解码器与指针网络相结合，以自回归方式生成联合动作序列，并引入关系感知注意力模块以捕捉实体间关系。此外，我们利用信息性先验引导模型决策，以促进有效探索。在8个数据集上的实验表明，MAPT在性能上显著优于现有基线方法，并且与经典运筹学方法相比在计算时间上具有明显优势。

Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization

利用集群LQR系统的数据：个性化与协作策略优化

Authors: Vinay Kanakeri, Shivam Bajaj, Ashwin Verma, Vijay Gupta, Aritra Mitra
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.17489
Pdf link: https://arxiv.org/pdf/2511.17489
Abstract It is known that reinforcement learning (RL) is data-hungry. To improve sample-efficiency of RL, it has been proposed that the learning algorithm utilize data from 'approximately similar' processes. However, since the process models are unknown, identifying which other processes are similar poses a challenge. In this work, we study this problem in the context of the benchmark Linear Quadratic Regulator (LQR) setting. Specifically, we consider a setting with multiple agents, each corresponding to a copy of a linear process to be controlled. The agents' local processes can be partitioned into clusters based on similarities in dynamics and tasks. Combining ideas from sequential elimination and zeroth-order policy optimization, we propose a new algorithm that performs simultaneous clustering and learning to output a personalized policy (controller) for each cluster. Under a suitable notion of cluster separation that captures differences in closed-loop performance across systems, we prove that our approach guarantees correct clustering with high probability. Furthermore, we show that the sub-optimality gap of the policy learned for each cluster scales inversely with the size of the cluster, with no additional bias, unlike in prior works on collaborative learning-based control. Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies that enjoy statistical gains from collaboration but do not suffer sub-optimality due to inclusion of data from dissimilar processes. From a distributed implementation perspective, our method is attractive as it incurs only a mild logarithmic communication overhead.
中文摘要 众所周知，强化学习（RL）对数据需求极大。为了提高强化学习的样本效率，已有研究提出让学习算法利用来自“近似相似”过程的数据。然而，由于过程模型未知，识别哪些其他过程与之相似是一项挑战。在本研究中，我们在基准的线性二次调节器（LQR）设定下研究该问题。具体来说，我们考虑一个包含多个智能体的设定，每个智能体对应一个待控制线性过程的副本。各智能体的本地过程可以根据动力学和任务的相似性划分为若干簇。结合序贯淘汰和零阶策略优化的思想，我们提出了一种新算法，该算法同时进行聚类和学习，为每个簇输出个性化策略（控制器）。在一种能够刻画不同系统闭环性能差异的适当簇分离概念下，我们证明了该方法能以高概率保证聚类正确。此外，我们证明，为每个簇学到的策略的次优性差距与簇的规模成反比，且没有额外偏差，这与以往基于协作学习的控制研究不同。我们的工作首次揭示了如何在数据驱动控制中利用聚类来学习个性化策略：这些策略既能从协作中获得统计收益，又不会因纳入不相似过程的数据而产生次优性。从分布式实现的角度来看，我们的方法很有吸引力，因为它仅带来轻微的对数级通信开销。
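
To make "zeroth-order policy optimization" concrete, here is a minimal sketch on a single scalar LQR system. It is not the paper's algorithm: it covers neither the clustering nor the sequential elimination, and the system parameters, step sizes, and function names are assumptions chosen for illustration. The point is only that the controller gain is improved from cost evaluations alone, without using the model to compute a gradient.

```python
import random

def lqr_cost(k, a=1.2, b=1.0, q=1.0, r=1.0, x0=1.0, horizon=50):
    """Finite-horizon cost of the policy u = -k*x on x_next = a*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost

def zeroth_order_step(k, rng, radius=0.05, lr=0.05):
    """One two-point zeroth-order update that uses only cost evaluations."""
    # Random direction on the unit sphere; in one dimension that is +1 or -1,
    # so the estimate reduces to a central difference.
    direction = rng.choice([-1.0, 1.0])
    delta = lqr_cost(k + radius * direction) - lqr_cost(k - radius * direction)
    grad_est = delta / (2.0 * radius) * direction
    return k - lr * grad_est

rng = random.Random(0)
k = 1.0  # assumed initial stabilizing gain: |a - b*k| < 1
initial_cost = lqr_cost(k)
for _ in range(100):
    k = zeroth_order_step(k, rng)
final_cost = lqr_cost(k)
```

In higher dimensions the direction would be drawn uniformly from the unit sphere and the estimate scaled by the dimension; the scalar case keeps the sketch short.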

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Video-R4：通过视觉反复思考强化富含文本的视频推理

Authors: Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.17490
Pdf link: https://arxiv.org/pdf/2511.17490
Abstract Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
中文摘要 理解富含文本的视频需要阅读细小而短暂的文字线索，这些线索往往需要反复查看。然而，大多数视频问答模型依赖于对固定帧的单次感知，导致幻觉，并在细粒度证据上失败。受人类暂停、放大和重读关键区域的方式启发，我们提出了Video-R4（Reinforcing Text-Rich Video Reasoning with Visual Rumination），这是一种执行视觉反复思考的视频推理LMM：迭代地选择帧、放大信息丰富的区域、对检索到的像素重新编码，并更新其推理状态。我们构建了两个带有可执行反复思考轨迹的数据集：用于监督练习的Video-R4-CoT-17k和用于强化学习的Video-R4-RL-30k。我们提出了一个多阶段的反复思考学习框架，通过SFT和基于GRPO的强化学习逐步微调一个7B LMM，使其学习原子视觉操作和混合视觉操作。Video-R4-7B在M4-ViteVQA上取得了最先进的结果，并进一步泛化到多页文档问答、幻灯片问答和通用视频问答，表明迭代式反复思考是基于像素的多模态推理的有效范式。

Keyword: diffusion policy

No results found