Generated: 2026-02-13 16:48:19 (UTC+8); arXiv release time: 2026-02-13 20:00 EST (2026-02-14 09:00 UTC+8)

54 relevant papers today

Keyword: reinforcement learning

Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

大型语言模型对齐的机制性可解释性:进展、挑战与未来方向

TDPNavigator-Placer: Thermal- and Wirelength-Aware Chiplet Placement in 2.5D Systems Through Multi-Agent Reinforcement Learning

TDPNavigator-Placer:通过多智能体强化学习实现2.5D系统中热感知与线长感知的芯粒布局

When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

何时以及该问什么:AskBench 和评分标准引导的 RLVR 用于 LLM 澄清

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox:构建软件工程代理的无容器强化学习

Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT

修补分布不匹配:面向稳定离策略SFT的强化学习重写智能体

Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

通过行为代理优化推动主动代理的帕累托前沿

Can We Really Learn One Representation to Optimize All Rewards?

我们真的能学会一种表征来优化所有奖励吗?

Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

通过鲁棒值分解实现分布鲁棒的合作式多智能体强化学习

Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning

应得的功劳:跨模态连接推动MLLM推理的精准强化学习

RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas

基于普通商用网络的强化学习:通过无损稀疏增量克服带宽障碍

Future Mining: Learning for Safety and Security

未来采矿:安全与保障的学习

Unifying Stable Optimization and Reference Regularization in RLHF

RLHF 中稳定优化与参考正则化的统一

Adaptive Milestone Reward for GUI Agents

图形界面代理的自适应里程碑奖励

Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

原生推理模型:训练语言模型以基于不可验证数据进行推理

SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

SIGHT:面向搜索智能体的自证据与信息增益多样化分支强化学习

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

PRIME:数学与工程中可验证推理的过程-结果对齐基准

Learning to Configure Agentic AI Systems

学习配置代理型人工智能系统

The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why -- A Survey from MARL to Emergent Language and LLMs

多智能体通信的五个W:谁与谁、何时、交流什么以及为何交流 -- 从MARL到涌现语言与大型语言模型的综述

Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm

夸克医疗对齐:一种整体多维对齐与协作优化范式

TabSieve: Explicit In-Table Evidence Selection for Tabular Prediction

TabSieve:面向表格预测的显式表内证据选择

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

DICE:扩散大型语言模型在生成CUDA核方面表现出色

STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

STVG-R1:通过强化学习激励视频中的实例级推理与定位

AC-MASAC: An Attentive Curriculum Learning Framework for Heterogeneous UAV Swarm Coordination

AC-MASAC:面向异构无人机集群协调的注意力课程学习框架

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

TSR:面向LLM智能体多轮强化学习的轨迹搜索 Rollout

Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

温度作为元策略:LLM强化学习中的自适应温度

RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation

RELATE:一个强化学习增强的广告文本生成大型语言模型框架

Detecting RLVR Training Data via Structural Convergence of Reasoning

通过推理结构收敛检测RLVR训练数据

Temporal Difference Learning with Constrained Initial Representations

带有约束初始表示的时间差分学习

From Path Signatures to Sequential Modeling: Incremental Signature Contributions for Offline RL

从路径签名到序列建模:用于离线强化学习的增量签名贡献

Predicting LLM Output Length via Entropy-Guided Representations

通过熵引导表示预测LLM输出长度

In-Context Function Learning in Large Language Models

大型语言模型中的上下文函数学习

Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

面向可扩展网络数据采集的高效爬取(扩展版)

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

Echo:通过音频交错推理迈向高级音频理解

Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

将 Puzzle 扩展至混合专家推理模型,并应用于GPT-OSS加速

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Gaia2:动态与异步环境中的大型语言模型代理基准测试

Accelerating Robotic Reinforcement Learning with Agent Guidance

借助智能体指导加速机器人强化学习

FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client

FedGRPO:利用来自领域客户端的组相对奖励对基础模型进行隐私保护优化

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Composition-RL:为大型语言模型强化学习组合可验证提示

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

通过在线强化学习与实机基准奖励提升大型语言模型的高性能计算代码生成能力

Geometry of Uncertainty: Learning Metric Spaces for Multimodal State Estimation in RL

不确定性的几何:为强化学习中的多模态状态估计学习度量空间

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain-0.5M*:一种从基于世界模型的强化学习中学习的VLA

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

关于带有$Q^\star$-近似和部分覆盖的离线强化学习的复杂性

Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

停止不必要的反思:通过自适应反思与长度协调惩罚训练大型推理模型(LRMs)实现高效推理

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

P-GenRM:具备测试时基于用户扩展的个性化生成式奖励模型

Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning

Meta-Sel:通过监督元学习为上下文学习高效选择示例

Capability-Oriented Training Induced Alignment Risk

能力导向训练诱导的对齐风险

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

超越教师的学习:带奖励外推的广义同策略蒸馏

Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning

Seq2Seq2Seq:通过离散潜在 Transformer 和强化学习实现无损数据压缩

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

DeepGen 1.0:一个轻量级统一多模态模型,用于推进图像生成和编辑技术

Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

迈向同策略SFT:分布判别理论及其在LLM训练中的应用

Any House Any Task: Scalable Long-Horizon Planning for Abstract Human Tasks

任意房屋、任意任务:面向抽象人类任务的可扩展长时程规划

Intrinsic-Energy Joint Embedding Predictive Architectures Induce Quasimetric Spaces

内在能量联合嵌入预测架构诱导出准度量空间

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

CM2:面向多轮多步智能体工具使用的清单奖励强化学习

Keyword: diffusion policy

Learning to Manipulate Anything: Revealing Data Scaling Laws in Bounding-Box Guided Policies

学习操作任何事物:揭示边界框引导策略中的数据缩放规律