Generated: 2026-02-05 16:48:14 (UTC+8); arXiv release time: 2026-02-05 20:00 EST (2026-02-06 09:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

GOPO: Policy Optimization using Ranked Rewards

Autonomous AI Agents for Real-Time Affordable Housing Site Selection: Multi-Objective Reinforcement Learning Under Regulatory Constraints

Safety-Critical Reinforcement Learning with Viability-Based Action Shielding for Hypersonic Longitudinal Flight

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning

Likelihood-Based Reward Designs for General LLM Reasoning

After Talking with 1,000 Personas: Learning Preference-Aligned Proactive Assistants From Large-Scale Persona Interactions

Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling

Learning to Reason in 13 Parameters

Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting

Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking

Topology-Aware Revival for Efficient Sparse Training

Piece of CAKE: Adaptive Execution Engines via Microsecond-Scale Learning

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Steering LLMs via Scalable Interactive Oversight

ALORE: Autonomous Large-Object Rearrangement with a Legged Manipulator

CoLT: Reasoning with Chain of Latent Tool Calls

Scaling Agentic Verifier for Competitive Coding

From Ambiguity to Action: A POMDP Perspective on Partial Multi-Label Ambiguity and Its Horizon-One Resolution

Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning

MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation

ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL

Mixture of Masters: Sparse Chess Language Models with Player Routing

Learning the Value Systems of Agents with Preference-based and Inverse Reinforcement Learning

Understanding Degradation with Vision Language Model

Dual Mind World Model Inspired Network Digital Twin for Access Scheduling

Reinforcement Learning-based Home Energy Management with Heterogeneous Batteries and Stochastic EV Behaviour

Stochastic Decision Horizons for Constrained Reinforcement Learning

QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Multi-Source Retrieval and Reasoning for Legal Sentencing Prediction

ERNIE 5.0 Technical Report

Rationality Measurement and Theory for Reinforcement Learning Agents

When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

Evolving Afferent Architectures: Biologically-inspired Models for Damage-Avoidance Learning

Joint Sleep Mode Activation and Load Balancing with Dynamic Cell Load: A Combinatorial Bandit Approach

Beyond Rewards in Reinforcement Learning for Cyber Defence

CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforced Attention Learning

Keyword: diffusion policy

DADP: Domain Adaptive Diffusion Policy
