Generated: 2026-01-27 16:38:35 (UTC+8); arXiv release time: 2026-01-27 20:00 EST (2026-01-28 09:00 UTC+8)

59 relevant papers today

Keyword: reinforcement learning

Breaking Task Impasses Quickly: Adaptive Neuro-Symbolic Learning for Open-World Robotics

Multi-Agent Deep Reinforcement Learning Under Constrained Communications

Beyond Instrumental and Substitutive Paradigms: Introducing Machine Culture as an Emergent Phenomenon in Large Language Models

Scaling medical imaging report generation with multimodal reinforcement learning

Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Generation

Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning

Structure-Aware NL-to-SQL for SFC Provisioning via AST-Masking Empowered Language Models

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Scaling Rough Terrain Locomotion with Automatic Curriculum Reinforcement Learning

PILOT: A Perceptive Integrated Low-level Controller for Loco-manipulation over Unstructured Scenes

Embodiment-Induced Coordination Regimes in Tabular Multi-Agent Q-Learning

MetaWorld: Skill Transfer and Composition in a Hierarchical World Model for Grounding High-Level Instructions

Cognitive Platform Engineering for Autonomous Cloud Operations

Quantum-Inspired Episode Selection for Monte Carlo Reinforcement Learning via QUBO Optimization

Learning to Ideate for Machine Learning Engineering Agents

Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments

Athena: Synergizing Data Prefetching and Off-Chip Prediction via Online Reinforcement Learning

DIML: Differentiable Inverse Mechanism Learning from Behaviors of Multi-Agent Learning Trajectories

Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis

SQL-Trail: Multi-Turn Reinforcement Learning with Interleaved Feedback for Text-to-SQL

ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation

Agentic AI for Self-Driving Laboratories in Soft Matter: Taxonomy, Benchmarks, and Open Challenges

SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets

Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions

Enhance the Safety in Reinforcement Learning by ADRC Lagrangian Methods

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants

Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue

VissimRL: A Multi-Agent Reinforcement Learning Framework for Traffic Signal Control Based on Vissim

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Reinforcement Learning with Distributed MPC for Fuel-Efficient Platoon Control with Discrete Gear Transitions

Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

daVinci-Dev: Agent-native Mid-training for Software Engineering

OffSeeker: Online Reinforcement Learning Is Not All You Need for Deep Research Agents

Enhancing Control Policy Smoothness by Aligning Actions with Predictions from Preceding States

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

K-Myriad: Jump-starting reinforcement learning with unsupervised parallel agents

From Classification to Ranking: Enhancing LLM Reasoning Capabilities for MBTI Personality Detection

Learning long term climate-resilient transport adaptation pathways under direct and indirect flood impacts using reinforcement learning

Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback

Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

Multi-Objective Reinforcement Learning for Efficient Tactical Decision Making for Trucks in Highway Traffic

Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

Keyword: diffusion policy

3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control
