Generated: 2026-01-13 16:37:20 (UTC+8); arXiv release time: 2026-01-13 20:00 EST (2026-01-14 09:00 UTC+8)

74 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

Deep Q-Network Based Resilient Drone Communication: Neutralizing First-Order Markov Jammers

The Impact of Post-training on Data Contamination

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control

HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants

TIR-Flow: Active Video Search and Reasoning with Frozen VLMs

TimeGNN-Augmented Hybrid-Action MARL for Fine-Grained Task Partitioning and Energy-Aware Offloading in MEC

Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness

Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

Walk the PLANC: Physics-Guided RL for Agile Humanoid Locomotion on Constrained Footholds

How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?

Future-as-Label: Scalable Supervision from Real-World Outcomes

Dynamic Incentivized Cooperation under Changing Rewards

Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs

Deep Reinforcement Learning based Control Design for Aircraft Recovery from Loss-of-Control Scenario

Coupling Smoothed Particle Hydrodynamics with Multi-Agent Deep Reinforcement Learning for Cooperative Control of Point Absorbers

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODASER) for Safe Reinforcement Learning in Optimal Control

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Object-Centric World Models Meet Monte Carlo Tree Search

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

Reinforcement Learning-Guided Dynamic Multi-Graph Fusion for Evacuation Traffic Prediction

Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget

Characterising Toxicity in Generative Large Language Models

On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning

GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

Code Evolution for Control: Synthesizing Policies via LLM-Driven Evolutionary Search

A Brain-like Synergistic Core in LLMs Drives Behaviour and Learning

Personality-Aware Reinforcement Learning for Persuasive Dialogue with LLM-Driven Simulation

Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning

ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval System

Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning

Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling

Agents of Diffusion: Enhancing Diffusion Language Models with Multi-Agent Reinforcement Learning for Structured Data Generation (Extended Version)

AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

Structured Reasoning for Large Language Models

Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Group Pattern Selection Optimization: Let LRMs Pick the Right Pattern for Reasoning

The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

LRAS: Advanced Legal Reasoning with Agentic Search

Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

Heterogeneous Multi-Expert Reinforcement Learning for Long-Horizon Multi-Goal Tasks in Autonomous Forklifts

Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

Reward Modeling from Natural Language Human Feedback

OpenTinker: Separating Concerns in Agentic Reinforcement Learning

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

Graph Inference Towards ICD Coding

Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

Stagewise Reinforcement Learning and the Geometry of the Regret Landscape

GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation

Clipped Affine Policy: Low-Complexity Near-Optimal Online Power Control for Energy Harvesting Communications over Fading Channels

Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

Hiking in the Wild: A Scalable Perceptive Parkour Framework for Humanoids

Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding

Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

Data-driven control of hydraulic impact hammers under strict operational and control constraints

Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation

Video Generation Models in Robotics - Applications, Research Challenges, Future Directions

Keyword: diffusion policy

A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control
