Generated: 2026-03-20 16:44:25 (UTC+8); arXiv release time: 2026-03-20 20:00 EDT (2026-03-21 08:00 UTC+8)

46 related articles today

Keyword: reinforcement learning

Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

Uncovering Latent Phase Structures and Branching Logic in Locomotion Policies: A Case Study on HalfCheetah

Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation

How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Mathematical Foundations of Deep Learning

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Efficient and Versatile Quadrupedal Skating: Optimal Co-design via Reinforcement Learning and Bayesian Optimization

Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

iSatCR: Graph-Empowered Joint Onboard Computing and Routing for LEO Data Delivery

Learning to Self-Evolve

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Memento-Skills: Let Agents Design Agents

Automatic Configuration of LLM Post-Training Pipelines

Mi:dm K 2.5 Pro

ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Context Bootstrapped Reinforcement Learning

Maximum-Entropy Exploration with Future State-Action Visitation Measures

CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Markov Potential Game and Multi-Agent Reinforcement Learning for Autonomous Driving

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Keyword: diffusion policy

No results found.