Generated: 2026-02-24 16:54:58 (UTC+8); arXiv release time: 2026-02-24 20:00 EST (2026-02-25 09:00 UTC+8)

58 relevant papers today

Keyword: reinforcement learning

FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations


Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning


Deep Reinforcement Learning for Optimizing Energy Consumption in Smart Grid Systems


1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World


Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications


DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning


Adaptive Time Series Reasoning via Segment Selection


Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)


In-Context Planning with Latent Temporal Abstractions


LMFPPO-UBP: Local Mean Field Proximal Policy Optimization with Unbalanced Punishment for Spatial Public Goods Games


Task-Aware Exploration via a Predictive Bisimulation Metric


HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning


TAG: Thinking with Action Unit Grounding for Facial Expression Recognition


Carbon-aware decentralized dynamic task offloading in MIMO-MEC networks via multi-agent reinforcement learning


Issues with Measuring Task Complexity via Random Policies in Robotic Tasks


VariBASed: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning


Gait Asymmetry from Unilateral Weakness and Improvement With Ankle Assistance: a Reinforcement Learning based Simulation Study


TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models


DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation


IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition


MagicAgent: Towards Generalized Agent Planning


Learning to Detect Language Model Training Data via Active Reconstruction


Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation


Adaptive Problem Generation via Symbolic Representations


How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization


Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment


Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts


DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation


ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer's Disease


TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics


Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering


Soft Sequence Policy Optimization: Bridging GMPO and SAPO


LLMs Can Learn to Reason Via Off-Policy RL


Stable Deep Reinforcement Learning via Isotropic Gaussian Representations


IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking


RAmmStein: Regime Adaptation in Mean-reverting Markets with Stein Thresholds -- Optimal Impulse Control in Concentrated AMMs


A Reinforcement Learning-based Transmission Expansion Framework Considering Strategic Bidding in Electricity Markets


Sizing of Battery Considering Renewable Energy Bidding Strategy with Reinforcement Learning


UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment


SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning


How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1


Cost-Aware Diffusion Active Search


Advantage-based Temporal Attack in Reinforcement Learning


CACTO-BIC: Scalable Actor-Critic Learning via Biased Sampling and GPU-Accelerated Trajectory Optimization


TextShield-R1: Reinforced Reasoning for Tampered Text Detection


Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent


DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning


Uncertainty-Aware Rank-One MIMO Q Network Framework for Accelerated Offline Reinforcement Learning


Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling


Sparse Masked Attention Policies for Reliable Generalization


RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection


A Secure and Private Distributed Bayesian Federated Learning Design


noDice: Inference for Discrete Probabilistic Programs with Nondeterminism and Conditioning


Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning


Adaptive Underwater Acoustic Communications with Limited Feedback: An AoI-Aware Hierarchical Bandit Approach


ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models


LAD: Learning Advantage Distribution for Reasoning


Keyword: diffusion policy

AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation
