Generated: 2025-10-08 16:29:04 (UTC+8); arXiv announcement time: 2025-10-08 20:00 EDT (2025-10-09 08:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing

Percepta: High Performance Stream Processing at the Edge

Adversarial Reinforcement Learning for Offensive and Defensive Agents in a Simulated Zero-Sum Network Environment

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

Adjusting the Output of Decision Transformer with Action Gradient

Adaptive Dynamics Planning for Robot Navigation

Teacher-Student Guided Inverse Modeling for Steel Final Hardness Estimation

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs

Vul-R2: A Reasoning LLM for Automated Vulnerability Repair

TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies

Joint Communication Scheduling and Velocity Control for Multi-UAV-Assisted Post-Disaster Monitoring: An Attention-Based In-Context Learning Approach

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

Risk level dependent Minimax Quantile lower bounds for Interactive Statistical Decision Making

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Prompt reinforcing for long-term planning of large language models

EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models

Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion

Information-Theoretic Policy Pre-Training with Empowerment

Optimal Batched Scheduling of Stochastic Processing Networks Using Atomic Action Decomposition

From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

ASPO: Asymmetric Importance Sampling Policy Optimization

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL

The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks

Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Keyword: diffusion policy

No results.