Generated: 2025-10-28 16:31:57 (UTC+8); arXiv release time: 2025-10-28 20:00 EDT (2025-10-29 08:00 UTC+8)

70 related papers today

Keyword: reinforcement learning

Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review

Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

Embodied Navigation with Auxiliary Task of Action Description Prediction

GAPO: Group Adaptive Policy Optimization for Real-World Code Edit

SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization

SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Computational Hardness of Reinforcement Learning with Partial $q^π$-Realizability

Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

Is Temporal Difference Learning the Gold Standard for Stitching in RL?

Do You Trust the Process?: Modeling Institutional Trust for Community Adoption of Reinforcement Learning Policies

Online Optimization for Offline Safe Reinforcement Learning

Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability

Agentic Reinforcement Learning for Real-World Code Repair

STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks

EasyUUV: An LLM-Enhanced Universal and Lightweight Sim-to-Real Reinforcement Learning Framework for UUV Attitude Control

OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue

Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics

Dopamine-driven synaptic credit assignment in neural networks

PACR: Progressively Ascending Confidence Reward for LLM Reasoning

CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles

Teaching Machine Learning Through Cricket: A Practical Engineering Education Approach

A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems

Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization

Resource Allocation for XR with Edge Offloading: A Reinforcement Learning Approach

Transitive RL: Value Learning via Divide and Conquer

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions

Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing

UCB-type Algorithm for Budget-Constrained Expert Learning

FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning

RL-AVIST: Reinforcement Learning for Autonomous Visual Inspection of Space Targets

Policies over Poses: Reinforcement Learning based Distributed Pose-Graph Optimization for Multi-Robot SLAM

Scalable Supervising Software Agents with Patch Reasoner

VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions

HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning

Toward Agents That Reason About Their Computation

Guardian: Decoupling Exploration from Safety in Reinforcement Learning

Offline Preference Optimization via Maximum Marginal Likelihood Estimation

Never Too Rigid to Reach: Adaptive Virtual Model Control with LLM- and Lyapunov-Based Reinforcement Learning

Hazard-Responsive Digital Twin for Climate-Driven Urban Resilience and Equity

Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation Planner

Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

AirFed: Federated Graph-Enhanced Multi-Agent Reinforcement Learning for Multi-UAV Cooperative Mobile Edge Computing

Think before Recommendation: Autonomous Reasoning-enhanced Recommender

Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI

Guiding Skill Discovery with Foundation Models

TARC: Time-Adaptive Robotic Control

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

Code Aesthetics with Agentic Reward Feedback

CNOT Minimal Circuit Synthesis: A Reinforcement Learning Approach

Transferable Deep Reinforcement Learning for Cross-Domain Navigation: from Farmland to the Moon

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations

Causal Deep Q Network

An Information-Theoretic Analysis of Out-of-Distribution Generalization in Meta-Learning with Applications to Meta-RL

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

Learning to Reason Efficiently with Discounted Reinforcement Learning

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Sequential Multi-Agent Dynamic Algorithm Configuration

Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Think Twice: Branch-and-Rethink Reasoning Reward Model

Keyword: diffusion policy

Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising

ManiDP: Manipulability-Aware Diffusion Policy for Posture-Dependent Bimanual Manipulation

Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation
