Generated: 2025-10-31 16:28:59 (UTC+8); arXiv release time: 2025-10-31 20:00 EDT (2025-11-01 08:00 UTC+8)

38 relevant papers today

Keyword: reinforcement learning

Non-myopic Matching and Rebalancing in Large-Scale On-Demand Ride-Pooling Systems Using Simulation-Informed Reinforcement Learning

Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

Adversarial Pre-Padding: Generating Evasive Network Traffic Against Transformer-Based Classifiers

MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

Approximating Human Preferences Using a Multi-Judge Learned System

$π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion

Estimating cognitive biases with attention-aware inverse planning

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

PORTool: Tool-Use LLM Training with Rewarded Tree

Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion

Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing

GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math

One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation

Graph-Enhanced Policy Optimization in LLM Agent Training

Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments

Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search

Offline Clustering of Preference Learning with Active-data Augmentation

Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle

Towards Reinforcement Learning Based Log Loading Automation

Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

Human-in-the-loop Online Rejection Sampling for Robotic Manipulation

PolarZero: A Reinforcement Learning Approach for Low-Complexity Polarization Kernel Design

ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

Data-Efficient RLVR via Off-Policy Influence Guidance

Think Outside the Policy: In-Context Steered Policy Optimization

InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

Emu3.5: Native Multimodal Models are World Learners

A DRL-Empowered Multi-Level Jamming Approach for Secure Semantic Communication

Low-Altitude UAV-Carried Movable Antenna for Joint Wireless Power Transfer and Covert Communications

The Era of Agentic Organization: Learning to Organize with Language Models

Kimi Linear: An Expressive, Efficient Attention Architecture

A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation

Defeating the Training-Inference Mismatch via FP16

Keyword: diffusion policy

No results