Generated: 2026-01-08 16:34:55 (UTC+8); arXiv release time: 2026-01-08 20:00 EST (2026-01-09 09:00 UTC+8)

There are 44 related papers today

Keyword: reinforcement learning

PC2P: Multi-Agent Path Finding via Personalized-Enhanced Communication and Crowd Perception

Autonomous Threat Detection and Response in Cloud Security: A Comprehensive Survey of AI-Driven Strategies

Mastering the Game of Go with Self-play Experience Replay

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

Exploration Through Introspection: A Self-Aware Reward Model

Sensor to Pixels: Decentralized Swarm Gathering via Image-Based Reinforcement Learning

FIRE-VLM: A Vision-Language-Driven Reinforcement Learning Framework for UAV Wildfire Tracking in a Physics-Grounded Fire Digital Twin

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Adaptive Model-Based Reinforcement Learning for Orbit Feedback Control in NSLS-II Storage Ring

Semantic Belief-State World Model for 3D Human Motion Prediction

VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

From Score to Sound: An End-to-End MIDI-to-Motion Pipeline for Robotic Cello Performance

Interleaved Tool-Call Reasoning for Protein Function Understanding

Locomotion Beyond Feet

Shielded RecRL: Explanation Generation for Recommender Systems without Ranking Degradation

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

Dual-Attention Heterogeneous GNN for Multi-robot Collaborative Area Search via Deep Reinforcement Learning

TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization

EDCO: Dynamic Curriculum Orchestration for Domain-specific Large Language Model Fine-tuning

O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs

ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations

IndexTTS 2.5 Technical Report

Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

On-Device Deep Reinforcement Learning for Decentralized Task Offloading Performance trade-offs in the training process

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Agentic Rubrics as Contextual Verifiers for SWE Agents

Hierarchical GNN-Based Multi-Agent Learning for Dynamic Queue-Jump Lane and Emergency Vehicle Corridor Formation

Keyword: diffusion policy

No results found