Generated: 2026-03-06 16:42:08 (UTC+8); arXiv release time: 2026-03-06 20:00 EST (2026-03-07 09:00 UTC+8)

37 related papers today

Keyword: reinforcement learning

CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

CTRL-RAG:面向上下文忠实RAG模型的基于对比似然奖励的强化学习

Auction-Based RIS Allocation With DRL: Controlling the Cost-Performance Trade-Off

基于拍卖与DRL的RIS分配:控制成本与性能的权衡

Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

高效大型语言模型推断的动态模型路由与级联:一项综述

Transformer-Based Multipath Congestion Control: A Decoupled Approach for Wireless Uplinks

基于Transformer的多径拥塞控制:面向无线上行链路的解耦方法

Risk-Aware Reinforcement Learning for Mobile Manipulation

面向移动操作的风险感知强化学习

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

在强化学习中利用群体级自然语言反馈引导探索

When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift

传感器失效时:传感器漂移下稳健PPO的时间序列模型

Optimizing Language Models for Crosslingual Knowledge Consistency

面向跨语言知识一致性的语言模型优化

LLM-Guided Decentralized Exploration with Self-Organizing Robot Teams

基于自组织机器人团队的LLM引导去中心化探索

Distributional Reinforcement Learning with Information Bottleneck for Uncertainty-Aware DRAM Equalization

结合信息瓶颈的分布强化学习,用于不确定性感知的DRAM均衡

Selfish Cooperation Towards Low-Altitude Economy: Integrated Multi-Service Deployment with Resilient Federated Reinforcement Learning

面向低空经济的自利合作:基于韧性联邦强化学习的集成多业务部署

Adaptive Personalized Federated Reinforcement Learning for RIS-Assisted Aerial Relays in SAGINs with Fluid Antennas

面向配备流体天线的SAGIN中RIS辅助空中中继的自适应个性化联邦强化学习

Diffusion Policy through Conditional Proximal Policy Optimization

通过条件近端策略优化实现扩散策略

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

VISA:通过屏蔽式适应注入价值观,实现个性化LLM对齐

SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

SCoUT:多智能体强化学习中通过效用引导的时间分组实现可扩展通信

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

BandPO:通过概率感知界限桥接信赖域与比率裁剪,用于LLM强化学习

Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

面向混合自动语音识别的联邦异构语言模型优化

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

$\nabla$-Reasoner:通过潜空间中的测试时梯度下降进行LLM推理

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

3D-RFT:基于视频的3D场景理解强化微调

Competitive Multi-Operator Reinforcement Learning for Joint Pricing and Fleet Rebalancing in AMoD Systems

竞争性多运营商强化学习,用于AMoD系统中的联合定价和车队再平衡

BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry

BioLLMAgent:一种具有增强结构可解释性的混合框架,用于计算精神病学中模拟人类决策

Formal Entropy-Regularized Control of Stochastic Systems

随机系统的形式化熵正则化控制

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

WebFactory:将基础语言智能自动压缩为落地的Web智能体

Reward-Conditioned Reinforcement Learning

奖励条件强化学习

Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics

任务与行为的分离:机器人强化学习中的两阶段奖励课程

LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

LBM:通过推理和行动实现的层级大型自动竞价模型

KARL: Knowledge Agents via Reinforcement Learning

KARL:通过强化学习构建知识智能体

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

通过结合音频-文本语义奖励的测试时强化学习提升ASR的稳健性

Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Wiki-R1:通过数据和抽样课程激励知识型VQA的多模态推理

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

SarcasmMiner:一个双轨后训练框架,用于稳健的视听讽刺推理

Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts

对黑盒低语:用视觉提示引导增强冻结的OCR

Knowledge Divergence and the Value of Debate for Scalable Oversight

知识分歧与辩论对可扩展监督的价值

Latent Policy Steering through One-Step Flow Policies

通过一步流策略进行潜在策略引导

DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

DiSCTT:共识引导的自课程学习,实现推理中高效的测试时适应

Keyword: diffusion policy

Diffusion Policy through Conditional Proximal Policy Optimization

通过条件近端策略优化实现扩散策略

Task-Relevant and Irrelevant Region-Aware Augmentation for Generalizable Vision-Based Imitation Learning in Agricultural Manipulation

面向农业操作中可泛化视觉模仿学习的任务相关与无关区域感知增强

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

SeedPolicy:通过自演化扩散策略实现机器人操作的时域扩展