Generated: 2025-12-19 16:33:00 (UTC+8); arXiv release time: 2025-12-19 20:00 EST (2025-12-20 09:00 UTC+8)

29 relevant papers today

Keyword: reinforcement learning

Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi-Agent Architectures (XAMT)

DSO: Direct Steering Optimization for Bias Mitigation

Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models

Techno-economic optimization of a heat-pipe microreactor, part I: theory and cost optimization

INTELLECT-3: Technical Report

MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Hypernetworks That Evolve Themselves

NDRL: Cotton Irrigation and Nitrogen Application with Nested Dual-Agent Reinforcement Learning

StarCraft+: Benchmarking Multi-agent Algorithms in Adversary Paradigm

Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment

ParamExplorer: A framework for exploring parameters in generative art

Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Implementing a Sharia Chatbot as a Consultation Medium for Questions About Islam

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Olaf: Bringing an Animated Character to Life in the Physical World

Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning

Meta-RL Induces Exploration in Language Agents

ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Keyword: diffusion policy

TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration
