Generated: 2026-03-23 17:01:46 (UTC+8); arXiv release time: 2026-03-23 20:00 EDT (2026-03-24 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models

Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion

Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

PrefPO: Pairwise Preference Prompt Optimization

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning

Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Deep Hilbert--Galerkin Methods for Infinite-Dimensional PDEs and Optimal Control

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Teaching an Agent to Sketch One Part at a Time

Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering

EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models

PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

SaFRO: Satisfaction-Aware Fusion via Dual-Relative Policy Optimization for Short-Video Search

DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

ContractionPPO: Certified Reinforcement Learning via Differentiable Contraction Layers

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

LoopRPT: Reinforcement Pre-Training for Looped Language Models

FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment

Generalized Task-Driven Design of Soft Robots via Reduced-Order FEM-based Surrogate Modeling

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Learning Adaptive Parameter Policies for Nonlinear Bayesian Filtering

Robust Beam Codebooks for mmWave/THz Systems: Toward a Stochastic RL Approach

SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia

GustPilot: A Hierarchical DRL-INDI Framework for Wind-Resilient Quadrotor Navigation

Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

ReViSQL: Achieving Human-Level Text-to-SQL

Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

Fine-tuning Timeseries Predictors Using Reinforcement Learning

Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning

Keyword: diffusion policy

No results found.