Generated: 2026-01-16 16:34:08 (UTC+8); arXiv release time: 2026-01-16 20:00 EST (2026-01-17 09:00 UTC+8)

26 related papers today

Keyword: reinforcement learning

StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Eluder dimension: localise it!

OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

Event-Driven Deep RL Dispatcher for Post-Storm Distribution System Restoration

Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

History Is Not Enough: An Adaptive Dataflow System for Financial Time-Series Synthesis

DecisionLLM: Large Language Models for Long Sequence Decision Exploration

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand

HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

The impact of tactile sensor configurations on grasp learning efficiency -- a comparative evaluation in simulation

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

SuS: Strategy-aware Surprise for Intrinsic Exploration

FastStair: Learning to Run Up Stairs with Humanoid Robots

CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning

Reinforcement Learning with Multi-Step Lookahead Information Via Adaptive Batching

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

Combinatorial Optimization Augmented Machine Learning

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Institutional AI: A Governance Framework for Distributional AGI Safety

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Keyword: diffusion policy

No results.