Generated: 2026-01-29 16:43:27 (UTC+8); arXiv release time: 2026-01-29 20:00 EST (2026-01-30 09:00 UTC+8)

33 related papers today

Keyword: reinforcement learning

Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

E2HiL: Entropy-Guided Sample Selection for Efficient Real-World Human-in-the-Loop Reinforcement Learning

Distributional value gradients for stochastic environments

Techno-economic optimization of a heat-pipe microreactor, part II: multi-objective optimization analysis

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

In-Context Reinforcement Learning From Suboptimal Historical Data

A Reinforcement Learning Based Universal Sequence Design for Polar Codes

Rewarding Intellectual Humility: Learning When Not To Answer In Large Language Models

Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

Proactive SFC Provisioning with Forecast-Driven DRL in Data Centers

Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Fair Recourse for All: Ensuring Individual and Group Fairness in Counterfactual Explanations

Inequality in Congestion Games with Learning Agents

Ranking-aware Reinforcement Learning for Ordinal Ranking

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

GPO: Growing Policy Optimization for Legged Robot Locomotion and Whole-Body Control

Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models

One Step Is Enough: Dispersive MeanFlow Policy Optimization

Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions

GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Less is More: Clustered Cross-Covariance Control for Offline RL

SERA: Soft-Verified Efficient Repository Agents

Reinforcement Learning via Self-Distillation

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting

Keyword: diffusion policy

No results found.