Skip to content

Knowledge Graphs

Knowledge Graphs

Publish Date Title Authors Homepage Code
2026-05-15 FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast Igor Bogdanov et.al. 2605.16233v1 null
2026-05-15 Argus: Evidence Assembly for Scalable Deep Research Agents Zhen Zhang et.al. 2605.16217v1 null
2026-05-15 Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most Tahreem Yasir et.al. 2605.16207v1 null
2026-05-15 Look Before You Leap: Autonomous Exploration for LLM Agents Ziang Ye et.al. 2605.16143v1 null
2026-05-15 SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation Xin Zhang et.al. 2605.16117v1 null
2026-05-15 DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation Rui Chu et.al. 2605.16113v1 null
2026-05-15 Federated Imputation under Heterogeneous Feature Spaces Imane Hocine et.al. 2605.16099v1 null
2026-05-15 Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction Xiayu Liu et.al. 2605.16088v1 null
2026-05-15 Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks Jingcheng Wu et.al. 2605.16085v1 null
2026-05-15 Who Owns This Agent? Tracing AI Agents Back to Their Owners Ruben Chocron et.al. 2605.16035v1 null
2026-05-15 Judge Circuits Nils Feldhus et.al. 2605.16023v1 null
2026-05-15 Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory Isar Nejadgholi et.al. 2605.15990v1 null
2026-05-15 Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports Anita Srbinovska et.al. 2605.15978v1 null
2026-05-15 Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning Fabio Rovai et.al. 2605.15967v1 null
2026-05-15 CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts Peiyuan Li et.al. 2605.15888v1 null
2026-05-15 Shapley Neuron Values for Continual Learning: Which Neurons Matter Most? Mohammad Ali Vahedifar et.al. 2605.15877v1 null
2026-05-15 BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge Sihan Fu et.al. 2605.15815v1 null
2026-05-15 Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets Kai Hidajat et.al. 2605.15787v1 null
2026-05-15 SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows? Kean Shi et.al. 2605.15777v1 null
2026-05-15 Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model Tianqiu Zhang et.al. 2605.15733v1 null
2026-05-15 H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure Jiawei Yu et.al. 2605.15701v1 null
2026-05-15 ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models Jiahui Guang et.al. 2605.15687v1 null
2026-05-15 TFZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices Hao Wang et.al. 2605.15656v1 null
2026-05-15 A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM Shaoke Xi et.al. 2605.15617v1 null
2026-05-15 MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models Weixin Liu et.al. 2605.15589v1 null
2026-05-15 Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems Nurbek Tastan et.al. 2605.15573v1 null
2026-05-15 GiLT: Augmenting Transformer Language Models with Dependency Graphs Tianyu Huang et.al. 2605.15562v1 null
2026-05-15 Neural Point-Forms Bruno Trentini et.al. 2605.15524v1 null
2026-05-15 X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention Guruprasad Raghavan et.al. 2605.15505v1 null
2026-05-14 FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models Dmitry Stanishevskii et.al. 2605.15482v1 null
2026-05-14 Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming Wei Sheng et.al. 2605.15400v1 null
2026-05-14 PACER: Acyclic Causal Discovery from Large-Scale Interventional Data Ramon Viñas Torné et.al. 2605.15353v1 null
2026-05-14 Zero-Shot Goal Recognition with Large Language Models Kin Max Piamolini Gusmão et.al. 2605.15333v1 null
2026-05-14 Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution Han Li et.al. 2605.15301v1 null
2026-05-14 GQA-μP: The maximal parameterization update for grouped query attention Kyle R. Chickering et.al. 2605.15290v1 null
2026-05-14 Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance Vinil Pasupuleti et.al. 2605.15281v1 null
2026-05-14 FutureSim: Replaying World Events to Evaluate Adaptive Agents Shashwat Goel et.al. 2605.15188v1 null
2026-05-14 Evidential Reasoning Advances Interpretable Real-World Disease Screening Chenyu Lian et.al. 2605.15171v1 null
2026-05-14 Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment Sayantan Kumar et.al. 2605.15168v1 null
2026-05-14 MeMo: Memory as a Model Ryan Wei Heng Quek et.al. 2605.15156v1 null
2026-05-14 Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG Riccardo Terrenzi et.al. 2605.15109v1 null
2026-05-14 TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale Anurup Ganguli et.al. 2605.15053v2 null
2026-05-14 Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use Renning Pang et.al. 2605.15041v1 null
2026-05-14 Generalized Priority-Aware Shapley Value Kiljae Lee et.al. 2605.15018v1 null
2026-05-14 COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion Zihan Deng et.al. 2605.15016v1 null
2026-05-14 The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale Peter A. Jansen et.al. 2605.15011v1 null
2026-05-14 KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning Yisen Gao et.al. 2605.14907v1 null
2026-05-14 COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs Sohel Aman Khan et.al. 2605.14900v1 null
2026-05-14 BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring Zixuan Shu et.al. 2605.14886v1 null
2026-05-14 REALM: Retrospective Encoder Alignment for LFP Modeling Peicheng Wu et.al. 2605.14867v1 null
2026-05-14 Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought Lingzhe Zhang et.al. 2605.14866v1 null
2026-05-14 A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions Yu Zhang et.al. 2605.14857v1 null
2026-05-14 Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers Lukas Schelenz et.al. 2605.14855v1 null
2026-05-14 Emotion-Attended Stateful Memory (EASM):The Architecture for Hyper-Personalization at Scale Vineet Kotecha et.al. 2605.14833v1 null
2026-05-14 A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency Zhao Yang et.al. 2605.14802v1 null
2026-05-14 Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation Songyang Gao et.al. 2605.14790v1 null
2026-05-14 XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition Gong Zhiren et.al. 2605.14754v1 null
2026-05-14 Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions Qirui Liu et.al. 2605.14752v1 null
2026-05-14 EVA: Editing for Versatile Alignment against Jailbreaks Yi Wang et.al. 2605.14750v1 null
2026-05-14 Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model Minghao Wu et.al. 2605.14723v1 null
2026-05-14 On Strong Equivalence Notions in Logic Programming and Abstract Argumentation Giovanni Buraglio et.al. 2605.14721v1 null
2026-05-14 SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization Posheng Chen et.al. 2605.14704v1 null
2026-05-14 Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI Joy Bose et.al. 2605.14665v2 null
2026-05-14 Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning Chenlu Ding et.al. 2605.14636v1 null
2026-05-14 Action-Inspired Generative Models Eshwar R. A. et.al. 2605.14631v1 null
2026-05-14 SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning Kang Chen et.al. 2605.14619v1 null
2026-05-14 VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce Yuyan Chen et.al. 2605.14542v1 null
2026-05-14 Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning Edoardo Scarpel et.al. 2605.14501v1 null
2026-05-14 GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations Jingbo Yang et.al. 2605.14498v1 null
2026-05-14 Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification Truong Thanh Hung Nguyen et.al. 2605.14495v1 null
2026-05-14 Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty Tianjue Lin et.al. 2605.14494v1 null
2026-05-14 Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict Yihang Chen et.al. 2605.14473v2 null
2026-05-14 A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems Wen Wang et.al. 2605.14416v1 null
2026-05-14 Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers Xiang Li et.al. 2605.14407v1 null
2026-05-14 Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation Kyomin Hwang et.al. 2605.14404v1 null
2026-05-14 Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis Yucheng Shi et.al. 2605.14392v1 null
2026-05-14 Nexus : An Agentic Framework for Time Series Forecasting Sarkar Snigdha Sarathi Das et.al. 2605.14389v1 null
2026-05-14 Optimal Pattern Detection Tree for Symbolic Rule-Based Classification Young-Chae Hong et.al. 2605.14374v1 null
2026-05-14 Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax Zeli Su et.al. 2605.14366v1 null
2026-05-14 CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation Yuyang Wu et.al. 2605.14344v2 null
2026-05-14 Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows Zixin Chen et.al. 2605.14322v1 null
2026-05-14 ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition Shen Lin et.al. 2605.14309v2 null
2026-05-14 Web Agents Should Adopt the Plan-Then-Execute Paradigm Julien Piet et.al. 2605.14290v1 null
2026-05-14 Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems Ling Wang et.al. 2605.14259v1 null
2026-05-14 Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology Jesseba Fernando et.al. 2605.14258v1 null
2026-05-14 What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction Adam Nohejl et.al. 2605.14257v1 null
2026-05-14 Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks Kai Sun et.al. 2605.14252v1 null
2026-05-13 Why Retrieval-Augmented Generation Fails: A Graph Perspective Kai Guo et.al. 2605.14192v1 null
2026-05-13 Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models Harshita Chopra et.al. 2605.14177v1 null
2026-05-13 Unsteady Metrics and Benchmarking Cultures of AI Model Builders Stefan Baack et.al. 2605.14164v1 null
2026-05-13 Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) Marius S. Knorr et.al. 2605.14126v1 null
2026-05-13 MathAtlas: A Benchmark for Autoformalization in the Wild Nilay Patel et.al. 2605.14061v1 null
2026-05-13 Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation Ignacio Sastre et.al. 2605.14053v1 null
2026-05-13 VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use Juan S. Santillana et.al. 2605.13989v1 null
2026-05-13 Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines Katherine Lambert et.al. 2605.13981v1 null
2026-05-13 Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction Darius A. Faroughy et.al. 2605.13950v1 null
2026-05-13 WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data Ziheng Zhang et.al. 2605.13846v1 null
2026-05-13 An LLM-Based System for Argument Reconstruction Paulo Pirozelli et.al. 2605.13793v1 null
2026-05-13 EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents Jiaqi Liu et.al. 2605.13941v1 null
2026-05-13 RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning Andrea Morandi et.al. 2605.13695v1 null

Abstracts

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

2605.16233v1 by Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

摘要:LLM代理能否通過自生成記憶而不進行梯度更新來改善決策?我們提出了FORGE(失敗優化反思畢業與演化),這是一種分階段的基於人群的協議,旨在為層次化ReAct代理演化提示注入的自然語言記憶。FORGE包裹了一個反思風格的內循環,其中一個專門的反思代理(使用相同的底層LLM,沒有從更強模型中提煉)將失敗的軌跡轉換為可重用的知識工件:文本啟發式(規則)、少量示範(示例)或兩者結合(混合),並設有一個外循環,在階段之間將表現最佳的實例的記憶傳播到整個人群,並通過畢業標準凍結收斂的實例。我們在CybORG CAGE-2上進行評估,這是一個在30步視野下針對B線攻擊者的隨機網絡防禦POMDP,其中四個測試的LLM家族(Gemini-2.5-Flash-Lite、Grok-4-Fast、Llama-4-Maverick、Qwen3-235B)都顯示出強烈的負面、重尾的零-shot獎勵。與零-shot基線和反思基線(孤立的單流學習)相比,FORGE在所有12個模型表示條件下將平均評估回報提高了1.7-7.7$\times$,並在反思上提高了29-72%,將主要失敗率(低於$-100$)降低到約1%。我們發現(1)人群廣播是一個關鍵機制,無畢業的消融實驗確認廣播帶來了性能增益,而畢業主要節省計算資源;(2)示例在四個模型中的三個模型上實現了最強的回報,規則則提供了最佳的成本可靠性配置,令標記數減少約40%;(3)較弱的基線模型受益不成比例,這表明FORGE可能減輕能力差距,而不是放大強模型。所有證據均限於CAGE-2 B線;跨家族的發現僅為方向性證據。

Argus: Evidence Assembly for Scalable Deep Research Agents

2605.16217v1 by Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

摘要:深度研究代理在複雜的信息搜尋任務上取得了顯著的進展。即使是長 ReAct 風格的展開也僅探索單一的軌跡,而最近的最先進系統則通過平行搜尋和聚合來擴展推理時間計算。然而,深度研究的答案由互補的證據組成,這些平行展開往往重複而不是補全,導致收益遞減,同時將聚合上下文推向模型的極限。我們提出 Argus,一個代理系統,其中搜尋者和導航者合作,將深度研究視為從互補的證據片段中組裝拼圖,而不是在平行中強行獲得整個答案。搜尋者通過 ReAct 風格的互動收集給定子查詢的證據痕跡。導航者維護一個共享的證據圖,驗證哪些片段仍然缺失,派遣搜尋者去收集它們,並對完成的圖進行推理以產出源追蹤的最終答案。我們使用強化學習訓練導航者進行驗證、派遣和綜合,同時獨立訓練搜尋者以保持標準的 ReAct 代理。最終的導航者支持單一搜尋者或多個平行搜尋者的展開,而無需重新訓練。在 35B-A3B MoE 主幹上構建的搜尋者和導航者,Argus 在單一搜尋者上獲得 5.5 分,在 8 個平行搜尋者上獲得 12.7 分,平均在八個基準測試中表現。使用 64 個搜尋者時,它在 BrowseComp 上達到 86.2,超越我們基準測試的每個專有代理,而導航者的推理上下文保持在 21.5K 令牌以下。

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

2605.16207v1 by Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

摘要:有效的輔導需要區分最佳、有效但次優以及不正確的學生解答,這一區分對於智能輔導系統(ITS)至關重要,但對於基於LLM的輔導尚未進行測試。隨著LLM越來越多地被探索作為ITS的對話補充,評估它們的診斷精確性變得至關重要。我們提出了一個基準,評估七個LLM反饋代理在命題邏輯中的表現,使用知識圖譜衍生的真實數據,涵蓋10,836對解答-反饋配對和三種反饋條件。模型在最佳步驟上達到了接近上限的表現,但系統性地過度拒絕有效但次優的推理,並過度驗證不正確的解答,這恰恰是自適應輔導最為重要的地方。這些失誤在不同模型中持續存在,無論解答的上下文如何,這表明是架構而非信息的限制。此外,準確的診斷並未可靠地產生可教學的可行反饋,顯示出診斷判斷與教學有效性之間的差距。我們的研究結果表明,LLM更適合用於混合架構,其中基於KG的模型處理診斷,而LLM則支持開放式的支架和對話。

Look Before You Leap: Autonomous Exploration for LLM Agents

2605.16143v1 by Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.

摘要:大型語言模型基礎的代理在不熟悉的環境中常常因為過早的利用而失敗:這是一種在獲得足夠的環境特定信息之前,根據先前知識行動的傾向。我們認為自主探索是一種關鍵但尚未充分探討的能力,用於構建自適應代理。為了形式化和量化這一能力,我們引入了探索檢查點覆蓋率,這是一個可驗證的指標,用於衡量代理發現關鍵狀態、物體和可用性的廣度。我們的系統評估顯示,使用標準任務導向強化學習訓練的代理始終表現出狹隘和重複的行為,這妨礙了下游性能。為了解決這一限制,我們開發了一種訓練策略,交錯任務執行的回合和探索的回合,每種類型的回合都通過其相應的可驗證獎勵進行優化。在這一訓練策略的基礎上,我們提出了探索-再行動的範式,該範式將信息收集與任務執行解耦:代理首先利用互動預算獲取基於環境的知識,然後利用這些知識解決任務。我們的結果表明,系統性地學習探索對於構建可泛化和適合現實世界的代理是至關重要的。

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

2605.16117v1 by Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.

摘要:大型語言模型(LLMs)在翻譯、文本生成和問答等多樣化的自然語言處理應用中展示了強大的能力。然而,它們在需要深度推理和邏輯推斷的複雜情境中仍然存在限制。由於這些模型是基於大規模文本語料庫進行訓練的,它們的生成過程可能仍會引入無關、噪音或事實不一致的內容。為了減輕這個問題,我們引入了SGR,一個通過外部子圖生成增強LLM推理的逐步框架。SGR從外部知識庫構建查詢特定的子圖,並利用其語義結構來支持多步推理。通過將中間推理步驟基於結構化的外部知識,該框架幫助模型專注於相關的實體、關係和支持證據。特別是,SGR首先構建一個針對輸入問題量身定制的子圖。然後,它引導模型在生成的結構上逐步推理,並結合多條推理路徑來獲得最終預測。在幾個基準數據集上的實驗結果顯示,SGR在競爭基準上實現了一致的改進,突顯了其在提高推理準確性和事實可靠性方面的價值。

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

2605.16113v1 by Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

摘要:大型語言模型(LLMs)因其卓越的生成能力而取得了前所未有的成功。然而,由於它們依賴於從訓練語料庫中封裝的知識,因此可能會產生幻覺、刻板印象和社會偏見內容。特別是,LLMs 容易對涉及種族、性別和年齡的問題產生偏見回應,這些問題統稱為社會偏見。先前的研究已經使用微調和提示工程來減輕 LLMs 中的這些偏見,但這些方法需要額外的訓練資源或領域知識來設計框架。此外,它們可能會降低 LLMs 的原始能力,並且往往忽視了動態去偏見上下文的需求,以實現更公平的推理。在本文中,我們提出了 DebiasRAG,一種基於檢索增強生成(RAG)的新型無調整和動態查詢特定去偏見框架。DebiasRAG 在保持 LLMs 內在特性的同時提高了公平性,例如表徵能力。DebiasRAG 包含三個階段: (1) 查詢特定去偏見候選生成; (2) 上下文候選池構建;以及 (3) 梯度更新的去偏見引導上下文片段重新排序。首先,DebiasRAG 通過常規檢索利用與查詢相關的自我診斷偏見上下文,其中偏見上下文由 DebiasRAG 提供者離線準備。考慮到查詢特定的偏見上下文,DebiasRAG 反向生成去偏見上下文,這些上下文作為 LLM 輸出的額外公平性約束提供。其次,常規 RAG 檢索過程從常規 RAG 文檔數據庫中生成與查詢相關的上下文,例如分塊的維基百科數據集。

Federated Imputation under Heterogeneous Feature Spaces

2605.16099v1 by Imane Hocine, Chaimaa Medjadji, Sylvain Kubler, Gregoire Danoy, Yves Le Traon

Federated Learning (FL) enables collaborative training across decentralized clients, but most methods assume aligned feature schemas, an assumption that rarely holds in tabular settings where clients observe only partially overlapping feature subsets. In these heterogeneous feature spaces, parameter-averaging methods (e.g., FedAvg) transfer little information across weakly overlapping or disjoint feature groups, limiting their effectiveness for federated imputation. To overcome this, we propose \textbf{FedHF-Impute}, a federated imputation framework that separates structural feature unavailability from conventional missingness and uses a shared global feature graph to propagate information across statistically related features through message passing. This enables indirect cross-client knowledge transfer, even when features are never jointly observed locally, while preserving standard federated communication. Under simulated partial schema overlap on the SECOM and AirQuality datasets, FedHF-Impute improves imputation accuracy (RMSE) over FL baselines by 26.9\%, and 8.4\% respectively, while achieving comparable performance on PhysioNET, with only a 0.3\% difference relative to the best baseline.

摘要:聯邦學習(FL)使得去中心化客戶之間能夠進行協作訓練,但大多數方法假設特徵架構是對齊的,這一假設在客戶僅觀察到部分重疊的特徵子集的表格設置中很少成立。在這些異質特徵空間中,參數平均方法(例如,FedAvg)在弱重疊或不相交的特徵組之間傳遞的信息很少,限制了它們在聯邦插補中的有效性。為了克服這一問題,我們提出了\textbf{FedHF-Impute},這是一個聯邦插補框架,將結構性特徵的不可用性與傳統的缺失性分開,並使用共享的全局特徵圖通過消息傳遞在統計相關的特徵之間傳遞信息。這使得間接的跨客戶知識轉移成為可能,即使在本地從未共同觀察到特徵的情況下,同時保持標準的聯邦通信。在對SECOM和AirQuality數據集進行模擬的部分架構重疊下,FedHF-Impute在插補準確性(RMSE)上分別比FL基線提高了26.9\%和8.4\%,同時在PhysioNET上達到了可比的性能,與最佳基線之間僅有0.3\%的差異。

Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction

2605.16088v1 by Xiayu Liu, Zhengyi Lu, Hou-biao Li

Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.

摘要:自我監督的分子圖預訓練已成為分子性質預測的一種有前景的方法,然而大多數現有方法僅在單一結構粒度下運作,並將鍵結信息視為輔助邊屬性,而非獨立的語義層。在這項工作中,我們提出了MolCHG,一個多層次自我監督的預訓練框架,基於一種新穎的組合層次圖,將分子結構組織為三個語義層次中的四種類型的節點。通過引入一個與原子圖並行運作的鍵結圖,我們的架構將鍵結層級的信息提升為獨立演化的節點表示,使得片段節點能夠在平等的基礎上聚合原子層級和鍵結層級的語義。我們設計了三個層級特定的預訓練目標:一個原子-鍵結交叉視圖對比任務,對齊每個片段內的原子視圖和鍵結視圖表示,一個片段級的官能團預測任務以注入與領域相關的化學知識,以及圖級結構預測任務以編碼全局分子拓撲。在九個MoleculeNet基準上的實驗表明,MolCHG在七個數據集的分類和回歸任務中達到了最佳性能,並在其餘數據集上與最強基線保持競爭力。消融研究進一步確認了多層次監督信號是互補的,且每個組件對整體性能都有貢獻。

Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

2605.16085v1 by Jingcheng Wu, Ratan Bahadur Thapa, Mojtaba Nayyeri, Lucas Etteldorf, Max Finkenbeiner, Fabian Leeske, Steffen Staab

Relational databases store much of the world's structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.

摘要:關聯資料庫儲存了世界上大量的結構化資訊,並且對於驅動複雜的預測應用至關重要。然而,對於關聯數據的深度學習進展仍然有限,因為傳統方法通過手動特徵工程將資料庫展平為單一表格,捨棄了關聯上下文。關聯深度學習(RDL)通過將資料庫建模為關聯實體圖(REGs)以供圖神經網絡(GNNs)使用來解決這個問題,但仍然是任務和資料庫特定的。為了結合這兩種範式的優勢,我們提出了一種混合架構,結合了微調的BART編碼器以捕捉行內語義,並在REGs上使用基於GraphSAGE的GNN來注入關聯上下文。在RelBench上的實驗顯示,GNN顯著豐富了BART的行嵌入,並在rel-f1數據集的driver-dnf任務上達到了67.40的ROC-AUC。這一表現與監督基準如LightGBM(68.86)具有競爭力,並將與RDL(72.62)之間的差距縮小至5.22點,儘管與最先進的基礎模型如KumoRFM(82.63)之間仍然存在顯著差距。這些結果表明,輕量級混合LM-GNN架構為關聯資料庫的基礎模型提供了一條有前景且資源高效的道路。

Who Owns This Agent? Tracing AI Agents Back to Their Owners

2605.16035v1 by Ruben Chocron, Doron Jonathan Ben Chayim, Eyal Lenga, Gilad Gressel, Alina Oprea, Yisroel Mirsky

AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance, yielding a formal asymmetry in the defender's favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment.

摘要:AI代理越來越多地被部署以在世界上自主行動,但仍然沒有可靠的方法可以追溯有害代理到部署它的帳戶。這在意圖光譜的兩端創造了相同的問責差距:良性操作員可能會部署配置錯誤或過於廣泛的代理,無意中造成傷害,而惡意操作員則可能故意將代理武器化,用於詐騙、騷擾或網絡攻擊。在許多情況下,這些代理由供應商托管的模型驅動,即使是進行網絡行動的國家行為者等複雜對手也不例外。在任何情況下,受影響方可以觀察行為,但無法通知負責的操作員、停止會話或識別帳戶以進行調查。我們將這一差距正式化為代理歸屬問題:將觀察到的代理互動鏈接到托管供應商的負責帳戶。據我們所知,這是第一項定義該問題並提出實用解決方案的工作。我們的協議是基於金絲雀的:授權方將金絲雀注入代理的互動流中,供應商搜索狹窄的會話日誌窗口以恢復原始會話和帳戶。在非對抗性環境中,簡單的金絲雀就足夠了。對於過濾或改寫進入內容的對抗性操作員,我們開發了穩健的金絲雀構造,這些構造在不降低代理自身任務性能的情況下無法被壓制,從而在防禦者的利益中產生正式的不對稱性。我們評估了各種場景,包括現實世界的代理,並展示了我們的歸屬方法在供應商端部署中是可靠的、穩健的和可擴展的。

Judge Circuits

2605.16023v1 by Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

摘要:LLM-as-a-judge 已成為大規模評分模型輸出的主流範式,然而當其輸出格式改變時(例如,1-5 評分與真/假標籤),同一模型卻會系統性地分配不同的分數。現有對這些格式引起的不一致性的診斷僅停留在輸入-輸出層面。使用位置感知邊緣歸因修補(PEAP),我們對 Gemma-3、Qwen2.5 和 Llama-3 的內部機制進行了因果調查。我們發現,在結構化理解和開放式偏好任務中的判斷,分享了一個稀疏的、通用的潛在評估子圖,位於中到後期的多層感知器(MLPs)中;將其完全去除會導致判斷崩潰,同時在架構模塊化模型中保留世界知識。通過在結構上將抽象判斷與輸出格式解耦,我們提供了一個關於我們研究的開放權重模型中格式引起不一致性的機制解釋:在共享主幹中計算的連續判斷信號,通過脆弱的、格式特定的終端分支進行映射,使得格式獨立的偏好能夠在請求的輸出格式的下游被孤立。我們的發現暗示,跨格式的基準級可靠性比較部分上是在測量格式化幾何而非評估質量。

Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory

2605.15990v1 by Isar Nejadgholi, Masoud Kianpour, Krishnapriya Vishnubhotla, Maryam Molamohamadi

Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.

摘要:在評估人工智慧系統在不同文化中的包容性和有效性方面,付出了巨大的努力。然而,許多文獻中考慮的文化能力仍然定義模糊,使用可互換的術語來描述,通常僅限於回憶有關各種人口統計、地區和國籍的準確信息。為了解決這一構念模糊性,我們借鑒跨文化交流的學術研究,提出了一個與人工智慧相關的文化能力的三級分類法:文化意識回答「模型知道嗎?」文化敏感性回答「它如何框架其知識?」文化能力回答「隨著互動的演變,它能夠適應嗎?」除了概念澄清,我們將這一分類法定位為改善人工智慧在現實世界多文化環境中評估的有效性和可解釋性的實用工具。如果沒有這樣的構念清晰性,評估結果將有過度誇大模型能力的風險,並可能導致在文化敏感的情境中做出不當的部署決策。

Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

2605.15978v1 by Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives to recover incident details only from the unstructured text and build temporal graphs with time cues and domain axioms. We achieve this by redacting personal identifiers, semantic parsing, predicate mapping to ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the extracted events from the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank--VerbNet--WordNet semantic path. 100% agreement was reached on incident initiation, stolen items, and temporal cues and lower agreement for forced entry interpretation.

摘要:執法報告包含結構化欄位和書面敘述。然而,許多事件事實需要進行審查、警方訓練和調查的資訊都是自然語言,並需要人工閱讀。我們提出了一個框架,使用符號方法將敘述轉換為與證據相關的事實。我們的目標是衡量敘述的價值,以僅從非結構化文本中恢復事件細節,並建立帶有時間提示和領域公理的時間圖。我們通過編輯個人識別信息、語義解析、謂詞映射到本體和推理來實現這一目標。我們在450份財產犯罪報告和一次簡短的人類審查上評估了這種符號方法。從系統中提取的事件中,有54.1%的信心分數至少為0.80,93.7%通過PropBank--VerbNet--WordNet語義路徑進行映射。在事件啟動、被盜物品和時間提示方面達成了100%的一致,而對於強行入侵的解釋則一致性較低。

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

2605.15967v1 by Fabio Rovai

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.

摘要:我們研究事件圖基底:這是一類世界模型,將代理狀態表示為一個僅附加的類型化 RDF 三元組日誌,並通過在結構化干預詞彙下分叉日誌來回答反事實查詢。基底可在三元組層級進行檢查,支持精確的反事實,並在沒有學習組件的情況下跨領域轉移。我們對這一類進行形式化,證明了解釋性查詢和反事實查詢之間的對偶性,將兩者都簡化為相同的因果祖先遍歷,並在全 CLEVRER 驗證規模(n=75,618)上評估一個 1,400 行的 CLEVRER-DSL 解釋器,該解釋器基於一個領域無關的基底運行時。該基底在所有四個每問題類別上超過了 NS-DR 符號神諭(分別為 9.89、20.26、17.65 和 0.80 個百分點),並在描述性和解釋性上超過了參數化 ALOE 基準,但在預測性和反事實上落後。我們還介紹了 twin-EventLog,這是一個 500 個規範的 Park-canonical Smallville 反事實基準,在這個基準上,基底在完整上下文中超過了 Llama-3.1-8B,達到 18.80 分的聯合準確率。

CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts

2605.15888v1 by Peiyuan Li, Yongqi Huang, Jitao Zhao, Dongxiao He, Di Jin, Weixiong Zhang

Heterogeneous Graph Prompt Learning (HGPL)has emerged as a promising paradigm for bridging the gap between the objectives of pre-training foundation models and their downstream applications in heterogeneous graph settings. However, existing HGPL methods are primarily designed for in-domain scenarios, whereas real-world deployments often span multiple domains, and the data used for pre-training and downstream tasks may originate from different distributions. Consequently, the applicability of current HGPL approaches is limited to in-domain settings, and their performance typically degrades when application domains shift. To address this serious limitation, we develop CHoE, a cross-domain HGPL method built upon an expert network. During pre-training, we introduce and train structure-conditioned experts, and during prompt tuning, we adopt a structure-aware expert routing and load balancing mechanism to select structurally compatible experts for each meta-path view. In addition, we design a prompt-based semantic fusion module to integrate representations across multiple views for downstream prediction. Extensive experiments show that CHoE consistently improves performance in few-shot cross-domain applications, outperforming all baseline approaches.

摘要:異質圖提示學習(HGPL)已成為彌合預訓練基礎模型目標與其在異質圖設定中下游應用之間差距的有前景的範式。然而,現有的HGPL方法主要針對域內場景設計,而現實世界的部署往往跨越多個領域,且用於預訓練和下游任務的數據可能來自不同的分佈。因此,目前HGPL方法的適用性僅限於域內設置,當應用領域發生變化時,其性能通常會下降。為了解決這一嚴重限制,我們開發了CHoE,一種基於專家網絡的跨域HGPL方法。在預訓練過程中,我們引入並訓練結構條件專家,而在提示調整過程中,我們採用結構感知的專家路由和負載平衡機制,以選擇與每個元路徑視圖結構相容的專家。此外,我們設計了一個基於提示的語義融合模塊,以整合多個視圖的表示以進行下游預測。大量實驗表明,CHoE在少量跨域應用中持續提高性能,超越了所有基準方法。

Shapley Neuron Values for Continual Learning: Which Neurons Matter Most?

2605.15877v1 by Mohammad Ali Vahedifar, Abhisek Ray, Qi Zhang

Continual learning enables neural networks to learn tasks sequentially without forgetting previously acquired knowledge. However, neural networks suffer from catastrophic forgetting, where learning new tasks degrades performance on earlier ones. We address this problem with Shapley Neuron Valuation (SNV), a principled framework that quantifies Neuron importance in continual learning, grounded in cooperative game theory. SNV selectively freezes important Neurons while keeping others plastic, enabling buffer-free continual learning without expanding architecture. Experiments on ImageNet-1k show that SNV consistently outperforms existing buffer-free methods. In particular, SNV improves accuracy by +2.88% in the class incremental learning and +6.46% in the task incremental learning scenarios compared to the second baseline.

摘要:持續學習使神經網絡能夠按順序學習任務,而不會忘記先前獲得的知識。然而,神經網絡面臨著災難性遺忘的問題,即學習新任務會降低早期任務的性能。我們通過 Shapley Neuron Valuation (SNV) 來解決這個問題,這是一個基於合作博弈論的原則性框架,用於量化持續學習中神經元的重要性。SNV 有選擇性地凍結重要的神經元,同時保持其他神經元的可塑性,實現無緩衝區的持續學習,而不擴展架構。在 ImageNet-1k 上的實驗顯示,SNV 始終優於現有的無緩衝區方法。特別是,與第二基線相比,SNV 在類增量學習中提高了 +2.88% 的準確率,在任務增量學習場景中提高了 +6.46% 的準確率。

BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

2605.15815v1 by Sihan Fu, Oucheng Liu, Shiyuan Wang, Jin Shi, Chengkun Wei

Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial-and-error exploration, yet the resulting knowledge--resolved dependencies, repair strategies--stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi-agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold-start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at https://github.com/Vossera/BootstrapAgent.

摘要:代碼代理越來越多地幫助開發人員處理不熟悉的代碼庫,但每一項任務都依賴於一個昂貴的前提條件:將代碼庫引導到可用的開發狀態。這個過程需要大量的反覆試驗探索,但所產生的知識——解決的依賴關係、修復策略——卻被困在單一的對話中,無法供未來的代理使用。因此,我們將代碼庫引導公式化為一個可重用的啟動知識問題,並引入BootstrapAgent,一個多代理框架,將在引導探索過程中發現的啟發式方法提煉為一個持久的、可驗證的、可供代理使用的.bootstrap合約。通過證據提取、結構化規劃、確定性基於Docker的驗證和基於追蹤的修復,BootstrapAgent生成一個涵蓋環境設置、診斷檢查、最小驗證和累積修復知識的合約。我們進一步提出了乾淨重放的熱修復,以加速迭代調試而不犧牲冷啟動的可重現性,並提出了帶有合理性檢查的增量修復,以防止獎勵黑客行為。在三個基準上的實驗顯示,BootstrapAgent達到了92.9%的成功率,超過基準線10%以上,同時將下游代理的令牌使用量減少了25.9%,構建時間減少了22.3%。我們的代碼可在 https://github.com/Vossera/BootstrapAgent 獲得。

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

2605.15787v1 by Kai Hidajat, Solden Stoll, Joseph An

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

摘要:為什麼一個已經記住其訓練集的Transformer會在進行泛化之前等待數千步?現有的解釋將這一延遲歸因於範數最小化、特徵出現或稀疏子網絡的晚期發現。這些解釋捕捉了轉變的重要部分,但忽略了一個對基於注意力的模型獨特的約束:如果注意力丟棄了一個有信息的標記,則沒有有界的下游計算可以恢復它。我們將注意力形式化為任務依賴圖上的隱式貝葉斯後驗,並證明泛化需要兩個可分的條件:對MLP容量的熟悉的金髮女孩界限,與基於範數的grokking理論相符,以及一個新的貝葉斯結構條件,要求注意力在每個有信息的標記上放置足夠的質量。這種解耦解釋了延遲泛化的原因是延遲的結構推斷。在訓練的早期,MLP通過不對齊的特徵進行記憶,將交叉熵損失推近於零,從而使注意力缺乏結構梯度。權重衰減必須在缺失的圖變得可學習之前侵蝕記憶,產生已知的逆權重衰減延遲,我們將其推導為結構等待時間。我們然後證明,這種解釋延遲可以通過基於KL的結構干預來繞過,從而產生grokking時間的逆干預強度縮放法則。在算法序列任務上的實驗將結構與容量隔離,並顯示這種貝葉斯票證匹配或超越了彩票票證轉移。

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

2605.15777v1 by Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code are available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

摘要:電腦使用代理(CUAs)正在迅速將大型語言模型(LLMs)從基於文本的推理擴展到在更複雜的環境中執行動作,例如網頁瀏覽器和圖形用戶界面(GUIs)。然而,現有的網頁和GUI代理基準通常依賴於簡化的設置、孤立的任務或短期交互,這使得在現實專業工作流程中評估代理的能力變得困難。軟體即服務(SaaS)環境是CUA評估的自然選擇,因為它們承載了現代數字工作的很大一部分,並自然涉及動態系統狀態、跨應用協調、領域特定知識和長期依賴性。為此,我們介紹了SaaS-Bench,這是一個基於六個專業領域中23個可部署SaaS系統構建的基準,包含106個基於現實工作場景的任務。這些任務需要長期執行,涵蓋文本和多模態設置,並通過加權驗證檢查點進行評估,這些檢查點測量嚴格的任務完成和部分進展。實驗顯示,代表性的基於LLM的代理在SaaS-Bench上表現不佳,即使是最強的模型也僅完成不到4%的任務,暴露了在規劃、狀態跟踪、跨應用上下文維護和錯誤恢復方面的局限性。代碼可在 https://github.com/UniPat-AI/SaaS-Bench 獲得以便重現。

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

2605.15733v1 by Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

摘要:人類將經驗抽象為結構化的表徵,以促進模式推斷和知識轉移。雖然海馬-內嗅皮層(HPC-MEC)電路已知能夠表徵空間和概念空間,但同時從連續的高維動態中提取抽象結構的機制仍然不甚了解。我們提出了一種受大腦啟發的階層模型,該模型同時推斷潛在轉變並構建預測視覺世界模型。我們的架構使用逆模型進行結構提取,並結合一個HPC-MEC耦合模型,將關聯結構(MEC)與整合的情節場景(HPC)分離。使用原始轉換動態作為基準,我們展示了該模型的結構抽象能力。通過利用速度驅動的路徑整合,該框架實現了在多樣化上下文中的穩健預測和結構重用,從而達成結構泛化。這項工作提供了一個新穎的計算框架,以理解受大腦啟發的自我監督學習如何促進可重用抽象知識的獲取。

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

2605.15701v1 by Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents'memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolves over time and retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

摘要:記憶數據在基於大型語言模型(LLM)的代理中無處不在(例如,OpenClaw 和 Manus)。一些最近的研究嘗試利用代理的記憶來提高其在問答(QA)任務上的表現,但它們缺乏一種原則性機制來有效建模記憶數據隨時間的演變以及有效檢索記憶數據,導致記憶利用率低下。為了填補這一空白,我們提出了 H-Mem,一種新穎的記憶機制,通過混合結構不僅能有效建模代理記憶在長時間內的演變,還能提供高效的記憶檢索方法。特別是,H-Mem 建立了一個時間和語義樹結構,使短期記憶數據能逐步演變為長期記憶數據,後者提供了有關前者的摘要信息,同時構建了一個知識圖譜以捕捉記憶中實體之間的關係。此外,它通過利用樹和圖結構的混合結構提供了一種有效的記憶檢索方法。在三個代理記憶基準上的廣泛實驗顯示,H-Mem 在 QA 任務上達到了最先進的性能。

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

2605.15687v1 by Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li, Di Shao, Zhaoquan Gu

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8x) on average while effectively preserving model utility, using only a small amount of retained supervision data.

摘要:多模態大型語言模型(MLLMs)在預訓練期間可能會記住敏感的跨模態信息,使得機器遺忘(MU)變得至關重要。現有的方法通常基於輸出偏差來評估遺忘的有效性,而忽略了遺忘後的生成質量。這很容易導致產生幻覺或僵化的回應,從而影響未學習模型的可用性和安全性。為了解決這個問題,我們提出了ASRU,一個可控的多模態遺忘框架,將生成質量作為核心評估目標。ASRU首先通過激活重定向誘導初始拒絕行為,然後使用自定義獎勵函數優化細粒度的拒絕邊界,從而在目標知識遺忘和模型效用之間實現更好的權衡。在Qwen3-VL上的實驗顯示,ASRU顯著提高了遺忘的有效性(平均+24.6%)和生成質量(平均5.8倍),同時有效保留模型效用,僅使用少量保留的監督數據。

TFZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices

2605.15656v1 by Hao Wang, Kuang Zhang, Yonggang Chi, Tianqi Zhao, Yanbo Fu, Jiaxing Guo

Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol-level modulation classification. Research directly targeting physical-layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time-frequency transforms, making deployment on resource-constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of ``waveform identification first.'' To address this dual gap, we propose an ultra-lightweight waveform classification framework based on time-frequency multidimensional features with a cooperative Z-test tree (ZTree). The framework employs low-complexity time-domain feature extraction, and the classification backend adopts a ZTree optimized by Z-statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource-limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB-IoT, the method achieves 99.5\% average accuracy under AWGN and 87.4\% under TDL-C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4~ms. To the best of our knowledge, this is the first work achieving real-time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open-sourced at: https://github.com/Einstein-sworder/IoT-wave.

摘要:在6G物聯網多波形共存的趨勢下,智能接收器必須首先識別物理層波形類型,才能進行正確的解調和資源調度。然而,現有的信號識別研究主要集中在符號級調製分類上。針對物理層波形類型(例如,OFDM、OTFS、LoRa)的研究不僅極其稀缺,而且高度依賴深度神經網絡和複雜的時頻變換,這使得在資源受限的終端上部署變得困難。符號調製分類方法本身無法繞過“首先進行波形識別”的前提。為了解決這一雙重空白,我們提出了一種基於時頻多維特徵的超輕量級波形分類框架,並採用了協作Z檢驗樹(ZTree)。該框架採用低複雜度的時域特徵提取,分類後端則採用經Z統計檢驗優化的ZTree,利用假設檢驗的置信度自動控制決策樹的分裂和大小,確保在資源有限的處理器上高效執行。在包括OFDM、OTFS、DSSS、LoRa和NB-IoT在內的十種6G候選波形上進行測試,該方法在AWGN下達到99.5%的平均準確率,在TDL-C多徑通道下達到87.4%,主要的混淆發生在OTFS和LoRa之間。在x86平台上用C語言實現,單次推理延遲低於4毫秒。據我們所知,這是第一個實現十種物聯網波形類型實時識別的工作。未來的工作將針對嵌入式MCU的部署加速。代碼和數據集已開源,網址為:https://github.com/Einstein-sworder/IoT-wave。

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

2605.15617v1 by Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters -- which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58\% average error in iteration time and less than 0.01\% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1\% of the physical GPUs required by the original deployment.

摘要:大型語言模型(LLM)訓練如今在跨越數千個GPU的集群上運行。雖然這種規模使模型的快速進步成為可能,但開發、調試和性能調整訓練框架不可避免地變得複雜且昂貴。這是因為工程師通常需要重現生產行為來診斷故障或評估優化,因此需要頻繁甚至獨占地訪問生產規模的集群——這在大多數GPU已經承擔生產工作負載的情況下變得越來越困難。模擬依賴於難以維護的複雜性能模型,而縮小規模的實驗往往無法捕捉到與規模相關的行為。
我們提出PrismLLM,以將大規模執行與訪問大型集群的需求解耦,讓工程師能夠僅使用少量GPU運行和觀察感興趣的排名,並在忠實的大規模行為下進行觀察。PrismLLM通過基於切片的方法構建高保真度的執行圖,捕捉目標規模的計算、通信和依賴關係。然後,PrismLLM執行混合模擬,選定的排名執行原始程序,而其餘排名則作為虛擬參與者進行重播。
在大規模LLM訓練工作負載上的實驗顯示,PrismLLM準確重現了性能和內存行為,在迭代時間上僅達到0.58\%的平均誤差,並且在峰值GPU內存使用上低於0.01\%的誤差。PrismLLM可以使用少於1\%的原始部署所需的物理GPU來模擬高達8192個GPU的集群。

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

2605.15589v1 by Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.

摘要:大型語言模型(LLMs)在心理健康領域的應用日益增多,但尚不清楚它們在多大程度上捕捉相關的生物醫學知識,以及它們在臨床重要的結構性判斷中應用這些知識的可靠性。在此,我們提出了一個基於知識圖譜(KG)的基準,用於評估LLMs在心理健康實體識別、關係判斷和兩步推理方面的表現。這個基準源自PrimeKG,包含九個任務類別,並提供KG支持的答案及控制的負面選項。對15個封閉源和開放源LLMs的實驗顯示出持續的識別與判斷之間的差距:領先模型在實體類型識別和小型關係類型子集上達到了接近上限的表現,但在關係預測和兩步推理方面仍然存在困難。此外,短小的KG衍生片段對某些模型有益,但對其他模型則降低了性能。此外,在受限的多選設置下,輸出格式的可靠性可以顯著影響測量的性能,突顯了回應有效性在基準評估中的關鍵作用。因此,MHGraphBench應被解讀為評估與經過策劃的PrimeKG心理健康片段的一致性,而不是對現實世界臨床安全的直接評估。

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

2605.15573v1 by Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.

摘要:多智能體系統可以透過多個大型語言模型代理之間的合作來解決複雜任務。現有的合作框架通常以平行或串行模式運作。在平行模式中,代理獨立地對查詢作出回應,然後將回應進行聚合。相對地,串行系統允許代理透過有向拓撲進行通信,並逐步互相完善。然而,這兩種模式都不足以實現最小化通信和延遲,同時最大化最終回應準確性的預期目標。在本研究中,我們介紹了一種稱為Nexa的混合範式,這是一種可訓練的回應條件政策,彌合了這兩種模式之間的差距。Nexa從平行執行階段開始,將結果回應嵌入共享語義空間,然後預測一個稀疏的有向無環通信圖。如果圖是空的,系統將保持純平行;如果圖非空,系統將執行一次串行消息傳播。該政策是一個輕量級的Transformer模型,該方法避免了對外部LLM評審或獎勵模型的需求,以及手工製作的測試時拓撲搜索。我們形式化了這一混合執行問題,顯示所得到的圖是無環的,並且該框架嚴格地包含了純平行執行,並提出了一種基於政策梯度優化的訓練程序。結果顯示,Nexa在一種設置下學到的回應條件政策可以在代理數量、任務或底層代理變更時重複使用,從而強調了學習到的通信政策的通用性。

GiLT: Augmenting Transformer Language Models with Dependency Graphs

2605.15562v1 by Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu

Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.

摘要:增強Transformer模型與語言結構的結合有效提升了語言模型的句法泛化性能。之前在這方面的研究集中於語言的句法樹結構,特別是成分樹結構。我們提出了圖形注入層Transformer語言模型(GiLT),它利用依賴圖來增強Transformer語言模型。與大多數之前的工作不同,GiLT並不在語言建模中插入額外的結構標記;相反,它通過調節Transformer中的注意力權重,將從隨著標記預測逐步構建的依賴圖中提取的特徵注入語言建模。在我們的實驗中,使用語義依賴圖的GiLT在保持與Transformer語言模型基準相當的困惑度的同時,實現了更好的句法泛化。此外,GiLT可以從預訓練的語言模型進行微調,以提高下游任務的性能。我們的代碼已發布於 https://github.com/cookie-pie-oops/GiLT-LM。

Neural Point-Forms

2605.15524v1 by Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs

Point cloud learning often rests on the premise that observed samples are noisy traces of an underlying geometric object, such as a manifold embedded in a high-dimensional feature space. Yet much of this geometry is not captured directly by coordinates, pairwise distances, or learned graph neighborhoods alone. In the smooth setting, differential forms are devices to encode higher order tangency information. In this work, we introduce a new family of principled learnable geometric features for point clouds called neural point-forms (NPFs). In the absence of a natural tangency structure, we instead use Laplacian-based techniques from Diffusion Geometry to build a discrete model for comparing differential forms on point clouds via inner products. In the continuum, submanifolds of a shared ambient feature space are represented as comparison matrices, whose entries describe how pairs of feature forms interact with extrinsic tangency information. We make this intuition precise by proving the long-run consistency of comparison matrices under standard sampling, bandwidth, density, and manifold-hypothesis assumptions. This yields a compact, efficient and permutation-invariant neural layer whose output is a learned form-comparison matrix. Across synthetic and biologically relevant experiments, we show that NPFs provide a competitive, and interpretable representation, with the strongest benefits appearing when labels depend on sampling density, manifold-like structure, or response-relevant population geometry.

摘要:點雲學習通常基於這樣的前提:觀察到的樣本是潛在幾何物體的噪聲痕跡,例如嵌入在高維特徵空間中的流形。然而,這種幾何結構並不是僅通過坐標、成對距離或學習的圖鄰域直接捕捉到的。在平滑的情況下,微分形式是編碼高階切觸信息的工具。在這項工作中,我們引入了一種新的原則性可學習幾何特徵的家族,稱為神經點形式(NPFs)。在缺乏自然切觸結構的情況下,我們使用基於拉普拉斯的擴散幾何技術來構建一個離散模型,以通過內積比較點雲上的微分形式。在連續情況下,共享環境特徵空間的子流形被表示為比較矩陣,其條目描述了特徵形式對如何與外部切觸信息互動。我們通過證明在標準抽樣、帶寬、密度和流形假設下比較矩陣的長期一致性,使這一直覺變得精確。這產生了一個緊湊、高效且對置換不變的神經層,其輸出是一個學習的形式比較矩陣。在合成和生物相關的實驗中,我們顯示NPFs提供了一種具有競爭力和可解釋性的表示,當標籤依賴於抽樣密度、類流形結構或響應相關的人口幾何時,最強的好處出現。

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention

2605.15505v1 by Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened [2, 52]. The prevailing approach [17, 31, 34, 36] retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual [5, 57, 61], present in behavioral patterns, absent from any retrieval index. For complex agentic tasks it breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in human attention, the digitally observable interaction signatures of each worker, encoding not just what they did but the sequence in which they did it, along with implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven qualitatively distinct attention filters: Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. On a sales lead identification task, a frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and human attention is its most reliable ground truth.

摘要:在企業運營中,AI代理任務所需的上下文散布在記錄系統、靜態信息存儲和通信渠道中。所儲存的是系統狀態,這是一種損失性表示,反映了實際發生的工作[2, 52]。目前的做法[17, 31, 34, 36]是通過將請求內容與所儲存的內容進行匹配來檢索;對於狹窄的請求,這種方法運行良好。但合成質量取決於了解應該呈現什麼以及如何解釋它:這些知識是特定於每個組織、團隊和個體的[5, 57, 61],存在於行為模式中,而在任何檢索索引中都不存在。對於複雜的代理任務,這種方法失效:真線索率低,假線索率高,模型沒有改進的機制。我們提出了X-SYNTH,這是一個基於人類注意力的企業上下文合成框架,能夠數位觀察每位工作者的互動特徵,不僅編碼他們所做的事情,還編碼他們所做事情的順序,以及隱含的獎勵信號。正面結果之前的行為痕跡與未達成的痕跡是可區分的,無需外部標籤。X-SYNTH將每個個體的行為基線建模為數位雙胞胎簽名(DTS),並在七種質量上不同的注意力過濾器中進行選擇:按比例、反向、差異、重複、比較、序列和集體,針對每個個體和每個查詢,以識別因果相關的活動特徵。一個四階段的管道組裝基於行為模式而非查詢嵌入的排名上下文。在銷售線索識別任務中,一個未經輔助的前沿模型實現了9.5%的真線索率(TLR)和90.5%的假線索率(FLR)。經過X-SYNTH增強後,TLR上升至61.9%(6.5倍),而FLR下降至18.8%。企業上下文合成不是一個檢索問題。這是一個相關性問題,而人類注意力是其最可靠的真實基準。

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

2605.15482v1 by Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.

摘要:大型語言模型(LLMs)越來越多地應用於財務分析、報告、投資決策支持、風險管理、合規性和專業培訓。然而,對其在財務領域能力的穩健評估仍然不完整。廣泛使用的開放基準,如FinQA、ConvFinQA和TAT-QA,在推進財務問答和數值推理方面發揮了重要作用,但它們主要集中於財務報告的問答,並未提供明確的專業難度層級。更廣泛的資源,包括FinanceBench、PIXIU、FinBen和FLaME,擴展了財務任務的覆蓋範圍,但評估從基礎知識到專家級財務推理的過渡問題仍然未解決。在這項工作中,我們提出了FINESSE-Bench,一套由8個專門基準組成的工具,包括3,993個問題,用於對LLMs的財務能力進行層級評估。FINESSE-Bench結合了受專業認證啟發的考試導向數據集(類似CFA的1-3級、類似CMT的2級和類似CFTe的1級)、應用交易任務集合以及俄語奧林匹克基準。這一設計使得能夠評估領域廣度、隨著難度增加而導致的性能下降、解決計算任務的能力,以及模型在專業財務領域的行為。我們還描述了一個統一的評估協議,涵蓋選擇題、數值答案和簡短的開放式回答,並基於LLM作為評判者的範式提供自動評分方案。FINESSE-Bench旨在作為現有開放財務基準的補充,以及用於更實質性地評估大型語言模型中與專業相關的財務能力的工具。

Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

2605.15400v1 by Wei Sheng, Rohan Paleja

While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.

摘要:雖然 AI 代理正在迅速從孤立的工具轉變為互動的合作夥伴,但基於數據的人機協作 (HMT) 方法仍然因依賴於跨領域、隊友和團隊規模的人類互動數據而成本高昂。零樣本協調 (ZSC) 通過模擬多樣的夥伴群體來解決這一瓶頸,以近似未見夥伴的行為。然而,僅僅依賴夥伴覆蓋是不夠的,隨著團隊設置的擴大和溝通的惡化,這一問題變得更加明顯。為了解決這一不足,我們提出了基於影響的團隊引導 (IBTS),這是一個利用影響塑造來激勵代理發現多樣化、高效能團隊互動模式的框架,並進一步引導持續的軌跡朝向更強的學習協調模式。我們在 Overcooked-AI 上評估 IBTS,涵蓋了兩代理和三代理的設置,讓我們能夠測試學習的協調結構是否能超越二人互動。我們的評估包括模擬夥伴、合成夥伴風格變化,以及據我們所知,首次進行的 30 名參與者的 Overcooked-AI HMT 研究,涉及兩名真實人類隊友和一名機器隊友。在這些評估中,IBTS 提升了團隊表現,超越了競爭基準,突顯了擴展 ZSC 的必要性,以結合稀疏獎勵的協調機制與夥伴變化覆蓋,而不是僅僅依賴多樣性。

PACER: Acyclic Causal Discovery from Large-Scale Interventional Data

2605.15353v1 by Ramon Viñas Torné, Sílvia Fàbregas Salazar, Soyon Park, Ivo Alexander Ban, Artyom Gadetsky, Nikita Doikov, Maria Brbić

Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high-dimensional settings where large-scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.

摘要:從數據中推斷有向無環圖(DAG)的結構是因果發現中的一個核心挑戰,特別是在現代高維環境中,大規模的干預數據越來越可用。雖然干預數據可以提高可識別性,但現有方法仍然受到軟性無環約束的限制,導致在無效的循環圖上進行優化、數值不穩定以及可擴展性降低。我們介紹了PACER(擾動驅動的無環因果邊緣恢復),這是一個可擴展的因果發現框架,通過構造保證無環性。PACER通過變量排列和邊緣概率的聯合模型對DAG進行分佈參數化,使得可以在有效的因果結構上直接進行優化,而無需替代懲罰。該框架支持對觀察數據和干預數據的統一基於似然的處理、靈活的條件密度模型以及結構先驗知識的納入。對於線性高斯機制,我們推導了期望干預對數似然及其梯度的封閉形式表達式,從而帶來了可觀的計算增益。在實證上,PACER在蛋白質信號傳導和大規模基因擾動基準測試中與最先進的方法相匹配或超越,同時有效擴展到擁有數千個變量的網絡,並在懲罰基的可微分方法上實現了高達兩個數量級的加速。這些結果表明,通過原則性的搜索空間設計,從高維擾動數據中實現精確且可擴展的因果發現是可行的。

Zero-Shot Goal Recognition with Large Language Models

2605.15333v1 by Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren, Felipe Meneguzzi

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.

摘要:大型語言模型最近在知名的規劃領域達到了與傳統規劃器的近似平衡,然而這種能力依賴於世界知識的利用,而非真正的符號推理。目標識別是一個互補的推理任務,結構上更適合大型語言模型的優勢:它的核心在於評估與世界知識的一致性,而不是生成新的行動序列。本文提供了對前沿大型語言模型作為目標識別器在關鍵經典PDDL基準上的首次系統性零樣本評估。我們的結果顯示,大型語言模型在目標識別上的能力是不均衡的:一些模型隨著證據的增加而提升,並在完整觀察下接近基於地標的準確性,而另一些模型則無論證據累積多少,仍然固守於世界知識的先驗。對模型推理過程的定性分析顯示,這種分歧反映了證據整合的根本差異,而非領域熟悉度。這些發現將目標識別定位為大型語言模型基礎規劃知識的一個原則性基準。

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

2605.15301v1 by Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.

摘要:大型語言模型(LLMs)在艱難的競賽編程中仍然面臨嚴格的推理需求挑戰。雖然最近的多代理框架試圖彌補這一可靠性差距,但它們仍然在根本上是無狀態的:它們依賴靜態檢索,並捨棄了從先前任務中獲得的寶貴問題解決和除錯經驗。為了解決這個問題,我們提出了Solvita,一個代理進化框架,能夠在不需要對底層LLM進行權重更新的情況下實現持續學習。Solvita將問題解決重新組織為一個封閉循環系統,包括策略選擇、程序合成、認證監督和針對性破解,這些由四個專門的代理執行:計劃者、解決者、神諭者和駭客。至關重要的是,每個代理都與一個可訓練的圖結構知識網絡配對。系統運行時,結果信號,如通過/失敗的判決、測試認證質量以及駭客發現的對抗性漏洞,被重新轉化為對這些網絡權重的強化學習更新。這使得代理能夠根據過去的成功和失敗動態路由未來的查詢,有效地隨著時間的推移積累可轉移的推理經驗。在CodeContests、APPS、AetherCode和現場Codeforces回合的評估中,Solvita在代碼生成代理中建立了新的最先進水平,超越了現有的多代理管道,並幾乎將單次通過基準的準確性翻倍。

GQA-μP: The maximal parameterization update for grouped query attention

2605.15290v1 by Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma, Daria Soboleva, Joel Hestness, Zhengzhong Liu, Eric Xing

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.

摘要:超參數在模型架構之間的轉移顯著減少了調整大型語言模型(LLMs)所需的計算量。最大更新參數化(μP)通過原則性的數學分析確保轉移,但對於新的模型架構來說,推導可能具有挑戰性。在Yang等人(2023a)的光譜特徵學習觀點的基礎上,我們做出了兩項進展。首先,我們將權重的光譜範數條件從啟發式提升到特徵學習的定義,因此在不依賴於懶學習的情況下達到了完整的深度和權重衰減縮放。其次,我們考慮了一種修改過的光譜範數,當權重矩陣不是滿秩時,該範數保持網絡權重的有效縮放法則。這使得(據我們所知,首次)推導了分組查詢注意力(GQA)的μP縮放。我們通過展示在GQA重複超參數之間的學習率轉移以及關於權重衰減的轉移實驗,證明了我們理論推導的有效性。

Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

2605.15281v1 by Vinil Pasupuleti, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - "try accessing another user's invoice" - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.

摘要:現代的網頁測試套件會腐爛。UI 重構會破壞定位器,時間變更會導致競爭條件,幾週內開發人員完全放棄該套件。本文提出了一個以 AI 驅動的自主測試框架,通過五種集成策略解決這些失敗模式——導航可靠性、上下文感知的選擇器生成、生成後驗證、智能等待注入和失敗學習——在一個容器化的工作架構上實施,將編排與長時間運行的瀏覽器執行解耦。該框架在四個生產應用和 176 個場景中進行評估,將腳本生成成功率從 55% 提高到 93%,實現了導航失敗的 8 倍減少,消除了 80% 的時間相關競爭條件,並將測試創建時間相比於手動 Selenium 編寫減少了 75%。該框架自然擴展到安全驗證:測試人員用簡單的英語描述攻擊場景——“嘗試訪問另一個用戶的發票”——該代理將其轉換為與 OWASP Top 10 對齊的瀏覽器探針,檢測到 85% 的身份驗證繞過漏洞和 95% 的輸入驗證缺陷,假陽性率低於 12%。這種自然語言驅動的安全測試在我們的知識中,代表了對該領域的一項新貢獻。

FutureSim: Replaying World Events to Evaluate Adaptive Agents

2605.15188v1 by Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.

摘要:AI 代理人越來越多地被部署在需要隨著新信息到來而適應的動態、開放式環境中。為了有效地測量這種能力以應對現實案例,我們提出建立基於實際事件的模擬,按照事件發生的順序重播。我們建立了 FutureSim,讓代理人預測超出其知識截止日期的世界事件,同時與世界的時間順序重播互動:在模擬期間內,真實新聞文章不斷到達,問題逐漸解決。我們在其本地環境中評估前沿代理人,測試他們在 2026 年 1 月到 3 月的三個月期間預測世界事件的能力。FutureSim 顯示出它們能力的明顯差異,最佳代理人的準確率為 25%,而許多代理人的 Brier 技能分數甚至比不做預測還要差。通過仔細的消融實驗,我們展示了 FutureSim 如何提供一個現實的環境來研究新興的研究方向,如長期測試時間適應、搜索、記憶和對不確定性的推理。總體而言,我們希望我們的基準設計為測量 AI 在現實世界中跨越長時間範圍的開放式適應進展鋪平道路。

Evidential Reasoning Advances Interpretable Real-World Disease Screening

2605.15171v1 by Chenyu Lian, Hong-Yu Zhou, Jing Qin

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.

摘要:疾病篩檢對於臨床實踐中的早期檢測和及時干預至關重要。然而,目前大多數醫學影像的篩檢模型在可解釋性和性能上都存在限制。它們通常缺乏有效的機制來參考歷史案例或提供透明的推理途徑。為了解決這些挑戰,我們提出了EviScreen,一個利用歷史案例區域級證據的疾病篩檢證據推理框架。所提出的EviScreen通過從雙重知識庫檢索的區域證據提供了回顧性可解釋性。利用這一證據機制,隨後的證據感知推理模塊使用當前案例和來自歷史案例的證據進行預測,從而提高疾病篩檢的性能。此外,EviScreen通過利用從對比檢索中獲得的異常圖來增強定位可解釋性,而不是依賴事後的顯著性圖。我們的方法在我們精心建立的現實世界疾病篩檢基準上實現了優越的性能,在臨床級召回率下產生了顯著更高的特異性。代碼可在https://github.com/DopamineLcy/EviScreen公開獲得。

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

2605.15168v1 by Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

摘要:重建精確的臨床時間線對於建模病人軌跡和預測像敗血症這樣複雜且異質的病症風險至關重要。雖然非結構化的臨床敘述提供了語義豐富且上下文完整的病程描述,但它們往往缺乏時間精確性,並且包含模糊的事件時間。相反,結構化的電子健康紀錄(EHR)數據提供了精確的時間錨點,但卻錯過了大量臨床上有意義的事件。我們提出了一種檢索增強的多模態對齊框架,旨在彌補這一差距,以提高從文本中提取的絕對臨床時間線的時間精確性。我們的方法將時間線重建公式化為基於圖的多步驟過程:首先從敘述中提取中心錨事件以建立初始時間框架,然後相對於這一骨幹放置非中心事件,最後使用檢索到的結構化EHR行作為外部時間證據來校準時間線。通過在涵蓋MIMIC-III和MIMIC-IV的i2m4基準上使用經過指導調整的大型語言模型進行評估,我們的多模態管道在絕對時間戳準確性(AULTC)上始終有所改善,並且在幾乎所有評估模型中提高了時間一致性,相較於單模態僅文本重建,且不妥協事件匹配率。此外,我們的實證差距分析顯示,34.8%的文本衍生事件在表格記錄中完全缺失,這表明對齊這些模態可以比單一來源產生更具時間忠實性和臨床信息性的病人軌跡重建。

MeMo: Memory as a Model

2605.15156v1 by Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

摘要:大型語言模型(LLMs)在各種任務中表現出色,但在預訓練後保持不變,直到隨後的更新。許多現實世界的應用需要及時的、特定領域的信息,這促使了高效機制的需求,以納入新知識。在本文中,我們介紹了 MeMo(記憶作為模型),這是一個模塊化框架,能夠將新知識編碼到專用的記憶模型中,同時保持 LLM 參數不變。與現有方法相比,MeMo 提供了幾個優勢:(a)它捕捉複雜的跨文檔關係,(b)它對檢索噪聲具有魯棒性,(c)它避免了 LLM 的災難性遺忘,(d)它不需要訪問 LLM 的權重或輸出 logits,使其能夠與開放和專有的封閉源 LLM 進行即插即用的集成,以及(e)其檢索成本在推理時與語料庫大小無關。我們在三個基準測試(BrowseComp-Plus、NarrativeQA 和 MuSiQue)上的實驗結果顯示,MeMo 在多種設置中相較於現有方法達到了強勁的性能。

Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG

2605.15109v1 by Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz

Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

摘要:檢索增強生成可以通過將答案基於外部證據來提高事實性,但 Agentic GraphRAG 使得引用的忠實性變得複雜。在這些系統中,代理在產生答案和一小組引用之前會探索知識圖譜。我們將引用忠實性框架定義為一個軌跡級別的問題:最終的引用不僅應該支持答案,還應考慮圖遍歷、結構以及可能影響答案的已訪問但未引用的實體。通過控制的消融實驗,我們比較了隔離、移除和掩蔽已引用和未引用圖實體的效果。我們的結果顯示,引用的證據通常是必要的,因為移除它會顯著改變答案並降低準確性。然而,引用並不足夠,因為準確的答案也可能依賴於未引用的遍歷上下文和周圍的圖結構。這些發現表明,在 Agentic GraphRAG 中的引用評估應該超越來源支持,轉向對更廣泛檢索軌跡的來源追溯。

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

2605.15053v2 by Anurup Ganguli

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

摘要:持續在異質文本領域上對大型語言模型進行預訓練,且不使用重播或任務標籤,仍然是一個在LLM規模下未解決的架構問題。現有的方法依賴於重播緩衝區、任務識別符、規範化懲罰(這些在擴展時表現不佳)或句子分類規模的評估。我們介紹了TFGN,這是一種為Transformer語言模型提供的架構覆蓋,能夠在不改變Transformer其餘部分的情況下,生成輸入條件的、參數高效的更新。在六個異質文本領域(散文、Python、數學、生物醫學、中文、JavaScript)中,每個階段有10億個標記,涵蓋三種模型規模(約398M、約739M、約9B)和兩種模式(從頭開始和改造),TFGN在LLaMA 3.1 8B改造中實現了-0.007的向後轉移,HellaSwag保留率為0.506/0.504/0.510,以及在域對之間的>=99.59% L2正交梯度分離——無重播、無任務ID、無Fisher懲罰。同樣的矩陣顯示出正向的跨域轉移:在LLaMA-8B改造中,保留的JavaScript PPL下降了26.8%,在GPT-2 Medium從頭開始中純粹來自Python訓練下降了62.0%。基於相同基材的兩個擴展進一步解決了開放問題。一個閉環元控制層(擴展A)在約398M時減少了81%的遺忘,映射到Dupoux等人(arXiv:2603.15381)的系統A和系統M角色。一個操作級計劃向量(擴展B)在30個源->目標對上以99.96%的餘弦保真度重塑了前向傳遞行為。架構見解是讀/寫分解:前向傳遞是完全密集的,而跨域參數更新的結構使得先前域的子空間不會被寫入。據我們所知,TFGN是第一個在LLM規模下同時解決災難性遺忘、實現閉環自主學習元控制器並攜帶操作級潛在規劃者的架構。

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

2605.15041v1 by Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

摘要:工具使用將大型語言模型擴展到超越參數知識,但可靠的執行需要在適當的推理深度與嚴格的結構有效性之間取得平衡。我們從案例導向的角度來解決這個問題,提出了CAST,一個以案例為驅動的框架,將歷史執行軌跡視為結構化案例。CAST不是重用原始範例輸出,而是提取案例衍生的信號,以識別複雜性輪廓,從而估計最佳推理策略,並通過失敗輪廓來映射可能的結構故障。該框架將這些知識轉化為細緻的獎勵設計和自適應推理,使模型能夠在強化學習過程中自主內化基於案例的策略。在BFCLv2和ToolBench上的實驗表明,CAST改善了結構忠實執行和任務層級工具使用的成功率,同時減少了不必要的深思。該方法在整體執行準確性上可達到高達5.85個百分點的增益,並將平均推理長度減少26%,顯著減輕了高影響的結構錯誤。最終,這表明歷史執行案例如何提供可重用的適應知識,以便進行經過校準的工具使用。

Generalized Priority-Aware Shapley Value

2605.15018v1 by Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV's. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.

摘要:Shapley 值及其優先級感知擴展在機器學習中的估值中被廣泛使用,但現有方法要求成對優先級為二元且無環,這在實際數據示例中,如聚合的人類偏好和多標準比較,顯著違反了這一限制。我們引入了廣義優先級感知 Shapley 值 (GPASV),這是一種定義在任意有向加權優先級圖上的隨機順序值,其中成對邊緣懲罰而不是禁止順序違規。GPASV 覆蓋了一系列作為邊界情況的經典模型。我們通過公理化特徵建立 GPASV,開發相關的計算方法,並引入擴展 PASV 的優先級掃描診斷。我們將 GPASV 應用於循環 Chatbot Arena 偏好圖上的 LLM 集成估值,說明優先級感知估值並不是一鍵操作:成對圖優先級與個別軟優先級的不同平衡會對相同數據產生實質上不同的估值。

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

2605.15016v1 by Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

摘要:隨著大型語言模型在醫療保健領域的應用,智能臨床決策支持迅速發展。長期電子健康紀錄(EHR)提供準確臨床診斷和分析所需的關鍵時間證據。然而,當前的大型語言模型在長期EHR推理方面存在重大缺陷。首先,由於缺乏細緻的統計推理,當定量證據以文本形式隱含時,它們經常幻想出臨床趨勢和指標,從而偏見診斷推斷。其次,長期EHR中的非均勻時間序列和稀缺標籤阻礙了模型捕捉長期時間依賴性,限制了可靠的臨床推理。為了解決上述限制,本研究提出了概率性思維鏈完成代理(COTCAgent),這是一個針對長期電子健康紀錄的分層推理框架。它由三個核心模塊組成。時間統計適配器(TSA)將分析計劃轉換為可執行代碼,以標準化趨勢輸出。思維鏈完成(COTC)層利用帶權重評分的症狀-趨勢-疾病知識庫來評估疾病風險,而有界完成模塊通過標準化詢問和迭代評分約束獲取結構化證據,以確保嚴謹的推理。通過解耦統計計算、特徵匹配和語言生成,該框架消除了對複雜多模態輸入的依賴,並能以較低的計算開銷實現高效的長期紀錄分析。實驗結果顯示,基於Baichuan-M2的COTCAgent在自建數據集上達到90.47%的Top-1準確率,在HealthBench上達到70.41%,超越了現有的醫療代理和主流大型語言模型。代碼可在https://github.com/FrankDengAI/COTCAgent/獲得。

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

2605.15011v1 by Peter A. Jansen

Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.

摘要:科學貢獻很少在孤立中發展,而是建立在先前的發現之上。我們將自動化技術路線圖的任務定義為從學術文章中提取科學貢獻並將其與其前提相連結。我們提出了科學貢獻圖,這是一個大型的 AI/NLP 領域資源,包含從 230,000 篇開放獲取論文中提取的 200 萬個詳細科學貢獻,並由 1,250 萬個前提邊相連。我們進一步介紹了科學前提預測,這是一項科學發現任務,其中模型預測哪些現有技術可以促進未來的發現,並顯示當代模型在這項任務上的表現正在迅速改善,在使用時間過濾的回測評估時達到 0.48 MAP。我們預期像這樣的技術路線圖資源將支持科學影響評估和自動化科學發現。

KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

2605.14907v1 by Yisen Gao, Jiaxin Bai, Haoyu Huang, Zhongwei Xie, Yufei Li, Hong Ting Tsang, Sirui Han, Yangqiu Song

Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning, the other pillar of foundation models remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using Prior-data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-Data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at https://github.com/HKUST-KnowComp/KGPFN.

摘要:知識圖譜(KG)基礎模型旨在通過學習可轉移的關係結構,實現對未見實體和關係的圖形的泛化。然而,大多數現有方法主要強調關係層面的普遍性,而在上下文學習這一基礎模型的另一支柱方面,對於KG推理的探索仍然不足。在KG中,上下文本質上是結構化且異質的:有效的預測需要根據查詢實體周圍的局部上下文以及總結關係在多個實例中如何表現的全局上下文進行條件化。我們提出了KGPFN,一種使用Prior-data Fitted Network的KG基礎模型,將可轉移的關係規律與來自結構化上下文的推理時上下文學習統一。KGPFN首先通過在關係圖上進行消息傳遞來學習關係表示,以捕捉跨圖的關係不變性。對於查詢特定的推理,它使用多層NBFNet編碼局部鄰域作為局部上下文。為了在全局範圍內啟用ICL,它通過檢索查詢關係的大量實例及其局部鄰域來構建關係特定的全局上下文,並在Prior-Data Fitted Network框架內聚合這些實例,該框架結合了特徵層級和樣本層級的注意力。通過在多樣化的KG上進行多圖預訓練,KGPFN學會了何時實例化可重用模式以及何時使用上下文證據來覆蓋它們。在57個KG基準上的實驗表明,KGPFN僅通過上下文學習就能強力適應先前未見的圖形,並且始終超越競爭性的微調KG基礎模型。我們的代碼可在 https://github.com/HKUST-KnowComp/KGPFN 獲得。

COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs

2605.14900v1 by Sohel Aman Khan, Raghava Mutharaju, Supratim Shit

Knowledge Graphs (KGs) are extensively used across different domains and in several applications. Often, these KGs are very large in size. Such KGs become unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, the personalized summaries of KG are small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt the coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user's query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than the state-of-the-art methods, such as GLIMPSE, PPR, iSummary, PEGASUS and APEX$^2$ while requiring only a tiny fraction of the original graph.

摘要:知識圖譜(KGs)在不同領域和多種應用中被廣泛使用。這些KGs的大小通常非常龐大。這樣的KGs在進行問題回答和可視化等任務時變得笨重。KGs的摘要在這種情況下提供了一種可行的替代方案。此外,個性化的KG摘要在當前以數據為驅動的世界中至關重要,因為它根據用戶的查詢模式捕捉了特定需求。由於它僅保留相關信息,KG的個性化摘要較小,從而顯著減少存儲需求和查詢運行時間。在這項工作中,我們調整了核心集理論以創建個性化的KG摘要。對於給定的數據集和用戶特定的查詢工作負載,我們提出了一種方法,使用基於敏感度的重要性抽樣來抽取相關的三元組子集。我們確保該子集在有界近似誤差的情況下近似完整數據集的特徵。我們定義了敏感度分數,這些分數衡量三元組相對於用戶查詢工作負載的重要性,然後由我們的核心集構建算法使用。我們明確專注於個性化知識圖譜摘要,根據每個用戶的查詢行為獨立構建摘要。我們在Freebase、WikiData和DBpedia上的評估顯示,COREKG在查詢回答準確性和結構覆蓋率方面優於最先進的方法,如GLIMPSE、PPR、iSummary、PEGASUS和APEX$^2$,而只需原始圖的一小部分。

BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

2605.14886v1 by Zixuan Shu, Tiancheng Cao, Hen-Wei Huang

Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.

摘要:心電圖 (ECG) 監測在醫療物聯網 (IoMT) 網絡中受到嚴格的數據共享規範和隱私問題的限制。聯邦學習 (FL) 通過將原始 ECG 數據保留在設備上來實現協作學習,但高維模型更新的頻繁傳輸會在帶寬有限的鏈路上產生巨大的每輪流量。為了緩解這一瓶頸,聯邦蒸餾 (FD) 用基於邏輯的知識轉移取代了參數交換。然而,在 ECG 部署中,FD 的性能在非獨立同分佈 (non-IID) 和長尾標籤分佈下往往會下降。為了解決這些挑戰,我們提出了一種雙向聯邦知識蒸餾 (BiFedKD) 框架,該框架採用帶有溫度縮放的蒸餾聚合管道,以產生穩定的全局蒸餾信號以進行跨客戶對齊。在 MIT-BIH 心律不齊數據集上的實驗顯示,BiFedKD 分別將準確率和 Macro-F1 提高了 $3.52\%$ 和 $9.93\%$。此外,為了達到相同的 Macro-F1,與基線相比,BiFedKD 將通信開銷減少了 $40\%$,計算成本減少了 $71.7\%$。

REALM: Retrospective Encoder Alignment for LFP Modeling

2605.14867v1 by Peicheng Wu, Zhenyu Bu, Runze Ma, Lin Du

Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.

摘要:尖峰活動因其高空間和時間解析度而成為行為解碼的主要神經信號。然而,隨著腦-電腦介面(BCIs)朝著高通道數和無線操作發展,尖峰信號的高取樣頻率成為瓶頸,因為它需要高功率和帶寬。與尖峰相比,本地場電位(LFP)代表了不同的空間-時間尺度的腦活動,提供了包括改善長期穩定性、降低能耗和較低帶寬需求等關鍵優勢。儘管有這些好處,基於LFP的解碼模型通常顯示出準確性降低,並且往往依賴於不適合實時部署的非因果架構。為了解決這些挑戰,我們提出了REALM:一個使因果LFP解碼成為可能的回顧性蒸餾框架。受到語音識別中離線到在線蒸餾策略的啟發,REALM將從預訓練的多會話雙向LFP模型中轉移表徵知識到因果版本,以便實時部署。我們首先使用遮罩自編碼目標預訓練一個雙向Mamba-2教師模型。然後,我們通過表徵對齊和任務監督的結合目標,將這個教師模型蒸餾成一個緊湊的學生模型。REALM在行為解碼方面始終超越了基於LFP的因果和非因果SOTA方法。值得注意的是,我們的REALM在提高解碼性能的同時,實現了參數數量減少$2\times$和訓練時間減少$10\times$。這些結果表明,回顧性蒸餾有效地彌合了離線和實時神經解碼之間的差距。REALM顯示僅基於LFP的模型可以在不依賴尖峰信號的情況下實現競爭性的解碼性能,為下一代無線植入式BCIs提供了一種實用且可擴展的替代方案。

Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

2605.14866v1 by Lingzhe Zhang, Tong Jia, Kangjin Wang, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li

As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.

摘要:隨著現代微服務系統因動態互動和不斷演變的運行環境而變得越來越複雜,它們經歷故障的頻率也在上升。因此,確保系統可靠性在很大程度上依賴於準確的根本原因定位(RCL)。雖然已經探索了許多傳統的機器學習和深度學習方法來解決這一任務,但它們往往存在有限的可解釋性和跨部署的轉移能力差的問題。最近,基於大型語言模型(LLM)的方法被提出來解決這些問題。然而,現有的基於LLM的方法仍然面臨兩個基本限制:上下文膨脹,這稀釋了關鍵證據並降低了定位準確性,以及串行推理結構,這妨礙了深入的因果探索並損害了推理效率。在本文中,我們對人類SRE在實踐中如何執行根本原因定位以及為什麼現有的基於LLM的方法不夠充分進行了全面研究。基於這些發現,我們介紹了RCLAgent,一個針對微服務系統的深入根本原因定位框架,實現了多代理的思維遞歸和並行推理。RCLAgent通過將每個跨度分配給專用代理,並根據圖拓撲遞歸和並行組織代理,來沿著跟蹤圖分解診斷過程,最終的診斷通過綜合根級診斷報告和全球證據圖獲得。在多個公共基準上的大量實驗表明,RCLAgent在定位準確性和推理效率方面始終優於最先進的方法。

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

2605.14857v1 by Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in multi-dimensional rule reasoning: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a deterministic agentic workflow in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

摘要:harmonized system (HS) 關稅分類是一項高風險、專家級的任務,其中必須將自由形式的產品描述映射到一般解釋規則 (GIR)、章節註釋、部分註釋和解釋性註釋下的特定六位或八位代碼。 難點不在於知識的量,而在於 多維規則推理:正確的分類必須同時滿足多個軸上的競爭優先規則,包括材料、形式、功能、本質特徵、部分與整體的邊界,以及特定清單與剩餘標題之間的區別。 大型語言模型的端到端提示特徵性地失敗,因為它解決了一個軸,同時忽略了其他軸上的優先約束。 我們提出了一種 確定性代理工作流程,與自我規劃代理形成對比:控制流是固定的,語言模型調用限制在狹窄的階段內,反思和驗證作為局部機制保留。 這種設計通過結構化輸出逐步分解每個決策,並逐字引用相關的章節或部分註釋,從而實現可解釋性。 該架構結合了中國 HS 關稅的離線知識工程與在線六階段管道。 在六位數級別的 HSCodeComp 上進行評估,該工作流程在四位數達到 75.0% 的 top-1 和 91.5% 的 top-3,在六位數達到 64.2% 的 top-1 和 78.3% 的 top-3,使用 Qwen3.6-plus;在非思考模式下,開放權重的 Qwen3.6-27B-FP8 主幹與前沿模型達成 84.2% 四位數和 77.4% 六位數的 top-1 一致性。 對 226 個六位數分歧進行的兩階段手動審核表明,HSCodeComp 的真實標籤中可能有一個非微不足道的部分偏離 HS 一般規則;完整的裁決記錄作為社區審查的初步發現在附錄中發布。

Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers

2605.14855v1 by Lukas Schelenz, Shobha Rajanna, Denis Gosalci, Lucas Heublein, Jonas Pirkl, Jonathan Ott, Felix Ott, Christopher Mutschler, Tobias Feigl

Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.

摘要:預測在信號處理管道中對於減少延遲至關重要,特別是在預測物體(如NBA球員)的動態運動時。這項任務面臨重大挑戰,因為體育運動本質上是互動且不可預測的,速度和方向的突然變化十分普遍。傳統方法,包括(S)ARIMA(X)、卡爾曼濾波器(KF)和粒子濾波器(PF),在建模這些場景中存在的非線性動力學時,往往表現不佳。機器學習(ML)方法,如長短期記憶(LSTM)網絡、圖神經網絡(GNNs)和Transformer,提供了更大的靈活性和準確性,但經常無法明確捕捉時間依賴性和上下文互動之間的相互作用,而這在混亂的體育環境中至關重要。在本文中,我們評估這些模型並評估它們的優缺點。實驗結果揭示了在輸入歷史長度、可泛化性和納入上下文信息的能力方面的關鍵性能權衡。基於ML的方法在預測範圍達到2秒的情況下,顯示出相較於線性模型的顯著改進。在測試的架構中,我們的混合LSTM增強了上下文信息,達到了最低的最終位移誤差(FDE)為1.51米,超越了時間卷積神經網絡(TCNN)、圖注意力網絡(GAT)和Transformer,同時相比於GAT和Transformer需要更少的數據和訓練時間。我們的研究結果表明,沒有單一架構在所有指標上都表現優秀,強調了在快速變化、動態環境(如NBA比賽)的軌跡預測中需要考慮特定任務的需求。

Emotion-Attended Stateful Memory (EASM):The Architecture for Hyper-Personalization at Scale

2605.14833v1 by Vineet Kotecha, Vansh Gupta

Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary

摘要:目前的語言模型系統在會話中基本上仍然是無狀態的,限制了它們隨著時間推移個性化互動的能力。雖然檢索增強生成和微調改善了知識訪問和領域能力,但它們並未使個別用戶的持久理解成為可能。我們提出了一種情感關注的有狀態記憶架構,該架構在推理時動態構建用戶特定的對話上下文,利用長期歷史、情感信號和推斷的意圖。為了評估其影響,我們在三十個非腳本化的對話中進行了受控的A/B研究,這些對話涵蓋了六個情感上明顯不同的類別,並在兩種條件下使用相同的基礎語言模型。增強記憶的條件在所有評估場景中始終優於無狀態基線。在記憶基礎(改善95%)、計劃清晰度(57%)和情感驗證(34%)方面觀察到了最大的增益。即使在涉及悲傷、痛苦和不確定性的情感對抗性對話中,結果也保持一致。這些發現表明,有狀態的情感記憶可能代表超個性化AI系統的基礎基礎設施層,儘管在更大和更多樣化的評估中仍需進行更廣泛的驗證。

A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

2605.14802v1 by Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.

摘要:大型語言模型經常遭遇事實遺失、時間線混亂、角色漂移以及在長期互動中穩定性降低的問題,尤其是在高噪聲知識庫、上下文清除和跨模型轉移的情況下。為了解決這些問題,我們引入了ARPM,一個用於長期對話的外部時間記憶治理框架。ARPM將靜態知識記憶與動態對話經驗記憶分開,並結合了向量檢索、BM25、RRF融合、雙時間重排序、時間證據閱讀以及證據驗證和答案綁定的控制分析協議。與將角色一致性編碼到模型權重中或僅依賴長上下文的方法不同,ARPM將連續性視為一個可追溯、可審計和可轉移的治理問題。利用工程日誌,我們進行了三個實驗。首先,在一個50輪的問答設置中,我們比較了信號對噪聲比為1:5和1:200+的情況,並區分了CSV自動判斷與手動審查。在1:5的情況下,CSV召回準確率為54.0%,而手動審查將其提高到100.0%。在1:200+的情況下,這些值分別為44.0%和80.0%。這些結果顯示,自動規則在支持證據進入提示後可能低估召回率。其次,消融結果顯示對話歷史檢索對於近期連續性是必要的:禁用它會將嚴格準確率從100%降低到66.7%,禁用BM25則將其降低到80.0%,這表明純語義檢索不足以進行修正和追踪。第三,在一個510萬字符的噪聲基底下,定期清除上下文和多模型交接的情況下,ARPM保持了語義連續性、邊界連續性和角色一致性,同時暴露出由於協議遵從性不足而造成的限制。這些發現顯示,長期角色一致性可以分解為可治理的組件,並以白盒方式進行評估。

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

2605.14790v1 by Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.

摘要:研究想法生成是自動化科學研究的創新驅動步驟。最近,大型語言模型(LLMs)顯示出在大規模自動化想法生成方面的潛力。然而,現有的方法主要依賴於通過靜態檢索相關文獻或複雜的提示工程來引導LLMs生成想法,而未能捨棄參考文獻之間的結構關係。我們提出了研究圖(Graphs of Research, GoR),這是一種監督式微調方法,為每篇種子論文提取2跳參考鄰域,從引用位置、頻率、前驅鏈接和出版時間推導這些參考文獻之間的關係,並將它們組織成一個論文演變有向無環圖(DAG)。我們構建了一個自動提取管道,從五個主要的機器學習/自然語言處理(ML/NLP)會議中提取數據,包括498/50/50的訓練/驗證/測試種子論文和約7,600個被引用的參考文獻。Qwen2.5-7B-Instruct-1M在一個包含引用圖、邊緣信號、參考信息和任務定義的結構化文本提示上進行微調,以預測種子論文的想法。在與gpt-4o驅動的基準進行的正面對正面LLM評審比賽中,GoR-SFT達到了最先進技術(SOTA),展示了引用演變圖作為LLM基礎想法生成的監督信號的有效性。我們希望這能降低引用演變圖作為監督的障礙,加速自動化科學創新。

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

2605.14754v1 by Gong Zhiren, Tiantong Wu, Jiaming Zhang, Fuyao Zhang, Che Wang, Yurong Hao, Yikun Hou, Foo Ping, Yilei Zhao, Fei Huang, Chau Yuen, Wei Yang Bryan Lim

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

摘要:大型語言模型(LLMs)越來越多地被用於知識綜合,但它們在科學知識中的組合泛化能力仍然未被充分描述。現有的基準主要集中在單回合的限制場景,未能捕捉到真實世界互動科學工作流程所暴露的能力邊界。為了解決這個問題,我們引入了XDomainBench,一個用於互動跨學科科學推理的診斷基準。我們形式化了組合順序和混合結構,以便從單一學科到跨學科進行系統性的壓力測試,包括在20個領域和4個任務類別中進行的8,598個互動會話,並涵蓋8種現實的軌跡模式,模擬難度和領域混合動態,真實地模擬AI4S場景。對LLMs的大規模評估顯示,隨著組合順序的增加,推理能力系統性崩潰,這源於兩個根本原因:(i)由領域組合引起的直接難度增加,以及(ii)間接互動放大失敗,其中軌跡模式觸發錯誤累積、推理中斷和領域混淆,最終導致會話崩潰。

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

2605.14752v1 by Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu

Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) deployment parado-large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming sota LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.

摘要:準確識別學生的誤解對於個性化教育至關重要,但面臨三個挑戰: (1) 數據稀缺且呈長尾分佈,真實的學生推理難以綜合; (2) 錯誤類別之間的邊界模糊,標註噪聲高; (3) 部署悖論——大型模型因預訓練偏見而忽視非常規方法,無法在邊緣設備上部署,而小型模型則過度擬合噪聲。與傳統方法通過大規模數據合成來增加多樣性不同,我們提出了一個兩階段知識蒸餾框架,從現有數據中挖掘高價值樣本。第一階段執行標準蒸餾以轉移任務能力。第二階段引入基於認知不確定性的雙層邊際選擇機制,根據教師模型的不確定性和信心差異識別四種類型的關鍵樣本。對於不同數據子集,我們設計了難度自適應機制,以平衡難/易標籤的貢獻,使學生模型能夠從教師的軟標籤中繼承類間關係,同時區分模糊的錯誤類型。實驗顯示,僅在10.30%的過濾樣本上進行增強訓練,我們在MAP-Charting數據集上達到了0.9585的MAP@3 (+17.8%),並且僅使用一個4B參數模型,在中學代數誤解基準的跨主題測試中達到了84.38%的準確率,顯著超越了sota LLM(67.73%)和標準微調的72B模型(81.25%)。我們的代碼可在https://github.com/RoschildRui/acl2026_map獲得。

EVA: Editing for Versatile Alignment against Jailbreaks

2605.14750v1 by Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang

Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model's susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model's general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.

摘要:大型語言模型(LLMs)和視覺語言模型(VLMs)展示了令人印象深刻的能力,但仍然容易受到越獄攻擊,對手利用文本或視覺觸發器來繞過安全防護。最近的防禦通常依賴於安全微調或外部過濾器,以降低模型產生有害內容的可能性。雖然在某種程度上有效,但這些方法往往會產生顯著的計算開銷,並且面臨安全效用的權衡,從而降低模型在良性任務上的表現。為了解決這些挑戰,我們提出了EVA(針對越獄的多功能對齊編輯),這是一個開創性框架,開創了直接模型編輯在安全對齊中的應用。EVA將安全對齊重新定義為一個精確的知識修正任務。EVA不需要重新訓練大量參數,而是識別並精確編輯負責模型對有害指令敏感的特定神經元,同時保持模型絕大多數部分不變。通過局部更新,EVA有效中和了有害行為,而不妨礙模型的一般推理能力。大量實驗證明,EVA在減輕LLMs和VLMs的越獄問題上優於基準,為部署後的安全對齊提供了一個精確且高效的解決方案。

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

2605.14723v1 by Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

摘要:重症監護病房中的敗血症管理需要在快速變化的病人體徵下做出連續的治療決策。雖然大型語言模型(LLMs)編碼了廣泛的臨床知識並能夠推理指導方針,但它們並不固有地基於行動條件下的病人動態。我們介紹了SepsisAgent,一個增強世界模型的LLM代理,用於敗血症治療建議。SepsisAgent使用學習到的臨床世界模型來模擬病人在候選液體-血管收縮劑干預下的反應,並遵循提議-模擬-精煉的工作流程,然後再進行處方。我們首先顯示僅依賴世界模型的訪問會導致LLM決策性能不一致,這促使了特定代理的訓練。然後,我們通過三個階段的課程訓練SepsisAgent:病人動態監督微調、提議-模擬-精煉行為克隆,以及基於世界模型的代理強化學習。在MIMIC-IV敗血症軌跡上,SepsisAgent在離線政策價值方面超越了所有傳統的RL和基於LLM的基準,同時在遵循指導方針和不安全行為指標下達到了最佳的安全性配置。進一步分析顯示,與臨床世界模型的重複互動使代理能夠學習病人演變中的規律,即使在移除模擬器訪問的情況下,這些規律仍然有用。

On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

2605.14721v1 by Giovanni Buraglio, Wolfgang Dvorak, Stefan Woltran

Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

摘要:強等價性在知識庫之間確保了在任何給定的上下文中,可以互相替換而不影響推理結果。這使其成為非單調形式主義中的一個關鍵特性。特別是,邏輯編程和抽象論證的領域提供了這一特性受到廣泛研究的主要例子。然而,雖然(類別的)邏輯程序和抽象論證框架在靜態環境中被認為是語義上等價的,但這種一致性在動態環境中因為更新的不同概念而破裂。因此,強等價性並不總是能夠從一種形式主義轉移到另一種形式主義。在本文中,我們仔細調查了這一差異,並為邏輯程序引入了一個新的強等價性概念。我們的方法在某些類別的邏輯程序和Dung風格及增強主張的論證框架之間的轉換下保持強等價性,從而恢復這些形式主義之間的兼容性。

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

2605.14704v1 by Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

摘要:在現實世界場景中,目標物體可能位於不可見的區域。雖然人類通常可以根據上下文和常識推斷被遮擋物體的位置,但這一能力對於視覺-語言模型(VLMs)來說仍然是一個重大挑戰。為了解決這一差距,我們推出了SceneFunRI,一個用於推理隱形物體的基準測試。基於SceneFun3D數據集,SceneFunRI將任務定義為通過半自動流程進行的2D空間推理問題,並包含855個實例。它要求模型根據任務指令和常識推理推斷隱形功能物體的位置。最強基線模型(Gemini 3 Flash)僅達到CAcc@75的15.20,mIoU為0.74,Dist為28.65。我們將提示分析分為三個類別:強指令提示、基於推理的提示和空間排除過程(SPoE)。這些發現表明,隱形區域推理在當前的VLMs中仍然是一種不穩定的能力,這促使未來的工作更加緊密地整合任務意圖、常識先驗、空間基礎和不確定性感知搜索。

2605.14665v2 by Joy Bose

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.

摘要:法律推理並不是語義相似性搜索。法院判決編碼了受限的符號推理:先例傳播、程序狀態轉換和受法規約束的推理。這些是基於向量的檢索增強生成(RAG)無法忠實表現的特性。虛構的先例、過時的法規引用和不支持的推理鏈在基於大型語言模型(LLM)的法律人工智慧中仍然是持續的失敗模式,對於印度等高案件負擔的司法管轄區的司法公正造成了真實後果。本文提出了Falkor-IRAC,一個針對印度法律人工智慧的圖約束生成框架,將生成基於IRAC(問題、規則、分析、結論)知識圖中的結構化推理。來自印度最高法院和高等法院的判決被作為IRAC節點結構攝取,並增強了程序狀態轉換、先例關係和法定引用,存儲在FalkorDB中以實現低延遲的代理遍歷。在推理時,僅當可以通過圖追蹤到有效的支持路徑時,LLM生成的答案才會被接受,這一檢查由一個稱為驗證代理的可證偽性神諭執行。系統還將教義衝突檢測為一級輸出,而不是默默解決它們。Falkor-IRAC使用圖本地指標進行評估:引用基礎準確率、路徑有效率、虛構先例率和衝突檢測率。這些指標被認為比BLEU和ROUGE更適合法律推理評估。在一個包含51個最高法院判決的概念驗證語料庫中,驗證代理正確驗證了完成查詢的引用,並正確拒絕了虛構的引用。對僅基於向量的RAG基準的評估留待未來的工作。伴隨本文發布的InIRAC數據集包含500多個帶有IRAC註釋的結構化印度法院判決。

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

2605.14636v1 by Chenlu Ding, Jiancan Wu, Yanchen Luo, Zheyuan Liu, Yancheng Yuan, Xiang Wang

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

摘要:大型語言模型(LLMs)在時間截止方面常常無法進行推理:當被提示從早期的角度回答時,它們會利用只有在稍後才可獲得的知識。我們通過事前推理的視角來研究這一失敗,在這種情況下,模型必須僅依賴於截止之前可知的信息。通過對提示級別干預的系統分析,我們發現時間泄漏對截止的表述和指令的位置非常敏感:明確的截止聲明優於隱含的歷史框架,而前綴約束比後綴約束更有效地減少泄漏。這些發現表明,提示可以引導模型進入一個時間框架,但並不賦予它們驗證回應是否在時間上可接受的能力。我們進一步認為,監督微調是不夠的,因為事前正確性不是答案的內在屬性,而是答案與截止之間的關係。為了解決這一差距,我們提出了TCFT,一個時間批評微調框架,旨在訓練模型獲得截止意識的時間驗證。給定一個查詢、一個截止和一個候選回應,TCFT教會模型識別截止後的泄漏,解釋時間邊界違規,並判斷時間的可接受性。與Qwen2.5-7B-Instruct和Qwen2.5-14B-Instruct的實驗顯示,TCFT始終優於提示和SFT基準,平均泄漏分別減少了41.89和37.79個百分點。

Action-Inspired Generative Models

2605.14631v1 by Eshwar R. A., Debnath Pal

We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_φ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_φ$'s guiding signal. Crucially, $V_φ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_φ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

摘要:我們介紹了行動啟發生成模型(AGMs),這是一種雙網絡生成框架,其動機來自於觀察到現有的橋樑匹配方法對於運輸景觀中的每個隨機轉換分配均勻的回歸權重,無論給定的橋樑樣本是否位於結構一致的軌跡上或退化的軌跡上。我們通過引入一個輕量級的學習標量潛能 $V_φ$ 來解決這個問題,該潛能在線評分橋樑樣本並通過通過停止梯度障礙衍生的權重來調整漂移目標——防止兩個網絡之間的對抗性反饋,同時保留 $V_φ$ 的引導信號。至關重要的是,$V_φ$ 僅佔主要漂移網絡參數數量的約 1.4%,對推理圖不增加任何開銷,並且不需要迭代的半橋擬合或輔助隨機微分方程(SDE)求解器:這是一個即插即用的增強,適用於任何橋樑匹配訓練循環。在推理時,$V_φ$ 被完全丟棄,留下標準的歐拉-丸山積分的指數移動平均(EMA)漂移。我們證明了通過學習潛能有選擇性地懲罰無信息的運輸路徑會在保真度和覆蓋度指標上產生一致的生成質量改進。

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

2605.14619v1 by Kang Chen, Junjie Nian, Yixin Cao, Yugang Jiang

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard howsampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.

摘要:多次運行的思維鏈推理通常被簡化為最終答案的聚合,這忽略了抽樣的軌跡如何在中間計算中共享、分裂和重新結合。我們提出了 SliceGraph,一種基於 CoT 切片之間稀疏激活鍵 Jaccard 相似度的互動 kNN 構建的事後問題模型單元圖,並將其視為過程幾何的測量對象,而不是解碼程序。在來自三個主要的 4B/8B 模型的數學和科學基準的抽樣 CoT 集合中,盲目標註支持 SliceGraph 的雙連通組件作為共享推理狀態單元,而過程家族則作為內部家族策略一致的路徑單元。在 954 個問題模型單元中,85.5% 的正確 CoT 共享相同的標準化答案,分裂為多個過程家族;在至少有兩次此類運行的單元中,76.6% 的運行對在平均上是跨家族的。我們稱這種相同答案、家族分歧的正確軌跡為過程異構體。一個標籤引導的獎勵場提供了一個獨立的價值景觀層:成功相關區域通常分裂為不相連的高價值核心,而路徑家族則在這些核心足跡上專業化,而不僅僅是相互複製。類型狀態轉換分析進一步顯示,過程家族在匹配的虛無控制下以不同的轉換核導航同一圖集。表示消融、跨架構複製和兩個跨尺度複製支持路徑家族支架的穩健性,顯示最終答案聚合忽略了這種結構化的多路徑過程幾何。

VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce

2605.14542v1 by Yuyan Chen

A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

摘要:一位熟練的直播電商主持人不僅僅是敘述者,而是一位銷售代理,通過專業的產品知識、情感智力的反應策略以及作為產品曝光載體的娛樂,將觀眾的好奇心轉化為購買意圖。然而,目前沒有任何現有的人工智慧系統能夠複製這一點:對話推薦系統將推薦視為終端行為,而通用的大型語言模型則會幻覺產品聲明,並默認使用無法吸引或說服的通用促銷模板。我們提出了VerbalValue,一個以銷售轉換為導向的虛擬主持人,將卓越的口語能力轉化為實際的商業價值,基於三個貢獻。首先,我們構建了一個產品規格的領域知識庫和一個經過策劃的銷售術語詞彙表,將產品相關的回應固定在經過驗證的專業知識上。其次,我們收集並標註了1,475次涵蓋多樣觀眾意圖的直播電商互動。第三,我們在這些數據上微調了一個大型語言模型,以提供同理心的、以商業為導向的回應,通過同理心增強、基於證據的反駁和幽默的轉移來適應觀眾意圖。針對GPT-5.4、Claude Sonnet 4.6、Gemini 3.1 Pro及其他基準的實驗顯示,在信息量上提升了23%,在事實正確性上提升了18%,並在得體性和觀眾參與度上持續優勢。

Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

2605.14501v1 by Edoardo Scarpel, Alberto Pettena, Matteo Cederle, Federico Chiariotti, Marco Fabris, Gian Antonio Susto

This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.

摘要:這篇論文提出了一種完全動態的深度強化學習(DRL)方法,用於重新平衡無停靠自行車共享系統,克服了定期、系統性干預的限制。我們通過基於圖形的模擬器來建模服務,並將重新平衡視為一個馬可夫決策過程。一個DRL代理實時路由單一卡車,執行由時空關鍵性分數指導的本地化取貨、放貨和充電行動。在真實數據上的實驗顯示,在最小車隊規模下,可用性故障顯著減少,同時限制了空間不平等和流動性沙漠。我們的方法展示了基於學習的重新平衡對於高效和可靠的共享微型出行的價值。

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

2605.14498v1 by Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.

摘要:大型語言模型(LLM)代理人越來越多地作為個人助手和工作場所的協作夥伴,其效用依賴於能夠在長時間對話中提取、檢索和應用信息的記憶系統。然而,現有的記憶系統和基準都是圍繞二元的單用戶設置構建的,儘管實際部署通常涉及多個用戶與代理人及彼此之間的互動。這種不匹配使得群體記憶的三個特性未被測量:(i)超越連接的一對一聊天的群體動態,(ii)基於說話者的信念追蹤,其中需要每位用戶的記憶建模,以及(iii)適應觀眾的語言,其中心智理論的轉變產生角色特定的詞彙。我們介紹了GroupMemBench,一個揭示這三個特性的基準。一個基於圖形的合成管道生成具有可控回覆結構的多方對話,並根據每位用戶的角色和目標觀眾對每條消息進行條件設置。隨後,一個對抗性查詢管道將每個問題綁定到六個類別中的特定提問者,涵蓋多跳推理、知識更新、術語歧義、用戶隱含推理、時間推理和放棄,並迭代搜索反映全面記憶能力的挑戰性現實查詢。對領先記憶系統的基準測試揭示了一個明顯的崩潰:最強的系統僅達到46.0%的平均準確率,知識更新為27.1%,術語歧義為37.7%,而一個簡單的BM25基準則與大多數代理記憶系統相匹配或超過。這表明當前的記憶攝取抹去了群體記憶所依賴的結構和詞彙特徵,使得多用戶記憶仍然遠未解決。

Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

2605.14495v1 by Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao

Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.

摘要:多媒體驗證不僅需要準確的結論,還需要透明且可爭辯的推理。我們提出了一個可爭辯的多代理框架,該框架整合了多模態大型語言模型、外部驗證工具以及基於競技場的定量雙極論證(A-QBAF),作為對ICMR 2026多媒體驗證大挑戰的提交。我們的方法將每個案例分解為以主張為中心的部分,檢索目標證據,並將證據轉換為具有來源和強度分數的結構化支持和攻擊論證。這些論證通過小型局部論證圖進行解決,並具有選擇性衝突解決和不確定性感知升級。最終生成的系統產生逐節的驗證報告,這些報告透明、可編輯,並且在現實世界的多媒體驗證中具有計算實用性。我們的實現公開於:https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems。

Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty

2605.14494v1 by Tianjue Lin, Jianan Zhou, Jieyi Bi, Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang

Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong calability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.

摘要:兩階段穩健優化 (2RO) 在離散不確定性下是具有挑戰性的,通常會使得精確解變得難以獲得。情境減少通過選擇一小部分具有代表性的情境來緩解這一問題,以便進行可處理的計算。然而,現有的方法在很大程度上是與問題無關的,僅僅對不確定性集進行操作,而不考慮可行區域或補救結構。在本文中,我們介紹了PRISE,一種以問題為驅動的序列前瞻啟發式方法,通過評估每個情境的邊際影響來構建減少的情境集。雖然PRISE產生高質量的情境子集,但每個選擇步驟需要解決多個子問題,使其在大規模時計算成本高昂。為了解決這個問題,我們提出了NeurPRISE,一種基於GNN-Transformer骨幹的神經代理模型,通過圖卷積編碼每個情境的結構,並通過注意力捕捉跨情境的互動。NeurPRISE通過模仿學習進行訓練,並使用增益感知的排名目標,將PRISE的邊際增益信息提煉為學習的評分函數,用於情境排名和選擇。在三個2RO問題上的廣泛結果顯示,NeurPRISE在相對於綜合方法時始終實現具有競爭力的遺憾,並在不同數量的情境下保持強大的可擴展性,並在速度上比PRISE快7-200倍。NeurPRISE還展現出強大的零樣本泛化能力,有效處理具有更大問題規模(最多5倍)、更多情境(最多4倍)和分佈變化的實例。

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

2605.14473v2 by Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families -- CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus -- but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

摘要:檢索增強生成(RAG)中的上下文合規性體系發生在檢索的上下文主導最終答案,即使它與模型的參數知識相衝突。僅僅依賴準確性並不能揭示在這種衝突下,檢索的上下文如何因果地塑造答案。我們引入了上下文驅動分解(CDD),這是一種在推理時運作的信念分解探針,並作為控制檢索衝突的干預機制。在Epi-Scale壓力測試、TruthfulQA誤解注入和跨模型重跑中,CDD揭示了三種模式。P1:在一個上限對抗設定中,上下文合規性是可測量的,標準RAG在TruthfulQA誤解注入中達到15.0%的準確率(N=500)。P2:對抗準確性增益在模型家族之間轉移——CDD提高了Gemini-2.5-Flash和Claude Haiku/Sonnet/Opus的準確性——但理由-答案的因果耦合並不轉移。CDD在Gemini-2.5-Flash上達到64.1%的錯誤注入因果敏感性,而所有三個Claude變體的敏感性均落在[-3%,+7%]範圍內,這表明Claude側的準確性增益是通過一種與明確衝突解決痕跡不同的機制運作的。P3:明確的衝突分解在時間漂移和噪聲干擾下提高了穩健性,CDD在全Epi-Scale對抗基準上在時間變化上達到71.3%,在干擾證據上達到69.9%。這三種模式將上下文合規性確定為一個結構性軸,標準RAG可以在此進行探測和干預,這與檢索質量或單一方法的穩健性問題不同,並促使釋放Epi-Scale以便在模型家族和檢索管道中進行系統研究。

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

2605.14416v1 by Wen Wang, Xiangchen Wu, Liang Wang, Hao Hu, Xianping Tao

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.

摘要:容量限制的車輛路徑規劃問題 (CVRP) 是一個基本的 NP-hard 問題,在物流和運輸領域有廣泛的應用。現實世界中的 CVRP 通常涉及多樣的目標和複雜的約束條件,例如時間窗口或回程需求,這促使了統一解決框架的發展。最近的強化學習 (RL) 方法在組合優化中顯示出潛力,但它們依賴於端到端學習,缺乏明確的問題解決知識,限制了解決方案的質量。在本文中,我們提出了一個嵌入知識的框架,靈感來自於路徑優先、聚類次優的啟發式方法。它在兩個層面上整合知識:(1) 將 CVRP 分解為路徑優先和聚類次優的子問題,以及 (2) 利用動態規劃來解決第二個子問題,其結果指導基於 RL 的建構性求解器來解決第一個問題。為了減輕由於問題分解引起的部分可觀察性,我們引入了一個統一的歷史增強上下文處理模塊。大量實驗表明,這一框架在解決方案質量上優於最先進的基於學習的方法,與經典啟發式方法之間的差距更小,顯示出在多樣的 CVRP 變體中具有強大的泛化能力。

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

2605.14407v1 by Xiang Li

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

摘要:主流的關於人工智慧限制的論述將人工智慧能力的邊界框架定義為數位任務(人工智慧擅長的領域)與實體任務(需要具身的任務)之間的分界。我們認為這種框架忽略了最重要的邊界:數位任務內部的邊界。我們識別出一類我們稱之為 Metis AI 的任務,這個名稱源於希臘概念 metis(實踐的、情境的知識),這些任務完全在電腦上執行,但卻抵抗可靠的人工智慧自動化。這些任務在計算上並不是不可處理的;它們在制度上、社會上和規範上交織在一起,以至於使算法方法失效。我們區分了構成性 metis(由於形式化的行為而被摧毀的知識)與操作性 metis(自動化可以逐步吸收的系統特定熟悉度),並提出五個定義 Metis AI 區域的結構特徵:後果性不可逆性、關係性不可簡化性、規範性開放性、對抗性共同演化,以及問責制錨定。我們將每個特徵與社會科學、哲學和人道實踐中的既有理論相結合,並主張這些特徵是任務本身的屬性,而不是當前模型的限制,並顯示出適當的設計回應不是更好的自動化,而是人類主導、人工智慧支持的半人馬架構。

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

2605.14404v1 by Kyomin Hwang, Hyeonjin Kim, Sangyeon Cho, Nojun Kwak

While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.

摘要:雖然大型語言模型(LLMs)在商業服務中的使用越來越普遍,但它們也帶來了隱私風險,例如敏感的個人識別資訊(PII)洩漏。對於在多語言語料庫上訓練的LLMs,多語言機器遺忘(MMU)旨在刪除跨多種語言的信息。然而,先前的MMU評估未能捕捉到這種跨語言的信息分佈,主要限於對每種語言評估協議的直接擴展。為此,我們提出了兩個指標來評估跨語言的信息傳播:知識可分離性得分(KSS)和知識持續性得分(KPS)。KSS衡量多種語言之間的整體遺忘質量,而KPS則更具體地旨在評估不同語言對之間信息的一致刪除。我們使用這些指標在多語言環境中評估了各種遺忘方法,並進行了全面的分析。通過我們的調查,我們提供了對MMU獨特現象的見解,並為MMU評估提供了一種新視角。

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

2605.14392v1 by Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve--verify asymmetry, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator, solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.

摘要:我們追求一個自我改進語言模型的願景,在這個願景中,模型不僅僅生成問題或痕跡以供模仿,而是構建訓練它的環境。在零數據推理強化學習中,這將自我改進的框架從數據生成循環轉變為環境構建循環,其中每個工件都是一個可重用的可執行對象,能夠抽樣實例、計算參考和評分回應。這一願景是否能持續改進取決於一個單一的特性:環境必須表現出穩定的解決—驗證不對稱性,模型必須能夠寫出一個在新實例上無法可靠執行的神諭。這種不對稱性有兩種互補的形式。一些任務在算法上難以推理,但作為代碼卻是微不足道的:一個動態程序或圖遍歷,編譯一次,產生無限多的經過校準的實例。其他任務則本質上難以解決,但容易驗證,比如植入的子集和或約束滿足問題。這兩者在提議和解決之間創造了一個持久的差距,政策無法通過操縱驗證者來縮小這個差距,而正是這個差距使得獎勵在學習者改進時保持信息量。我們在EvoEnv中實現了這一觀點,這是一個單一政策生成器、求解方法,從十個種子合成Python環境,並僅在經過分階驗證、語義自評、求解器相對難度校準和新穎性檢查後接受它們。最強的證據來自於已經強大的範疇:在Qwen3-4B-Thinking上,固定公共數據的RLVR和固定手工製作環境的RLVR降低了平均值,而EvoEnv則將其從72.4提高到74.8,相對增益為3.3%。我們建議,穩定的自我改進不在於產生更多的合成數據,而在於模型學會構建那些難度結構上超出自身範疇的世界。

Nexus : An Agentic Framework for Time Series Forecasting

2605.14389v1 by Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty, Chun-Liang Li, Rui Zhang, Jinsung Yoon, Tomas Pfister

Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware to real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.

摘要:時間序列預測不僅僅是數值外推,還常常需要利用非結構化的上下文數據進行推理,例如新聞或事件。雖然專門的時間序列基礎模型(TSFMs)在基於數值模式的預測方面表現出色,但它們對現實世界的文本信號卻無法察覺。相反,儘管大型語言模型(LLMs)正在成為零樣本預測者,但它們在不同領域和上下文基礎上的表現仍然不均衡。為了彌補這一差距,我們引入了Nexus,一個多代理預測框架,將預測分解為專門的階段:隔離宏觀和微觀的時間波動,並在可用時整合上下文信息,然後合成最終預測。這種分解使Nexus能夠從季節性信號適應到波動的事件驅動信息,而無需依賴外部統計錨點或單一提示。我們顯示當前一代LLMs擁有比之前認識到的更強大的內在預測能力,這在很大程度上取決於數值和上下文推理的組織方式。在對僅在LLM知識截止日期之後的數據進行評估時,這些數據涵蓋了Zillow房地產指標和波動的股票市場,Nexus始終與最先進的TSFMs和強大的LLM基準相匹配或超越。在數值準確性之外,Nexus還生成高質量的推理痕跡,明確顯示每個預測背後的基本驅動因素。我們的結果確立了現實世界的預測是一個代理推理問題,遠超過僅僅是序列建模。

Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

2605.14374v1 by Young-Chae Hong, Yangho Chen

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

摘要:資料中的模式發現對於各種領域至關重要,包括醫療保健、風險評估和機械維護。與黑箱深度學習模型相對,符號規則發現成為一項關鍵的數據挖掘任務,生成可供人類解釋的規則,提供透明性和直觀的可解釋性。本文介紹了最佳模式檢測樹(OPDT),這是一種基於新型混合整數規劃的基於規則的機器學習模型,通過二元分類在數據中發現單一最佳模式。為了納入先前知識和合規要求,我們進一步引入了分支結構約束(BSC)框架,使決策者能夠將領域知識和約束直接編碼到模型中。這種基於優化的方法在數據集中發現潛在的隱藏模式(如果存在的話),通過識別一條最佳規則來最大化覆蓋率,同時最小化由於錯誤分類造成的假陽性率。我們的計算實驗顯示,OPDT能在合理的運行時間內,在中等大小的數據集上發現具有最佳性保證的模式。

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

2605.14366v1 by Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.

摘要:擴展大型語言模型(LLMs)至低資源語言通常會產生“對齊稅”:目標語言的改進以一般能力的災難性遺忘為代價。我們主張這種權衡源於監督微調(SFT)的剛性,該方法在狹窄且偏見的數據分佈上強制執行標記級別的表面模仿。為了解決這一限制,我們提出了一種由群體相對策略優化(GRPO)驅動的語義空間對齊範式,其中模型使用嵌入級別的語義獎勵進行優化,而不是最大化似然。這一目標通過靈活的實現鼓勵意義的保留,使得控制更新得以減少對預訓練知識的破壞性干擾。我們在藏語-中文機器翻譯和藏語標題生成上評估了我們的方法。實驗表明,我們的方法在獲得低資源能力的同時顯著減輕了對齊稅,更有效地保留了一般能力,超過了SFT。儘管產生的表面重疊較少,但語義強化學習在開放式生成中產生了更高的語義質量和偏好,而少量樣本轉移結果表明,在有限的監督下,它學習到了更具可轉移性和穩健性的表示。總的來說,我們的研究表明,使用語義獎勵的強化學習為包容性低資源語言擴展提供了一條更安全和可靠的途徑。

CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

2605.14344v2 by Yuyang Wu, Stefano Falletta, Delia McGrath, Sherry Yang

Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner (CrysReas), an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrysReas introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrysReas then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrysReas obtains better performance on diverse metrics, triples S.U.N. ratio, and achieves better performance for property conditioned generation. CrysReas also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures.

摘要:生成模型已成為晶體結構發現的一種有前景的方法。然而,現有的基於LLM的生成模型在低級原子精度方面表現不佳,而基於擴散的方法在整合高級科學知識方面也有所不足。因此,生成的結構往往是無效的、不穩定的,或者不具備理想的特性。為了解決這一問題,我們提出了CrystalReasoner (CrysReas),這是一個端到端的LLM框架,通過推理和對齊從自然語言指令生成晶體結構。CrysReas引入了物理先驗作為思維標記,包括晶體對稱性、局部協調環境和預測的物理性質,然後生成原子坐標。這彌補了自然語言與三維結構之間的差距。接著,CrysReas採用強化學習(RL)與多目標、密集的獎勵函數,將生成過程與物理有效性、化學一致性和熱力學穩定性對齊。對於屬性條件任務,我們設計了特定任務的獎勵函數,並為離散約束(例如,空間群)和連續性質(例如,彈性、熱膨脹)訓練專門的模型。實證結果表明,與之前的工作和沒有思維痕跡或RL的基準相比,CrysReas在多種指標上獲得了更好的性能,三倍的S.U.N.比率,並在屬性條件生成方面表現更佳。CrysReas還展現了自適應推理,隨著原子數量的增加而增加推理長度。我們的工作展示了利用思維痕跡和RL生成有效、穩定和屬性條件晶體結構的潛力。

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

2605.14322v1 by Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

摘要:語言代理人越來越多地被應用於複雜的專業工作流程中,其中輔導作為一種特別高風險的能力,仍然在現有基準中大部分未被測量。有效的輔導代理人需要的不僅僅是產生正確的答案或執行準確的工具調用:一個強大的輔導者必須診斷學習者的狀態,隨著時間的推移調整支持,做出基於教育證據的教學決策,並在現實的學習管理系統中執行干預。我們介紹了EduAgentBench,這是一個基於來源的基準,用於全面評估輔導代理人在教學工作全範圍內的表現。它包含150個經質量控制的任務,涵蓋三個能力面向:專業的教學判斷、情境多輪輔導和Canvas風格的教學工作流程完成。任務是通過以教學洞察為驅動的流程構建的,並通過互補的驗證信號和人工審查進行評估。在對前沿模型的全面評估中,我們的研究結果顯示,當前模型通常能夠進行有限的教學判斷,但在情境輔導和自主教學工作流程執行方面仍然未達到專業教學標準。據我們所知,EduAgentBench是第一個以理論為基礎且現實的基準,用於評估輔導代理人的整體教學能力,為開發能夠支持現實教學工作的未來輔導代理人提供了測量基礎。

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

2605.14309v2 by Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

摘要:在視覺-語言模型(VLMs)中,機器遺忘通常在圖像或實例層面進行,這使得精確移除目標知識而不影響無關語義變得困難。這個問題特別明顯,因為單一圖像通常包含多個交織的概念,包括需要遺忘的目標概念和應該保留的上下文信息。在本文中,我們提出了一個可解釋的概念層級遺忘框架,該框架利用多模態大型語言模型從遺忘集構建一個緊湊的任務特定概念詞彙表。除了模態對齊之外,視覺表示被分解為稀疏的、非負的語義概念組合,提供了一個明確的界面以進行細緻的知識操作。基於這種分解,我們的方法將遺忘公式化為概念層級的優化,其中目標概念被選擇性地抑制,而實例內的非目標語義和全局跨模態知識則得以保留。在域內和域外的遺忘設置中進行的廣泛實驗表明,我們的方法能夠實現更全面的目標遺忘,更好地保留同一圖像中的非目標知識,並與現有的VLM遺忘方法相比,維持競爭的模型效用。

Web Agents Should Adopt the Plan-Then-Execute Paradigm

2605.14290v1 by Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner

ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.

摘要:ReAct 已成為 LLM 代理的默認架構,許多現有的網路代理遵循這一範式。我們認為這對於網路代理來說是錯誤的默認選擇。相反,網路代理應該默認採用計劃然後執行:在觀察運行時的網路內容之前,先承諾一個特定任務的程序,然後再執行它。原因在於網路內容混合了來自多方的輸入。一個電子商務產品頁面可能結合了賣家的列表、客戶評論和贊助廣告。在 ReAct 下,所有這些內容在決定下一步行動時都流入模型中,為提示注入提供了直接的途徑來引導代理的控制流程。計劃然後執行改變了這一邊界:不受信任的數據可能影響預定執行圖中的值或分支,但它不能重新定義用戶任務或導致模型在運行時合成新行動。我們分析了 WebArena,一個流行的網路代理基準,發現所有任務都與計劃然後執行兼容,而 80% 的任務可以通過純粹的程序計劃完成,而不需要任何運行時的 LLM 子程序。我們確定了在網路上採用計劃然後執行的主要障礙:為了使其運作良好,工具必須清晰地映射到語義行動,並且在執行之前已知效果,以便代理擁有足夠的信息來計劃。網路並不自然地暴露這一介面。瀏覽器工具如點擊、輸入和滾動具有頁面依賴的含義。在這一層進行計劃是近視的:代理只能看到當前頁面的行動,而後續行動只有在其行動後才會出現。縮小這一差距需要類型化介面,將網站互動從點擊和鍵入轉變為任務級操作。這是一個基礎設施問題,而不是建模問題。網路任務不需要默認的反應性;它們需要類型化、完整且可審計的網站 API。

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

2605.14259v1 by Ling Wang, Songnan Liu, Jianan Wang, Cheng Cheng, Xin Liu, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

摘要:將大型語言模型(LLMs)應用於異質企業系統受到幻覺和多跳、n-元推理失敗的阻礙。現有的範式(例如,GraphRAG,NL2SQL)缺乏這些複雜環境所需的語義基礎和可審計的執行。我們介紹了HEAR,一個基於分層超圖本體的企業代理推理器。其基本圖層虛擬化了具有來源感知的數據接口,而超邊層編碼了n-元商業規則和程序協議。HEAR運行一個基於證據的推理循環,動態協調本體工具以進行結構化的多跳分析,而無需重新訓練LLM。在供應鏈任務的評估中,包括訂單履行阻塞根本原因分析(RCA),顯示HEAR達到了高達94.7%的準確率。關鍵是,HEAR展示了自適應效率:利用程序超邊來最小化令牌成本,同時利用拓撲探索來確保複雜查詢的嚴謹正確性。通過將專有模型性能與開放權重骨幹匹配並自動化手動診斷,HEAR建立了一個可擴展的、可審計的企業智能基礎。

Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

2605.14258v1 by Jesseba Fernando, Grigori Guitchounts

Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer's nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production--scale LLMs and show that training installs a monotonic spectral gradient through depth -- from non-normal, rotation-dominated early layers to near--symmetric late layers -- together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream's effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and is largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network's functional topology.

摘要:大型語言模型具有驚人的能力,但計算如何在其層中傳播仍然不甚了解。越來越多的研究將深度視為離散時間,並將殘差流視為動態系統,其中每一層的非線性更新都有一個局部線性描述。然而,以往的分析依賴於標量摘要或近似線性化,導致訓練後的LLM的完整光譜幾何仍然未知。我們對三個生產規模的LLM進行了完整的雅可比特徵分解,並顯示訓練在深度上安裝了一個單調的光譜梯度——從非正規、以旋轉為主的早期層到接近對稱的晚期層——以及一個累積的低秩瓶頸,將擾動引導到殘差流有效維度的一小部分。我們的實驗顯示,這個梯度和維度崩潰是學習到的,而不是架構性的,當結構性非正規性被移除時,這種現象大部分會消失。我們進一步顯示,圖社區的拓撲位置預測雅可比是放大還是抑制它們,耦合的符號由局部運算符類型決定,而這種關係在初始化時是不存在的。這些結果描繪了LLM中學習到的光譜幾何,將擾動傳播和壓縮與網絡的功能拓撲聯繫起來。

What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

2605.14257v1 by Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .

摘要:我們描述了兩種類型的詞彙難度預測模型:一種是高準確度的黑箱模型,該模型在公開賽道中獲得了最佳共享任務結果;另一種是可解釋模型,該模型的表現超過了微調的編碼器基準。作為黑箱模型,我們使用軟目標損失函數對LLM進行了微調,以有效應用於評分任務,達到r > 0.91。可解釋模型提供了對每個項目難度影響因素的見解,同時保持強相關性(r > 0.77)。我們進一步分析了結果,展示了英國文化協會的知識型詞彙列表(KVL)中的項目難度通常受到拼寫難度或測試項目結構的影響,除了單詞的真正產出難度之外。我們的代碼可在https://github.com/adno/vocabulary-difficulty 在線獲得。

Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

2605.14252v1 by Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann

Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl

摘要:尖峰神經網絡(SNNs)受到大腦啟發並以尖峰驅動,達到了高能效。然而,SNNs 與人工神經網絡(ANNs)之間仍然存在性能差距。知識蒸餾(KD)通常被採用來改善 SNN 的性能,但現有的方法通常在所有時間步上強制進行均勻對齊,無論是來自教師網絡還是通過時間間隔自我蒸餾,隱含地假設每個時間步的預測應該被平等對待。在實踐中,SNN 的預測隨時間變化和演變,即使最終的聚合輸出是正確的,中間時間步也不必全部正確。在這種情況下,有效的蒸餾不應該強迫每個時間步朝向相同的監督目標,而應該為錯誤的時間步提供修正指導,同時保留有用的時間動態。為了解決這個問題,我們提出了選擇性對齊知識蒸餾(SeAl-KD),它通過在錯誤的時間步上平衡競爭的邏輯來選擇性地對齊類別級和時間知識,並根據信心和時間步之間的相似性重新加權時間對齊。在靜態圖像和神經形態事件基礎數據集上進行的廣泛實驗顯示出對現有蒸餾方法的一致改進。代碼可在 https://github.com/KaiSUN1/SeAl 獲得。

Why Retrieval-Augmented Generation Fails: A Graph Perspective

2605.14192v1 by Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng, Jie Ren, Haoyu Han, Jiliang Tang

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph, circuit-level view of how external evidence is integrated into the model's reasoning process across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

摘要:檢索增強生成(RAG)已成為一種強大且廣泛使用的方法,通過基於檢索到的證據來改善大型語言模型。然而,RAG 系統在許多情況下仍然會產生不正確的答案。儘管有外部信息可用,RAG 為何失敗仍然不甚了解。我們提出了一個模型內部研究,檢查檢索到的證據如何影響答案生成。通過電路追蹤,我們構建了歸因圖,模擬解碼過程中信息在Transformer層之間的流動。這些圖表示了檢索上下文、中間模型激活和生成標記之間的互動,提供了一個圖形、電路級別的視角,展示外部證據如何在多個問題回答基準中整合到模型的推理過程中,我們觀察到一致的結構差異:正確的預測展現出更深的推理路徑、更分散的證據流以及更有結構的局部連接模式,而失敗的預測則顯示出較淺、破碎和過於集中化的證據流。在這些發現的基礎上,我們開發了一個基於圖形的錯誤檢測框架,利用歸因圖的拓撲特徵。此外,我們展示了歸因圖能夠實現有針對性的干預。通過加強問題約束的證據基礎,我們重塑內部路由,使答案生成始終受到問題的指導,從而更有效地整合檢索到的信息並減少錯誤。

Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

2605.14177v1 by Harshita Chopra, Krishna Kant Chintalapudi, Suman Nath, Ryen W. White, Chirag Shah

Long-horizon personalization requires dialogue assistants to retrieve user-specific facts from extended interaction histories. In practice, many relevant facts often have low semanticsimilarity to the query under dense retrieval. Standard Retrieval-Augmented Generation (RAG) and GraphRAG systems are still largely retrospective: they rely on embedding similarity to the query or on fixed graph traversals, so they often miss facts that matter for the user's needs but lie far from the query in embedding space. Inspired by prospection, the human ability to use imagined futures as cues for recall, we introduce Prospection-Guided Retrieval (PGR), which decouples retrieval from how memories are stored. Given a user query, PGR first expands the goal into a short Tree-of-Thought (ToT) or linear chain of plausible next steps, and uses these steps as retrieval probes rather than relying on the original query alone. The facts retrieved by these probes are then used to personalize the next round of prospection, enabling PGR to uncover additional memories that become relevant only after the simulation is grounded in the user's history. We also introduce MemoryQuest, a challenging multi-session benchmark in which each query is annotated with 3--5 dated reference facts subject to a low query-reference similarity constraint. Across 1,625 queries spanning 185 user profiles from 3 publicly available datasets, PGR-TOT substantially improves retrieval, including nearly 3x recall on MemoryQuest over the strongest baseline. In pairwise LLM-as-judge comparisons against baselines, PGR-generated responses are preferred on 89--98% of queries, with blinded human annotations on held-out subsets showing the same trend. Overall, the results demonstrate that explicit prospection yields large gains in long-horizon retrieval and response quality relative to similarity-only baselines.

摘要:長期個性化需要對話助手從擴展的互動歷史中檢索用戶特定的事實。實際上,許多相關事實在密集檢索下通常與查詢的語義相似度較低。標準的檢索增強生成(RAG)和GraphRAG系統仍然主要是回顧性的:它們依賴於與查詢的嵌入相似度或固定的圖遍歷,因此常常錯過對用戶需求重要但在嵌入空間中距離查詢較遠的事實。受到前瞻性啟發,人類利用想像的未來作為回憶線索的能力,我們引入了前瞻性引導檢索(PGR),它將檢索與記憶的存儲方式解耦。給定用戶查詢,PGR首先將目標擴展為一個短的思維樹(ToT)或合理的下一步線性鏈,並將這些步驟作為檢索探針,而不是僅依賴原始查詢。這些探針檢索到的事實然後用於個性化下一輪的前瞻性,使PGR能夠發現只有在模擬與用戶歷史相結合後才變得相關的額外記憶。我們還引入了MemoryQuest,這是一個具有挑戰性的多會話基準,其中每個查詢都附有3-5個有日期的參考事實,並受到低查詢-參考相似度約束。在涵蓋185個用戶檔案的3個公開可用數據集中的1,625個查詢中,PGR-TOT顯著提高了檢索效果,包括在MemoryQuest上相較於最強基線幾乎提高了3倍的召回率。在與基線的成對LLM作為評判的比較中,PGR生成的回應在89-98%的查詢中被偏好,對保留子集的盲人人工標註顯示出相同的趨勢。總體而言,結果顯示,相較於僅基於相似度的基線,明確的前瞻性在長期檢索和回應質量上帶來了巨大的提升。

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

2605.14164v1 by Stefan Baack, Christo Buschek, Maty Bohacek

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

摘要:建立和比較基礎及生成 AI 模型的主要方式已從同行評審文獻轉變為新聞稿和公司部落格文章,在這些地方,模型建構者強調在選定基準上的結果。這些文獻現在在很大程度上定義了研究人員和公眾的最新技術狀態。儘管它們的顯著性,模型建構者選擇強調哪些基準,以及他們通過這一選擇傳達什麼,卻鮮有深入研究。為了調查這一點,我們引入並開源了 Benchmarking-Cultures-25,這是一個包含 231 個基準的數據集,涵蓋 2025 年 11 家主要 AI 建構者的 139 次模型發布,並提供了一個互動工具來探索數據。我們的分析揭示了一個破碎的評估格局,跨模型的可比性有限:63.2% 的突出基準僅由一位建構者使用,38.5% 僅出現在一次發布中。少數基準實現了廣泛使用(例如,GPQA Diamond、LiveCodeBench、AIME 2025)。此外,不同建構者根據其敘事對基準賦予不同的能力。為了理清這些矛盾的表述,我們開發了一個統一的分類法,將不同的術語映射到基於基準作者聲稱測量的共享信號框架。“一般知識應用”是第二受歡迎但定義模糊的類別。質性分析顯示,許多此類基準淡化了構念效度,而是將結果框架為通向 AGI 的進展指標。它們的作者聲稱廣泛測量知識或推理,但主要評估 STEM 科目(特別是數學)。我們認為,突出的基準更像是靈活的敘事工具,優先考慮市場定位而非科學評估,而非標準化的測量工具。數據:https://hf.co/datasets/matybohacek/benchmarking-cultures-25;工具:https://bench-cultures.net。

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

2605.14126v1 by Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

摘要:快速醫療互操作性資源(FHIR)是互操作性醫療數據交換的主導標準。在FHIR中,電子健康記錄形成了一個有向資源圖。在FHIR上回答臨床有意義的問題需要代理執行多步推理、過濾和跨多種資源類型的聚合。先前的研究顯示,即使是工具增強的LLM代理(檢索、代碼執行、多輪規劃)也常常選擇錯誤的資源或違反遍歷約束。我們在FHIR-AgentBench的背景下研究這個問題,這是一個針對現實世界醫院數據的真實問題回答基準,並將FHIR上的推理框架設置為可查詢結構圖上的序列決策問題。我們實現了一個多輪CodeAct代理,並使用自定義環境和工具進行強化學習後訓練。一個LLM評判者提供基於執行的獎勵。與基於提示的封閉模型基準相比,強化學習後訓練在強化數據完整性約束的同時提高了性能。實證結果顯示,我們的方法將FHIR-AgentBench上的答案正確率從50%(o4-mini)提高到77%,使用的是一個更小且更便宜的Qwen3-8B模型。我們提出了一個端到端的後訓練流程(環境構建、環境構造、模型訓練和自定義評估),該流程可靠地提高了對結構化臨床圖的多輪推理。

MathAtlas: A Benchmark for Autoformalization in the Wild

2605.14061v1 by Nilay Patel, Noah Arias, Davit Babayan, Victoria Cochran, Timothy Libman, Hafsah Mahmood, Liam McCarty, Soli Munoz, Laurel Willey, Jeffrey Flanigan

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in the wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness for autoformalization on this challenging dataset. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

摘要:目前的自動形式化基準主要集中在奧林匹克或本科數學上,而研究生和研究級數學仍然未被充分探索。在本文中,我們介紹了 MathAtlas,這是第一個大規模的自動形式化基準,涵蓋了實際中的研究生級數學,包含約 52,000 個定理、定義、練習、例子和從 103 本研究生數學教科書中提取的證明。MathAtlas 充實了包含約 178,000 個關係的數學依賴圖,並且是第一個包含此類關係的自動形式化基準,促進了依賴感知自動形式化系統的評估和開發。我們的廣泛實驗表明,MathAtlas 具有高質量但極具挑戰性:強基線在定理陳述上最多達到 9.8% 的正確率,在定義上則為 16.7%。此外,我們發現最先進模型的性能隨著依賴深度的增加而顯著下降:在 MA-Hard 中,這是一個包含 700 個具有最深依賴樹的實體的子集,最佳模型在這個具有挑戰性的數據集上僅達到 2.6% 的自動形式化正確率。我們將 MathAtlas 發布給社區,作為大規模自動形式化研究生級數學的基準集。

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

2605.14053v1 by Ignacio Sastre, Guillermo Moncecchi, Aiala Rosá

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

摘要:大型語言模型在問答應用中的應用顯示出巨大的潛力,但在使用這些模型時,特別是在知識密集型和特定領域的任務中,會出現如幻覺和錯誤推理等重要挑戰。為了解決這些問題,我們引入了推導提示(Derivation Prompting),這是一種針對檢索增強生成框架生成步驟的新型提示技術。這種方法受到邏輯推導的啟發,涉及通過系統性地應用預定規則從初始假設中推導結論。它構建了一個可解釋的推導樹,並增加了對生成過程的控制。我們在一個特定的案例研究中應用了這種方法,顯著減少了與傳統的RAG和長上下文窗口方法相比的不可接受答案。

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

2605.13989v1 by Juan S. Santillana

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80->3.17->3.00->2.16); after SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, the model attains a conversational gate of 0.78+-0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate -- a tool-dense corpus (2,801 examples) raises B4 to 0.145+-0.046 on Nano 42M and 0.445+-0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under llama.cpp, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and B1-B5 benchmark are released.

摘要:我們推出了 VectraYX-Nano,這是一個擁有 41.95M 參數的僅解碼器語言模型,從零開始用西班牙語訓練,專注於網絡安全,並通過模型上下文協議(MCP)進行本地工具調用。四個貢獻:(i)語料庫:VectraYX-Sec-ES,一個由 170M 令牌組成的西班牙語語料庫,來自一個八虛擬機的管道(約 25 美元),分為對話(42M 令牌,OpenSubtitles-ES,OASST1)、網絡安全(118M 令牌,NVD,Wikipedia-ES,CVE 鏡像,安全博客)和攻擊性安全工具(10M 令牌,ExploitDB,HackTricks,OWASP)階段。(ii)架構:擁有 42M 參數的 Transformer 解碼器,配備 GQA、QK-Norm、RMSNorm、SwiGLU、RoPE、z-loss,以及 16,384 令牌的字節回退 BPE。(iii)帶回放的課程:持續的預訓練與回放緩衝區產生單調損失下降(9.80->3.17->3.00->2.16);在 OASST-ES、Alpaca-ES、CVE 問答和 6,327 次工具使用痕跡上進行 SFT 後,該模型達到 0.78+-0.05 的對話門檻(N=4 種子)。(iv)兩個發現:一個自引導語料庫消融顯示在納米尺度上存在損失與註冊的反轉;一項 LoRA 研究顯示 B4 工具選擇的下限 0.000 是語料庫密度的產物,而不是容量門檻——一個工具密集的語料庫(2,801 範例)將 B4 提升至 0.145+-0.046 在 Nano 42M 上以及 0.445+-0.201 在 260M 中階上。GGUF 工件大小為 81 MB(F16),在商品硬體上以亞秒 TTFT 運行,並且據我們所知,這是第一個具有端到端 MCP 整合的西班牙語本地網絡安全 LLM。語料庫配方、訓練腳本、GGUF 權重和 B1-B5 基準已發布。

Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

2605.13981v1 by Katherine Lambert, Sasha Luccioni

The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.

摘要:大型語言模型的部署增加驅動了GPU需求的激增和數據中心的擴展,這引發了對電力使用、電網壓力以及現代AI工作負載影響的擔憂。蒸餾常被推廣為獲得更便宜、更高效模型的最有效途徑之一,然而這些說法很少考慮到完整的端到端能源和資源成本,包括關鍵的教師端工作負載,如數據生成、logit緩存和評估。我們提出了一個全面的能源核算框架,通過詳細的階段性GPU設備功耗追蹤來測量蒸餾管道的完整計算成本。在我們的實驗中,我們分離並記錄不同階段的實證能源使用,並系統性地測量兩種常見蒸餾方法的能源和排放:經典的基於logit的知識蒸餾和合成數據的監督微調,構建能源-質量的Pareto邊界,揭示之前被忽視的成本。根據這些測量和分析,我們推導出在能源和預算限制下選擇蒸餾方法和超參數的實用設計規則,並發布了一個開源測量工具和核算協議,以提供可比較、可重複的蒸餾研究的標準化基礎,明確負責完整管道的能源影響。

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

2605.13950v1 by Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.

摘要:自主語言模型代理在長期工具使用任務中的評估越來越多,但現有的基準很少捕捉到真實科學工作的複雜性和細微差別。為了解決這一差距,我們引入了Collider-Bench,這是一個基準,用於評估LLM代理是否能夠僅使用公開論文和開放科學軟體來重現大型強子對撞機(LHC)的實驗分析。這些分析通常難以重現,因為公共工具鏈僅近似於實驗合作中內部使用的軟體,而已發表的論文不可避免地省略了忠實重建所需的實現細節。因此,代理必須依賴物理推理、領域知識和反覆試驗來填補這些空白。每個任務要求代理將已發表的分析轉化為可執行的模擬和選擇管道,並在指定的信號區域中提交預測的碰撞事件產量。這些預測使用標準直方圖指標進行評估,提供連續的真實度分數,而不需要手寫的評分標準。我們還報告了每個代理每個任務所產生的計算成本。最後,我們使用LLM評審評估代碼庫和完整會話追蹤,以捕捉質量失敗模式,例如虛構、幻覺和重複。我們發布了一組初始任務,這些任務來自LHC搜索,並提供了一個容器化的沙盒和事件模擬工具。我們在通用編碼代理的能力梯度上進行評估。我們的結果顯示,平均而言,沒有任何代理能可靠地超越物理學家在環中的解決方案。

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

2605.13846v1 by Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into phonemic transcription, and then the transcription into English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.

摘要:這篇論文介紹了 WARDEN,一個早期的語言模型系統,能夠將一種瀕危的澳大利亞原住民語言 Wardaman 轉錄並翻譯成英語。我們面臨的重大挑戰是缺乏大規模的訓練數據:事實上,我們只有 6 小時的註釋音頻。因此,雖然使用大型數據集(如英語到法語)訓練單一模型進行轉錄和翻譯是常見做法,但在 Wardaman 到英語的情境中,這種做法已不再可行。為了應對低資源挑戰,我們設計了 WARDEN,使其擁有獨立的轉錄和翻譯模型:WARDEN 首先將 Wardaman 音頻輸入轉換為音素轉錄,然後將轉錄轉換為英語翻譯。此外,我們提出了兩種有用的技術來提升性能。對於轉錄,我們從 Sundanese(一種與 Wardaman 共享相似音素的語言)初始化 Wardaman 標記,以加速轉錄模型的微調。對於翻譯,我們從專家註釋中編纂了一本 Wardaman-英語字典,並將這一特定領域的知識提供給大型語言模型(LLM),以推理並決定最終輸出。我們實證證明,這種兩階段設計在極低數據環境下的表現優於數據需求量大的統一方法。使用僅僅 6 小時的註釋數據,WARDEN 的表現超過了更大型的開源和專有模型,並建立了強有力的基準。數據和代碼均可用。

An LLM-Based System for Argument Reconstruction

2605.13793v1 by Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument reconstruction.

摘要:論證是人類推理的基本面向,其中主張被支持、挑戰並相互權衡。我們提出了一個基於端到端大型語言模型(LLM)的系統,旨在將自然語言文本中的論證重建為抽象論證圖。該系統遵循一個多階段的流程,逐步識別論證組件、選擇相關元素並揭示它們的邏輯關係。這些元素被表示為由兩種類型的組件(前提和結論)和三種類型的關係(支持、攻擊和削弱)組成的有向無環圖。我們進行了兩個互補的實驗來評估該系統。首先,我們對來自論證理論教科書的論證進行手動評估,以評估系統恢復論證結構的能力。其次,我們在基準數據集上進行定量評估,通過將我們的輸出映射到既定的標註方案,允許與先前的工作進行比較。結果顯示,該系統能夠充分恢復論證結構,並且在適應不同的標註方案時,能在基準數據集上達到合理的性能。這些發現突顯了基於LLM的流程在可擴展論證重建方面的潛力。

EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents

2605.13941v1 by Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.

摘要:長期記憶對於在多個會話中運作的LLM代理至關重要,然而現有的記憶系統將檢索基礎設施視為固定:儲存的內容不斷演變,而評分函數、融合策略和答案生成政策在部署時保持不變。我們認為,真正的自適應記憶需要在兩個層面上共同演化:儲存的知識和查詢該知識的檢索機制。我們提出了EvolveMem,一種自我演化的記憶架構,將其完整的檢索配置暴露為一個由LLM驅動的診斷模塊優化的結構化行動空間。在每個演化回合中,該模塊會讀取每個問題的失敗日誌,識別根本原因,並提出針對性的配置調整;一個受控的元分析器以自動回退和停滯探索的保障來應用這些調整。這個閉環自我演化實現了一個自動研究過程:系統自主地對其自身架構進行迭代研究循環,取代手動配置調整。從一個最小的基線開始,該過程自動收斂,發現有效的檢索策略,包括原始行動空間中不存在的全新配置維度。在LoCoMo上,EvolveMem相對於最強基線提高了25.7%,並且相對於最小基線達到了78.0%的改善。在MemBench上,EvolveMem超過了最強基線18.9%的相對增益。演化的配置在基準測試之間轉移時表現出積極而非災難性的轉移,這表明自我演化過程捕捉了普遍的檢索原則,而非特定於基準的啟發式方法。代碼可在https://github.com/aiming-lab/SimpleMem獲得。

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

2605.13695v1 by Andrea Morandi

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.

摘要:LLM-as-a-judge 現在是開放式生成的默認測量工具,但在公共 JudgeBench 基準上,即使是強大的指令調整評判者在客觀正確性成對項目上也僅僅勉強超過隨機。我們介紹 RTLC,一種三階段提示配方——研究、教學以學習、批評——它將單一的黑箱 LLM 轉變為一個無需微調、檢索或外部工具的思考集成評判者。第一階段將輸入包裹在固定的教學框架中,將費曼學習技術(學習 $\to$ 教學 $\to$ 發現差距 $\to$ 簡化)移植到 LLM 提示中。第二階段在溫度 0.4 下抽取 N=10 個獨立的候選判決。第三階段充當自己的批評者,將候選集與原始問題進行交叉比較,以在溫度 0 下發出一個經過批評的判決。在 JudgeBench-GPT(350 個困難的成對項目)上,Claude 3.7 Sonnet 的成對準確率從 64.6%(單次普通提示)上升到 78.6%(RTLC 的 10 次批評)——絕對增幅為 14.0 個百分點。RTLC 也超越了 N=10 自我一致性多數投票(77.7%)和零樣本第一候選(74.0%)。一個乾淨的三步消融實驗將 +9.4 pp 歸因於教學以學習框架,+3.7 pp 歸因於 N=10 的邊際化,以及 +0.9 pp 歸因於明確批評。我們討論了成本-準確性邊界(RTLC 在每個工作點上都高於自我一致性)、四個 JudgeBench 類別(知識、推理、數學、編碼)之間的錯誤預算分解,以及 RTLC 如何與事後評判分數校準正交組合,這兩種干預在實踐中以乘法方式增強效果。