Knowledge Graphs

Publish Date	Title	Authors	arXiv ID	Code
2026-04-03	Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization	Dipto Sumit et.al.	2604.03192v1	null
2026-04-03	Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation	Prakhar Bansal et.al.	2604.03174v1	null
2026-04-03	Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation	Nazanin Jafari et.al.	2604.03141v1	null
2026-04-03	Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization	Zihe Liu et.al.	2604.03110v1	null
2026-04-03	AlertStar: Path-Aware Alert Prediction on Hyper-Relational Knowledge Graphs	Zahra Makki Nayeri et.al.	2604.03104v1	null
2026-04-03	Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach	Jawad Mohammed et.al.	2604.03043v1	null
2026-04-03	Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?	Qianshan Wei et.al.	2604.03016v1	null
2026-04-03	FedSQ: Optimized Weight Averaging via Fixed Gating	Cristian Pérez-Corral et.al.	2604.02990v1	null
2026-04-03	LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation	Yilin Xiao et.al.	2604.02954v1	https://github.com/Jord8061/logicPoison
2026-04-03	Analysis of Optimality of Large Language Models on Planning Problems	Bernd Bohnet et.al.	2604.02910v1	null
2026-04-03	Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration	Wachiravit Modecrua et.al.	2604.02869v1	null
2026-04-03	LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction	Luc Pommeret et.al.	2604.02866v1	null
2026-04-03	LLM+Graph@VLDB'2025 Workshop Summary	Yixiang Fang et.al.	2604.02861v1	null
2026-04-03	GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics	Yujing Wang et.al.	2604.02830v1	null
2026-04-03	Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection	Chaoqun He et.al.	2604.02819v1	null
2026-04-03	QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models	Xinhao Wang et.al.	2604.02816v1	null
2026-04-03	When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs	Linyu Li et.al.	2604.02778v1	null
2026-04-03	Analytic Drift Resister for Non-Exemplar Continual Graph Learning	Lei Song et.al.	2604.02633v1	null
2026-04-03	Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge	Yiyang Shen et.al.	2604.02621v1	null
2026-04-03	OntoKG: Ontology-Oriented Knowledge Graph Construction with Intrinsic-Relational Routing	Yitao Li et.al.	2604.02618v1	null
2026-04-03	AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models	Yuntao Du et.al.	2604.02617v1	null
2026-04-02	Competency Questions as Executable Plans: a Controlled RAG Architecture for Cultural Heritage Storytelling	Naga Sowjanya Barla et.al.	2604.02545v1	null
2026-04-02	Opal: Private Memory for Personal AI	Darya Kaviani et.al.	2604.02522v1	null
2026-04-02	Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting	Roland Mühlenbernd et.al.	2604.02512v1	null
2026-04-02	A Comprehensive Framework for Long-Term Resiliency Investment Planning under Extreme Weather Uncertainty for Electric Utilities	Emma Benjaminson et.al.	2604.02504v1	null
2026-04-02	Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview	Shramana Dey et.al.	2604.02448v1	null
2026-04-02	Self-Directed Task Identification	Timothy Gould et.al.	2604.02430v1	null
2026-04-02	Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation	Daiwei Chen et.al.	2604.02324v1	null
2026-04-02	LumiVideo: An Intelligent Agentic System for Video Color Grading	Yuchen Guo et.al.	2604.02409v1	null
2026-04-02	Crystalite: A Lightweight Transformer for Efficient Crystal Modeling	Tin Hadži Veljković et.al.	2604.02270v1	null
2026-04-02	Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider	Tina. J. Jat et.al.	2604.02259v1	null
2026-04-02	When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning	Juarez Monteiro et.al.	2604.02226v1	null
2026-04-02	Universal Hypernetworks for Arbitrary Models	Xuanfeng Zhou et.al.	2604.02215v1	null
2026-04-02	LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications	Mayank Mayank et.al.	2604.02206v1	null
2026-04-02	Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model	Jaemin Kim et.al.	2604.02194v1	null
2026-04-02	TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning	Zhanting Zhou et.al.	2604.02183v1	null
2026-04-02	Adam's Law: Textual Frequency Law on Large Language Models	Hongyuan Adam Lu et.al.	2604.02176v1	null
2026-04-02	GaelEval: Benchmarking LLM Performance for Scottish Gaelic	Peter Devine et.al.	2604.02135v1	null
2026-04-02	Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning	Yuhang Wu et.al.	2604.02091v1	null
2026-04-02	Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection	Soo Won Seo et.al.	2604.02071v1	null
2026-04-02	Improving MPI Error Detection and Repair with Large Language Models and Bug References	Scott Piersall et.al.	2604.02398v1	null
2026-04-02	Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions	Pengcheng Lyu et.al.	2604.02061v1	null
2026-04-02	SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning	Daeyong Kwon et.al.	2604.01993v1	null
2026-04-02	Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models	Florian Kelber et.al.	2604.01965v1	null
2026-04-02	Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia	Saja Al-Dabet et.al.	2604.01962v1	null
2026-04-02	How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization	Ramon Ferrer-i-Cancho et.al.	2604.01938v1	null
2026-04-02	CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift	HyunGi Kim et.al.	2604.01845v1	null
2026-04-02	Domain-constrained knowledge representation: A modal framework	Chao Li et.al.	2604.01770v1	null
2026-04-02	FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation	Taimur Khan et.al.	2604.01766v1	null
2026-04-02	AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows	Chuhan Qiao et.al.	2604.01738v1	null
2026-04-02	The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs	Wilf Morlidge et.al.	2604.01728v1	null
2026-04-02	LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis	Zhihuan Wei et.al.	2604.01725v1	null
2026-04-02	Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition	Truc Nguyen et.al.	2604.01711v1	null
2026-04-02	Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework	Yanchen Wu et.al.	2604.01707v1	null
2026-04-02	MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning	Sten Rüdiger et.al.	2604.01694v1	null
2026-04-02	PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment	Chenning Xu et.al.	2604.01682v1	null
2026-04-02	Can Heterogeneous Language Models Be Fused?	Shilian Chen et.al.	2604.01674v1	null
2026-04-02	PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation	Yanxin Luo et.al.	2604.01671v1	null
2026-04-02	Hierarchical Memory Orchestration for Personalized Persistent Agents	Junming Liu et.al.	2604.01670v1	null
2026-04-02	Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion	Juncen Guo et.al.	2604.01669v1	null
2026-04-02	M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis	Rui Dong et.al.	2604.01667v1	null
2026-04-02	CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery	Ao Qu et.al.	2604.01658v1	null
2026-04-02	AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models	Yunge Wen et.al.	2604.01650v1	null
2026-04-02	Exploring Robust Multi-Agent Workflows for Environmental Data Management	Boyuan Guan et.al.	2604.01647v1	null
2026-04-02	CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning	Junyoung Sung et.al.	2604.01634v1	null
2026-04-02	GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation	Taraneh Ghandi et.al.	2604.01610v1	null
2026-04-02	From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?	Binyan Xu et.al.	2604.01608v1	null
2026-04-02	ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context	Andy Nguyen et.al.	2604.01599v1	null
2026-04-02	Do Large Language Models Mentalize When They Teach?	Sevan K. Harootonian et.al.	2604.01594v1	null
2026-04-02	A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies	Congjing Zhang et.al.	2604.01529v1	null
2026-04-01	A Self-Evolving Agentic Framework for Metasurface Inverse Design	Yi Huang et.al.	2604.01480v1	null
2026-04-01	Can LLMs Predict Academic Collaboration? Topology Heuristics vs. LLM-Based Link Prediction on Real Co-authorship Networks	Fan Huang et.al.	2604.01379v1	null
2026-04-01	Ambig-IaC: Multi-level Disambiguation for Interactive Cloud Infrastructure-as-Code Synthesis	Zhenning Yang et.al.	2604.02382v1	null
2026-04-01	No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents	Tiankai Yang et.al.	2604.01350v1	null
2026-04-01	Procedural Knowledge at Scale Improves Reasoning	Di Wu et.al.	2604.01348v1	null
2026-04-01	IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering	Elliott Watkiss-Leek et.al.	2604.01344v1	null
2026-04-01	Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models	Marco Morini et.al.	2604.01280v1	null
2026-04-01	Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning	Mohammad R. Abu Ayyash et.al.	2604.01152v1	null
2026-04-01	Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach	Maofeng Tang et.al.	2604.01141v1	null
2026-04-01	Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation	Reyhaneh Ahani Manghotay et.al.	2604.01118v1	null
2026-04-01	Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines	Jingjie Ning et.al.	2604.01029v1	null
2026-04-01	Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications	Yi Cao et.al.	2604.00987v1	null
2026-04-01	Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts	Sha Li et.al.	2604.00901v2	null
2026-04-01	Transforming OPACs into Intelligent Discovery Systems: An AI-Powered, Knowledge Graph-Driven Smart OPAC for Digital Libraries	M. S. Rajeevan et.al.	2604.01262v1	null
2026-04-01	LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation	Patrick Amadeus Irawan et.al.	2604.00829v1	null
2026-04-01	From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks	Ayan Datta et.al.	2604.00778v1	null
2026-04-01	BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction	Sayed Hashim et.al.	2604.00739v1	null
2026-04-01	To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining	Karan Singh et.al.	2604.00715v1	null
2026-04-01	Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures	Lewis Tham et.al.	2604.00694v1	null
2026-04-01	A Survey of On-Policy Distillation for Large Language Models	Mingyang Song et.al.	2604.00626v1	null
2026-04-01	Speech LLMs are Contextual Reasoning Transcribers	Keqi Deng et.al.	2604.00610v1	null
2026-04-01	Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents	Thanh Luong Tuan et.al.	2604.00555v1	null
2026-04-01	Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation	Zhiting Fan et.al.	2604.00536v1	null
2026-04-01	Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics	Iyad Ait Hou et.al.	2604.00443v1	null
2026-04-01	TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning	Wenxuan Jiang et.al.	2604.00438v1	null
2026-04-01	COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving	Seohyoung Park et.al.	2604.00402v1	null
2026-04-01	RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems	KrishnaSaiReddy Patil et.al.	2604.00387v1	null
2026-04-01	Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning	Eric Hanchen Jiang et.al.	2604.00344v1	null
2026-03-31	Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections	Gaurav Rajesh Parikh et.al.	2604.00284v1	null
2026-03-31	A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks	Joseph Townsend et.al.	2604.00167v1	null

Abstracts

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

2604.03192v1 by Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin, Niloy Farhan, Farig Yousuf Sadeque

We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student's position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122 percent of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
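
The abstract does not give the EWAD formulas, so the following is only a minimal sketch of what token-level routing between teacher distillation and gold supervision could look like. The weighting rule (majority-vote agreement multiplied by one minus the normalized entropy of the averaged teacher distribution) and all function names are assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(teacher_probs):
    """Average the teachers' next-token distributions."""
    n = len(teacher_probs)
    vocab = len(teacher_probs[0])
    return [sum(t[i] for t in teacher_probs) / n for i in range(vocab)]


def teacher_weight(teacher_probs):
    """Hypothetical routing weight in [0, 1] for one token position.

    Assumption: trust the teachers more when their top-1 predictions agree
    and their averaged distribution has low entropy. Requires vocab size > 1.
    """
    n = len(teacher_probs)
    vocab = len(teacher_probs[0])
    votes = Counter(max(range(vocab), key=t.__getitem__) for t in teacher_probs)
    agreement = votes.most_common(1)[0][1] / n
    confidence = 1.0 - entropy(mean_distribution(teacher_probs)) / math.log(vocab)
    return agreement * confidence


def routed_loss(student_probs, teacher_probs, gold_index):
    """Blend KL(teacher mean || student) with cross-entropy on the gold token."""
    w = teacher_weight(teacher_probs)
    target = mean_distribution(teacher_probs)
    kd = sum(m * math.log(m / s) for m, s in zip(target, student_probs) if m > 0.0)
    ce = -math.log(student_probs[gold_index])
    return w * kd + (1.0 - w) * ce


# Two teachers that agree, and two that disagree, over a toy 3-token vocabulary.
agreeing = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
disagreeing = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
student = [0.7, 0.2, 0.1]
```

Under this rule the agreeing teachers receive a larger weight than the disagreeing ones, so disputed tokens fall back towards the gold label.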

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

2604.03174v1 by Prakhar Bansal, Shivangi Agarwal

Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

2604.03141v1 by Nazanin Jafari, James Allan, Mohit Iyyer

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
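
To make the precision/recall distinction concrete, here is a small sketch of how the two scores might be computed once claims have been verified and reference facts matched. The data layout, the importance weights, and the function names are hypothetical; the abstract does not specify the exact weighting scheme.

```python
def claim_precision(claim_supported):
    """Fraction of generated atomic claims that are supported by the sources."""
    return sum(1 for ok in claim_supported if ok) / len(claim_supported)


def weighted_recall(reference_facts):
    """Importance-weighted share of reference facts covered by the response.

    `reference_facts` is a list of (importance, covered) pairs. The weights
    stand in for a relevance/salience score and are an assumption here.
    """
    total = sum(weight for weight, _ in reference_facts)
    covered = sum(weight for weight, is_covered in reference_facts if is_covered)
    return covered / total


# Four generated claims, three of them supported.
claims = [True, True, True, False]

# Three reference facts: the response covers only the most important one.
facts = [(3.0, True), (1.0, False), (1.0, False)]

precision = claim_precision(claims)
recall_weighted = weighted_recall(facts)
recall_unweighted = weighted_recall([(1.0, covered) for _, covered in facts])
```

In this toy case precision is higher than either recall figure, and weighted recall exceeds unweighted recall because the single covered fact carries the largest weight.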

Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

2604.03110v1 by Zihe Liu, Yulong Mao, Jinan Xu, Xinrui Peng, Kaiyu Huang

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the alignment process. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge information at different aspects. Experimental results demonstrate that MaKD can achieve competitive performance compared with various strong baselines with the same storage parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.
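
The title mentions low-rank factorization and the abstract compares methods under the same storage parameter budget. As a rough illustration of how such a budget can be counted (this is not the MaKD procedure, and the layer sizes below are assumed), a dense weight matrix of shape rows x cols can be replaced by two factors of rank r that together store r * (rows + cols) values.

```python
def dense_params(rows, cols):
    """Number of stored values in a full rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Stored values when W is approximated by A (rows x rank) @ B (rank x cols)."""
    return rank * (rows + cols)


def max_rank_within_budget(rows, cols, budget):
    """Largest rank whose two factors fit into `budget` stored values."""
    return budget // (rows + cols)


# Hypothetical feed-forward projection size, chosen only for the example.
ROWS, COLS = 768, 3072
full = dense_params(ROWS, COLS)
factored = low_rank_params(ROWS, COLS, rank=64)
```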

AlertStar: Path-Aware Alert Prediction on Hyper-Relational Knowledge Graphs

2604.03104v1 by Zahra Makki Nayeri, Mohsen Rezvani

Cyber-attacks continue to grow in scale and sophistication, yet existing network intrusion detection approaches lack the semantic depth required for path reasoning over attacker-victim interactions. We address this by first modelling network alerts as a knowledge graph and then formulating hyper-relational alert prediction as a hyper-relational knowledge graph completion (HR-KGC) problem. Each network alert is represented as a qualified statement (h, r, t, Q), where h and t are source and destination IPs, r denotes the attack type, and Q encodes flow-level metadata such as timestamps, ports, protocols, and attack intensity. This goes beyond the binary triples (h, r, t) of standard KGC, which would discard this contextual richness. We introduce five models across three contributions. First, Hyper-relational Neural Bellman-Ford (HR-NBFNet) extends Neural Bellman-Ford Networks to the hyper-relational setting with qualifier-aware multi-hop path reasoning, while its multi-task variant MT-HR-NBFNet jointly predicts tail, relation, and qualifier-value within a single traversal pass. Second, AlertStar fuses qualifier context and structural path information entirely in embedding space via cross-attention and learned path composition, and its multi-task extension MT-AlertStar eliminates the overhead of full knowledge graph propagation. Third, HR-NBFNet-CQ extends qualifier-aware representations to answer complex first-order logic queries, including one-hop, two-hop chain, two-anchor intersection, and union, enabling multi-condition threat reasoning over the alert knowledge graph. Evaluated inductively on the Warden and UNSW-NB15 benchmarks across three qualifier-density regimes, AlertStar and MT-AlertStar achieve superior MR, MRR, and Hits@k, demonstrating that local qualifier fusion is both sufficient and more efficient than global path propagation for hyper-relational alert prediction.
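
The qualified statement (h, r, t, Q) can be pictured as a plain data structure. The sketch below is an assumption about one possible encoding, not the paper's data format: the class name, the qualifier keys, and the example addresses (taken from the documentation-only IP ranges) are all hypothetical. It shows why collapsing alerts to binary triples loses information.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QualifiedStatement:
    """One alert as (h, r, t, Q); qualifiers are (key, value) pairs."""
    head: str       # source IP
    relation: str   # attack type
    tail: str       # destination IP
    qualifiers: Tuple[Tuple[str, str], ...] = ()

    def as_triple(self):
        """Drop Q and keep only the standard KGC triple (h, r, t)."""
        return (self.head, self.relation, self.tail)


# Two alerts between the same hosts that differ only in flow-level metadata.
alert_a = QualifiedStatement(
    "192.0.2.10", "port_scan", "198.51.100.7",
    (("port", "22"), ("protocol", "tcp")),
)
alert_b = QualifiedStatement(
    "192.0.2.10", "port_scan", "198.51.100.7",
    (("port", "443"), ("protocol", "tcp")),
)

distinct_statements = {alert_a, alert_b}
distinct_triples = {alert_a.as_triple(), alert_b.as_triple()}
```

The two alerts remain distinct as qualified statements but become indistinguishable once reduced to triples.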

Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach

2604.03043v1 by Jawad Mohammed, Gahangir Hossain

In healthcare environments, interoperability platforms based on HL7 FHIR allow independent systems (EHR systems, pharmacy systems, lab systems, and devices) to access a shared set of patient resources concurrently and asynchronously. The FHIR specification lacks a concurrency-control protocol, and existing research on race-condition detection targets only the OS kernel. Research on FHIR security, in turn, targets only authentication and injection attacks and treats concurrent access to patient resources as sequential. This gap is addressed by introducing the FHIR Resource Access Graph (FRAG), a formally defined graph G = (P, R, E, λ, τ, S) in which the nodes are the concurrent processes, the typed edges represent resource access events, and race conditions are represented as detectable structural properties. Three clinically relevant race condition classes are formally specified: Simultaneous Write Conflict (SWC), TOCTOU Authorization Violation (TAV), and Cascading Update Race (CUR). The FRAG model is implemented as a three-pass graph traversal detection algorithm and tested against a time-window-based baseline on 1,500 synthetic FHIR R4 transaction logs. Under full concurrent access (C2), FRAG attains a 90.0% F1 score vs. 25.5% for the baseline, a 64.5 percentage-point improvement.
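
As an illustration of what a Simultaneous Write Conflict might look like in an access log, here is a minimal sketch. It is not the paper's three-pass algorithm: the `Access` record, the interval-overlap test, and the example log are assumptions chosen to keep the idea small.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Access:
    """One typed access edge from a process to a resource (hypothetical layout)."""
    process: str
    resource: str
    op: str        # "read" or "write"
    start: float   # time the access begins
    end: float     # time the access completes


def simultaneous_write_conflicts(log):
    """Return pairs of writes by different processes to the same resource
    whose time intervals overlap."""
    writes = [a for a in log if a.op == "write"]
    conflicts = []
    for a, b in combinations(writes, 2):
        same_resource = a.resource == b.resource
        different_process = a.process != b.process
        overlapping = a.start < b.end and b.start < a.end
        if same_resource and different_process and overlapping:
            conflicts.append((a, b))
    return conflicts


log = [
    Access("ehr", "Patient/1", "write", 0.0, 2.0),
    Access("pharmacy", "Patient/1", "write", 1.0, 3.0),
    Access("lab", "Patient/2", "write", 1.0, 2.0),
    Access("ehr", "Patient/1", "read", 0.5, 0.6),
]
conflicts = simultaneous_write_conflicts(log)
```

Only the overlapping EHR and pharmacy writes to the same patient resource are flagged; the lab write touches a different resource and the read is ignored.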

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

2604.03016v1 by Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints, with manual annotation averaging 10+ person-hours per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.

FedSQ: Optimized Weight Averaging via Fixed Gating

2604.02990v1 by Cristian Pérez-Corral, Jose I. Mestre, Alberto Fernández-Hernández, Manuel F. Dolz, José Duato, Enrique S. Quintana-Ortí

Federated learning (FL) enables collaborative training across organizations without sharing raw data, but it is hindered by statistical heterogeneity (non-i.i.d. client data) and by instability of naive weight averaging under client drift. In many cross-silo deployments, FL is warm-started from a strong pretrained backbone (e.g., ImageNet-1K) and then adapted to local domains. Motivated by recent evidence that ReLU-like gating regimes (structural knowledge) stabilize earlier than the remaining parameter values (quantitative knowledge), we propose FedSQ (Federated Structural-Quantitative learning), a transfer-initialized neural federated procedure based on a DualCopy, piecewise-linear view of deep networks. FedSQ freezes a structural copy of the pretrained model to induce fixed binary gating masks during federated fine-tuning, while only a quantitative copy is optimized locally and aggregated across rounds. Fixing the gating reduces learning to within-regime affine refinements, which stabilizes aggregation under heterogeneous partitions. Experiments on two convolutional neural network backbones under i.i.d. and Dirichlet splits show that FedSQ improves robustness and can reduce rounds-to-best validation performance relative to standard baselines while preserving accuracy in the transfer setting.
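
The fixed-gating idea can be sketched for a single ReLU layer. In this assumed toy version (the function names and numbers are illustrative, not taken from the paper), a frozen structural copy decides which units are active, and a separate quantitative copy supplies the values. Because the mask no longer depends on the trainable weights, the layer output is affine in those weights, so averaging two clients' quantitative weights gives the same result as averaging their outputs.

```python
def affine(W, b, x):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gated_layer(W_struct, b_struct, W_quant, b_quant, x):
    """ReLU-like layer whose on/off pattern comes from the frozen structural copy."""
    gates = [1.0 if z > 0.0 else 0.0 for z in affine(W_struct, b_struct, x)]
    values = affine(W_quant, b_quant, x)
    return [g * v for g, v in zip(gates, values)]


def mean_matrix(A, B):
    return [[(a + b) / 2.0 for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mean_vector(u, v):
    return [(a + b) / 2.0 for a, b in zip(u, v)]


# Frozen structural copy (shared by all clients) and one input.
W_struct, b_struct = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
x = [2.0, 1.0]

# Quantitative copies after local training on two hypothetical clients.
W_a, b_a = [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]
W_b, b_b = [[3.0, 0.0], [1.0, 2.0]], [1.0, 1.0]

out_a = gated_layer(W_struct, b_struct, W_a, b_a, x)
out_b = gated_layer(W_struct, b_struct, W_b, b_b, x)
out_merged = gated_layer(
    W_struct, b_struct, mean_matrix(W_a, W_b), mean_vector(b_a, b_b), x
)
```

With an ordinary ReLU the gates would depend on each client's own weights, and this equality between averaged weights and averaged outputs would generally not hold. The single-layer case shown here is the simplest instance of the within-regime affine behaviour the abstract describes.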

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

2604.02954v1 by Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang

Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose LogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, LogicPoison employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that LogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at https://github.com/Jord8061/logicPoison.

Analysis of Optimality of Large Language Models on Planning Problems

2604.02910v1 by Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.
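
To give a feel for the graph task, the sketch below builds a simple path-star graph (a centre node with several arms, each a simple path) and finds an optimal route with breadth-first search. The construction is an assumption: the abstract does not define the generalized variant, and mapping the number of arms to width and the arm length to depth is only an analogy with the Blocksworld parameters.

```python
from collections import deque


def path_star(num_arms, arm_length):
    """Build an undirected path-star graph; node 0 is the centre.

    Returns (adjacency dict, list of leaf nodes). Requires arm_length >= 1.
    """
    graph = {0: []}
    leaves = []
    next_id = 1
    for _ in range(num_arms):
        prev = 0
        for _ in range(arm_length):
            graph[next_id] = [prev]
            graph[prev].append(next_id)
            prev = next_id
            next_id += 1
        leaves.append(prev)
    return graph, leaves


def shortest_path(graph, start, goal):
    """Breadth-first search; returns an optimal node sequence or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


graph, leaves = path_star(num_arms=3, arm_length=4)
centre_to_leaf = shortest_path(graph, 0, leaves[1])
leaf_to_leaf = shortest_path(graph, leaves[0], leaves[2])
```

The optimal plan length is the number of edges on the returned path, which gives a reference against which a model's proposed plan could be compared.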

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

2604.02869v1 by Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn

Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.

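Summary: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging because outcome rewards are sparse and credit assignment across conversation turns is difficult. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) to train a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we find that naively designed dense per-turn rewards reduce performance by up to 14 percentage points, owing to a misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology that designs per-turn rewards from an empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. On the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8% to 66.7% (+2.9 percentage points) and Qwen3-30B-A3B from 58.0% to 69.5% (+11.5 percentage points); the trained 4B model exceeds GPT-4.1 (49.4%) and GPT-4o (42.8%) despite being 50 times smaller, and the 30.5B MoE model approaches Claude Sonnet 4.5 (70.0%). To our knowledge, these are the first published reinforcement learning training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.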
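As background for the "group relative" part of the method name, the sketch below shows the basic group-relative advantage used in GRPO-style training: each rollout's reward is standardized against the other rollouts sampled for the same prompt. This is a generic illustration, not the authors' MT-GRPO or GTPO hybrid formulation, which the abstract does not specify. The function name, the use of the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout reward against its own sampling group.

    rewards: outcome rewards of several rollouts for the same task.
    eps: small constant (assumed) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two successful and two failed rollouts of the same task.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout failed carries no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second call illustrates the sparse-reward problem: when every rollout in a group receives the same outcome reward, all advantages are zero. That is one reason for adding per-turn rewards.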

LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction

2604.02866v1 by Luc Pommeret, Thomas Gerald, Patrick Paroubek, Sahar Ghannay, Christophe Servan, Sophie Rosset

Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate whether decomposing text into atomic propositions (minimal, semantically autonomous units of information) can improve triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.

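Summary: Building a knowledge graph requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate whether decomposing text into atomic propositions (minimal, semantically autonomous units of information) improves triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages, trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED, and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers the losses in entity recall while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.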

LLM+Graph@VLDB'2025 Workshop Summary

2604.02861v1 by Yixiang Fang, Arijit Khan, Tianxing Wu, Da Yan, Shu Wang

The integration of large language models (LLMs) with graph-structured data has become a pivotal and fast-evolving research frontier, drawing strong interest from both academia and industry. The 2nd LLM+Graph Workshop, co-located with the 51st International Conference on Very Large Data Bases (VLDB 2025) in London, focused on advancing algorithms and systems that bridge LLMs, graph data management, and graph machine learning for practical applications. This report highlights the key research directions, challenges, and innovative solutions presented by the workshop's speakers.

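Summary: The integration of large language models (LLMs) with graph-structured data has become a pivotal and rapidly developing research frontier, attracting strong interest from both academia and industry. The 2nd LLM+Graph Workshop, held together with the 51st International Conference on Very Large Data Bases (VLDB 2025) in London, focused on advancing algorithms and systems that bridge LLMs, graph data management, and graph machine learning for practical applications. This report highlights the key research directions, challenges, and innovative solutions presented by the workshop's speakers.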

GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics

2604.02830v1 by Yujing Wang, Yuanbang Liang, Yukun Lai, Hainan Zhang, Hanqi Yan

Detecting whether a model's internal knowledge is sufficient to correctly answer a given question is a fundamental challenge in deploying responsible LLMs. Beyond eliciting verbalised confidence through LLM self-report, more recent methods inspect model internals, such as the hidden states of the response tokens, to capture how much knowledge is activated. We argue that such activated knowledge may not align with what the query requires, e.g., it may capture stylistic and length-related features that are uninformative for answering the query. To fill the gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap via the cross-layer rank ratio of the gradient to that of the corresponding hidden state subspace. This is motivated by the property of gradients as estimators of the required knowledge updates for a given target. We validate GRADE on six benchmarks, demonstrating its effectiveness and robustness to input perturbations. In addition, we present a case study showing how the gradient chain can generate interpretable explanations of knowledge gaps for long-form answers.

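Summary: Detecting whether a model's internal knowledge is sufficient to answer a given question correctly is a fundamental challenge in deploying responsible large language models (LLMs). Besides expressing confidence through LLM self-report, recent methods also explore model internals, such as the hidden states of the response tokens, to capture how much knowledge is activated. We argue that this activated knowledge may not match what the query requires, for example when it captures stylistic and length-related features that carry no information for answering the query. To fill this gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap through the cross-layer rank ratio of the gradient to that of the corresponding hidden-state subspace. The approach is motivated by the property that gradients act as estimators of the knowledge updates required for a given target. We validate GRADE on six benchmarks, demonstrating its effectiveness and its robustness to input perturbations. In addition, we present a case study showing how the gradient chain can produce interpretable explanations of knowledge gaps for long-form answers.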

Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection

2604.02819v1 by Chaoqun He, Yingfa Chen, Chaojun Xiao, Xu Han, Lijie Wen

Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student's learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher's sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

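Summary: Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but transferring these reasoning processes directly to smaller models remains challenging. The main difficulty is that not every teacher-generated reasoning trajectory is suitable for the student to learn from. Existing methods typically rely on post-hoc filtering, selecting trajectories by heuristic criteria after they have been fully generated. Such methods, however, cannot control the generation process itself and may still produce reasoning paths beyond the student's learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs selection at generation time. Rather than passively consuming complete trajectories, the student evaluates candidate continuations during the teacher's sampling process, so that only learnable reasoning paths are expanded and unhelpful branches are pruned early. Experiments on mathematical reasoning benchmarks show that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, improving by about 5.9 points over Standard KD and by up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and more learnable reasoning trajectories, underscoring the importance of incorporating supervision during generation for effective distillation.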

QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

2604.02816v1 by Xinhao Wang, Zhonyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.

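Summary: Multimodal large language models (MLLMs) show strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. Although post-training quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that the two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that matter for numerical stability, and thereby worsen quantization error in low-bit regimes (e.g., W4A4). To address this problem, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that keeps only 12.5% of the visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.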
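The phrase "simulated group-wise quantization error" can be illustrated with a small sketch: quantize a vector in groups with a symmetric uniform quantizer, dequantize it, and measure the mean squared error. This is a generic illustration under stated assumptions, not the paper's metric. The group size, the symmetric 4-bit scheme, and the name `groupwise_quant_error` are hypothetical, and the abstract does not say how the error is combined with outlier intensity.

```python
def groupwise_quant_error(values, group_size=4, bits=4):
    """Mean squared error after simulated symmetric group-wise quantization.

    Each group of `group_size` values shares one scale chosen so that the
    largest magnitude in the group maps to the largest integer level.
    """
    if not values:
        return 0.0
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit levels in [-7, 7]
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax
        if scale == 0.0:
            continue  # an all-zero group is represented exactly
        for v in group:
            level = max(-qmax, min(qmax, round(v / scale)))
            total += (v - level * scale) ** 2
    return total / len(values)


# Values that already sit on the 4-bit grid quantize without error.
exact = groupwise_quant_error([7.0, -7.0, 0.0, 1.0])

# One large outlier inflates the shared scale and crushes the small values.
with_outlier = groupwise_quant_error([0.1, 0.2, 0.3, 100.0])
without_outlier = groupwise_quant_error([0.1, 0.2, 0.3, 0.3])
```

The last two calls show why outliers and quantization error interact: a single large activation sets the scale for its whole group.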

When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs

2604.02778v1 by Linyu Li, Zhi Jin, Yichi Zhang, Dongming Jin, Yuanpeng He, Haoran Duan, Gadeng Luosang, Nyima Tashi

Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

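Summary: Real-world multimodal knowledge graphs (MMKGs) are dynamic: new entities, relations, and multimodal knowledge keep emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit the multimodal signals of new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, in turn, usually assume static graphs and suffer catastrophic forgetting as the graph evolves. To address this problem, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG uses a multimodal-structural collaborative curriculum that schedules progressive learning according to the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism that mitigates forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge through multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.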

Analytic Drift Resister for Non-Exemplar Continual Graph Learning

2604.02633v1 by Lei Song, Shihan Guan, Youyong Kong

Non-Exemplar Continual Graph Learning (NECGL) seeks to eliminate the privacy risks intrinsic to rehearsal-based paradigms by retaining solely class-level prototype representations rather than raw graph examples for mitigating catastrophic forgetting. However, this design choice inevitably precipitates feature drift. As a nascent alternative, Analytic Continual Learning (ACL) capitalizes on the intrinsic generalization properties of frozen pre-trained models to bolster continual learning performance. Nonetheless, a key drawback resides in the pronounced attenuation of model plasticity. To surmount these challenges, we propose Analytic Drift Resister (ADR), a novel and theoretically grounded NECGL framework. ADR exploits iterative backpropagation to break free from the frozen pre-trained constraint, adapting to evolving task graph distributions and fortifying model plasticity. Since parameter updates trigger feature drift, we further propose Hierarchical Analytic Merging (HAM), performing layer-wise merging of linear transformations in Graph Neural Networks (GNNs) via ridge regression, thereby ensuring absolute resistance to feature drift. On this basis, Analytic Classifier Reconstruction (ACR) enables theoretically zero-forgetting class-incremental learning. Empirical evaluation on four node classification benchmarks demonstrates that ADR maintains strong competitiveness against existing state-of-the-art methods.

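Summary: Non-Exemplar Continual Graph Learning (NECGL) aims to eliminate the privacy risks inherent in rehearsal-based paradigms by retaining only class-level prototype representations, rather than raw graph examples, to mitigate catastrophic forgetting. This design choice, however, inevitably causes feature drift. As an emerging alternative, Analytic Continual Learning (ACL) exploits the intrinsic generalization properties of frozen pre-trained models to improve continual learning performance. Its main drawback is a marked loss of model plasticity. To overcome these challenges, we propose Analytic Drift Resister (ADR), a novel and theoretically grounded NECGL framework. ADR uses iterative backpropagation to break free of the frozen pre-trained constraint, adapting to evolving task graph distributions and strengthening model plasticity. Because parameter updates trigger feature drift, we further propose Hierarchical Analytic Merging (HAM), which merges the linear transformations in Graph Neural Networks (GNNs) layer by layer via ridge regression, thereby ensuring absolute resistance to feature drift. On this basis, Analytic Classifier Reconstruction (ACR) achieves theoretically zero-forgetting class-incremental learning. Empirical evaluation on four node classification benchmarks shows that ADR remains strongly competitive with existing state-of-the-art methods.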
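For readers unfamiliar with the analytic step, the standard closed-form ridge regression solution is shown below. It is textbook background, not the paper's exact HAM formulation, which the abstract does not give. The symbols are assumptions: $X$ is a feature matrix, $Y$ a target matrix, $\lambda > 0$ a regularization weight, and $W$ the linear map being fitted.

```latex
W^{*} \;=\; \arg\min_{W} \; \lVert XW - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2}
      \;=\; \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y
```

Because $X^{\top} X + \lambda I$ is positive definite for any $\lambda > 0$, the inverse always exists and the solution is unique, so it can be computed in one step without gradient descent.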

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

2604.02621v1 by Yiyang Shen, Lifu Tu, Weiran Wang

Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

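Summary: Reinforcement learning (RL) has been shown to substantially improve the reasoning ability of small and large language models (LLMs), but existing methods usually depend on verifiable rewards and therefore require ground-truth labels. We propose an RL framework that draws its rewards from an LLM acting as a judge, which evaluates model outputs over large amounts of unlabeled data; this enables label-free knowledge distillation and removes the need for ground-truth labels. Notably, the judge produces a single-token output, which makes reward computation efficient. When combined with verifiable rewards, our method yields substantial performance gains on math reasoning benchmarks. These results suggest that LLM-based evaluators can provide effective training signals for RL fine-tuning.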

OntoKG: Ontology-Oriented Knowledge Graph Construction with Intrinsic-Relational Routing

2604.02618v1 by Yitao Li, Zhanlin Liu, Anuranjan Pandey, Muni Srikanth

Organizing a large-scale knowledge graph into a typed property graph requires structural decisions -- which entities become nodes, which properties become edges, and what schema governs these choices. Existing approaches embed these decisions in pipeline code or extract relations ad hoc, producing schemas that are tightly coupled to their construction process and difficult to reuse for downstream ontology-level tasks. We present an ontology-oriented approach in which the schema is designed from the outset for ontology analysis, entity disambiguation, domain customization, and LLM-guided extraction -- not merely as a byproduct of graph building. The core mechanism is intrinsic-relational routing, which classifies every property as either intrinsic or relational and routes it to the corresponding schema module. This routing produces a declarative schema that is portable across storage backends and independently reusable. We instantiate the approach on the January 2026 Wikidata dump. A rule-based cleaning stage identifies a 34.6M-entity core set from the full dump, followed by iterative intrinsic-relational routing that assigns each property to one of 94 modules organized into 8 categories. With tool-augmented LLM support and human review, the schema reaches 93.3% category coverage and 98.0% module assignment among classified entities. Exporting this schema yields a property graph with 34.0M nodes and 61.2M edges across 38 relationship types. We validate the ontology-oriented claim through five applications that consume the schema independently of the construction pipeline: ontology structure analysis, benchmark annotation auditing, entity disambiguation, domain customization, and LLM-guided extraction.

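Summary: Organizing a large-scale knowledge graph into a typed property graph requires structural decisions: which entities become nodes, which properties become edges, and what schema governs these choices. Existing approaches embed these decisions in pipeline code or extract relations ad hoc, producing schemas that are tightly coupled to their construction process and hard to reuse for downstream ontology-level tasks. We propose an ontology-oriented approach in which the schema is designed from the outset for ontology analysis, entity disambiguation, domain customization, and LLM-guided extraction, rather than arising merely as a byproduct of graph building. The core mechanism is intrinsic-relational routing, which classifies every property as either intrinsic or relational and routes it to the corresponding schema module. This routing yields a declarative schema that is portable across storage backends and independently reusable. We instantiate the approach on the January 2026 Wikidata dump. A rule-based cleaning stage identifies a core set of 34.6 million entities from the full dump, followed by iterative intrinsic-relational routing that assigns each property to one of 94 modules organized into 8 categories. With tool-augmented LLM support and human review, the schema reaches 93.3% category coverage and 98.0% module assignment among classified entities. Exporting this schema produces a property graph with 34.0 million nodes and 61.2 million edges across 38 relationship types. We validate the ontology-oriented claim through five applications that consume the schema independently of the construction pipeline: ontology structure analysis, benchmark annotation auditing, entity disambiguation, domain customization, and LLM-guided extraction.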

AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

2604.02617v1 by Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li

Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.

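Summary: Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across a rapidly growing literature, and existing approaches fail to close the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object) and builds knowledge graphs that support structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim: operated by analysts with no quantum expertise, the framework automatically identified overclaims and metric inconsistencies in the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.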
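The (Subject, Predicate, Object) decomposition and the cross-source verification layer can be pictured with a toy data structure. The sketch below is a simplified illustration, not AutoVerifier's implementation. The `Claim` class, the `find_contradictions` function, and the example claims are all hypothetical. Treating any two different objects for the same subject and predicate as a conflict is a deliberate simplification, since real predicates can be multi-valued.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One claim triple plus the document it was extracted from."""
    subject: str
    predicate: str
    obj: str
    source: str


def find_contradictions(claims):
    """Group claims by (subject, predicate) and report keys with differing objects."""
    by_key = defaultdict(set)
    for claim in claims:
        by_key[(claim.subject, claim.predicate)].add((claim.obj, claim.source))
    conflicts = {}
    for key, pairs in by_key.items():
        if len({obj for obj, _ in pairs}) > 1:
            conflicts[key] = sorted(pairs)
    return conflicts


# Hypothetical claims extracted from two made-up documents.
claims = [
    Claim("device-X", "qubit_count", "100", "paper-A"),
    Claim("device-X", "qubit_count", "60", "report-B"),
    Claim("device-X", "architecture", "superconducting", "paper-A"),
]
conflicts = find_contradictions(claims)
```

Keeping the source on every triple is what makes a flagged conflict traceable back to the documents that disagree.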

Competency Questions as Executable Plans: a Controlled RAG Architecture for Cultural Heritage Storytelling

2604.02545v1 by Naga Sowjanya Barla, Jacopo de Berardinis

The preservation of intangible cultural heritage is a critical challenge as collective memory fades over time. While Large Language Models (LLMs) offer a promising avenue for generating engaging narratives, their propensity for factual inaccuracies or "hallucinations" makes them unreliable for heritage applications where veracity is a central requirement. To address this, we propose a novel neuro-symbolic architecture grounded in Knowledge Graphs (KGs) that establishes a transparent "plan-retrieve-generate" workflow for story generation. A key novelty of our approach is the repurposing of competency questions (CQs) - traditionally design-time validation artifacts - into run-time executable narrative plans. This approach bridges the gap between high-level user personas and atomic knowledge retrieval, ensuring that generation is evidence-closed and fully auditable. We validate this architecture using a new resource: the Live Aid KG, a multimodal dataset aligning 1985 concert data with the Music Meta Ontology and linking to external multimedia assets. We present a systematic comparative evaluation of three distinct Retrieval-Augmented Generation (RAG) strategies over this graph: a purely symbolic KG-RAG, a text-enriched Hybrid-RAG, and a structure-aware Graph-RAG. Our experiments reveal a quantifiable trade-off between the factual precision of symbolic retrieval, the contextual richness of hybrid methods, and the narrative coherence of graph-based traversal. Our findings offer actionable insights for designing personalised and controllable storytelling systems.

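Summary: Preserving intangible cultural heritage is a critical challenge, because collective memory fades over time. Although large language models (LLMs) offer a promising way to generate engaging narratives, their tendency toward factual inaccuracies, or "hallucinations", makes them unreliable for heritage applications in which veracity is required. To address this, we propose a novel neuro-symbolic architecture grounded in knowledge graphs (KGs) that establishes a transparent "plan-retrieve-generate" workflow for story generation. A key novelty of our approach is that it repurposes competency questions (CQs), traditionally design-time validation artifacts, as run-time executable narrative plans. This bridges the gap between high-level user personas and atomic knowledge retrieval, ensuring that generation is evidence-closed and fully auditable. We validate the architecture with a new resource, the Live Aid KG, a multimodal dataset that aligns 1985 concert data with the Music Meta Ontology and links to external multimedia assets. We carry out a systematic comparative evaluation of three distinct Retrieval-Augmented Generation (RAG) strategies over this graph: a purely symbolic KG-RAG, a text-enriched Hybrid-RAG, and a structure-aware Graph-RAG. Our experiments reveal a quantifiable trade-off between the factual precision of symbolic retrieval, the contextual richness of hybrid methods, and the narrative coherence of graph-based traversal. Our findings offer actionable insights for designing personalised and controllable storytelling systems.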

Opal: Private Memory for Personal AI

2604.02522v1 by Darya Kaviani, Alp Eren Ozdarendeli, Jinhao Zhu, Yu Ding, Raluca Ada Popa

Personal AI systems increasingly retain long-term memory of user activity, including documents, emails, messages, meetings, and ambient recordings. Trusted hardware can keep this data private, but struggles to scale with a growing datastore. This pushes the data to external storage, which exposes retrieval access patterns that leak private information to the application provider. Oblivious RAM (ORAM) is a cryptographic primitive that can hide these patterns, but it requires a fixed access budget, precluding the query-dependent traversals that agentic memory systems rely on for accuracy. We present Opal, a private memory system for personal AI. Our key insight is to decouple all data-dependent reasoning from the bulk of personal data, confining it to the trusted enclave. Untrusted disk then sees only fixed, oblivious memory accesses. This enclave-resident component uses a lightweight knowledge graph to capture personal context that semantic search alone misses and handles continuous ingestion by piggybacking reindexing and capacity management on every ORAM access. Evaluated on a comprehensive synthetic personal-data pipeline driven by stochastic communication models, Opal improves retrieval accuracy by 13 percentage points over semantic search and achieves 29x higher throughput with 15x lower infrastructure cost than a secure baseline. Opal is under consideration for deployment to millions of users at a major AI provider.

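Summary: Personal AI systems increasingly retain long-term memory of user activity, including documents, emails, messages, meetings, and ambient recordings. Trusted hardware can keep this data private, but it struggles to scale as the datastore grows. The data is therefore pushed to external storage, which exposes retrieval access patterns that leak private information to the application provider. Oblivious RAM (ORAM) is a cryptographic primitive that can hide these patterns, but it requires a fixed access budget, which rules out the query-dependent traversals that agentic memory systems rely on for accuracy. We present Opal, a private memory system for personal AI. Our key insight is to decouple all data-dependent reasoning from the bulk of the personal data and confine it to the trusted enclave. The untrusted disk then sees only fixed, oblivious memory accesses. The enclave-resident component uses a lightweight knowledge graph to capture personal context that semantic search alone misses, and it handles continuous ingestion by piggybacking reindexing and capacity management on every ORAM access. Evaluated on a comprehensive synthetic personal-data pipeline driven by stochastic communication models, Opal improves retrieval accuracy by 13 percentage points over semantic search and achieves 29x higher throughput at 15x lower infrastructure cost than a secure baseline. Opal is under consideration for deployment to millions of users at a major AI provider.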

2604.02512v1 by Roland Mühlenbernd

Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

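Summary: Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning quantitatively as well as qualitatively, and can prompting strategies informed by pragmatic theory improve that approximation? To address the first question, we introduce two calibration-focused metrics that distinguish structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer the speaker's knowledge state and communicative motives. In a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about the speaker's knowledge and motives reduces magnitude deviation most consistently, whereas prompting for awareness of alternatives tends to amplify exaggeration. Combining the two components is the only intervention that improves all calibration-sensitive metrics across all models, although fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while distorting inferential strength to varying degrees, and pragmatic theory provides a useful but incomplete handle for improving the approximation.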

A Comprehensive Framework for Long-Term Resiliency Investment Planning under Extreme Weather Uncertainty for Electric Utilities

2604.02504v1 by Emma Benjaminson

Electric utilities must make massive capital investments in the coming years to respond to explosive growth in demand, aging assets and rising threats from extreme weather. Utilities today already have rigorous frameworks for capital planning, and there are opportunities to extend this capability to solve multi-objective optimization problems in the face of uncertainty. This work presents a four-part framework that 1) incorporates extreme weather as a source of uncertainty, 2) leverages a digital twin of the grid, 3) uses Monte Carlo simulation to capture variability and 4) applies a multi-objective optimization method for finding the optimal investment portfolio. We use this framework to investigate whether grid-aware optimization methods outperform model-free approaches. We find that, in fact, given the computational complexity of model-based metaheuristic optimization methods, the simpler net present value ranking method was able to find more optimal portfolios with only limited knowledge of the grid.

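Summary: Electric utilities must make massive capital investments in the coming years to respond to explosive growth in demand, aging assets, and rising threats from extreme weather. Utilities today already have rigorous capital planning frameworks, and there is an opportunity to extend this capability to multi-objective optimization problems under uncertainty. This work presents a four-part framework that 1) incorporates extreme weather as a source of uncertainty, 2) leverages a digital twin of the grid, 3) uses Monte Carlo simulation to capture variability, and 4) applies a multi-objective optimization method to find the optimal investment portfolio. We use the framework to investigate whether grid-aware optimization methods outperform model-free approaches. We find that, given the computational complexity of model-based metaheuristic optimization methods, the simpler net present value ranking method in fact found better portfolios with only limited knowledge of the grid.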
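The "net present value ranking" baseline mentioned above can be sketched in a few lines: discount each candidate investment's cash flows, sort by NPV, and fund candidates in that order until the budget runs out. This is a generic illustration with made-up numbers, not the paper's implementation. The function names, the project dictionary layout, the discount rate, and the greedy budget rule are assumptions.

```python
def npv(cash_flows, rate):
    """Net present value; cash_flows[t] is the net cash flow in year t (t = 0 is today)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


def rank_and_select(projects, rate, budget):
    """Fund positive-NPV projects in descending NPV order while the budget allows."""
    ranked = sorted(projects, key=lambda p: npv(p["cash_flows"], rate), reverse=True)
    chosen, spent = [], 0.0
    for project in ranked:
        value = npv(project["cash_flows"], rate)
        if value > 0 and spent + project["cost"] <= budget:
            chosen.append(project["name"])
            spent += project["cost"]
    return chosen


# Hypothetical candidate investments (costs and cash flows are invented).
projects = [
    {"name": "A", "cost": 100.0, "cash_flows": [-100.0, 60.0, 60.0]},
    {"name": "B", "cost": 50.0, "cash_flows": [-50.0, 30.0, 30.0]},
    {"name": "C", "cost": 80.0, "cash_flows": [-80.0, 40.0, 40.0]},
]
portfolio = rank_and_select(projects, rate=0.05, budget=150.0)
```

The ranking needs only per-project cash-flow estimates and no model of how investments interact on the grid, which matches the abstract's description of it as the simpler method.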

Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview

2604.02448v1 by Shramana Dey, Zahir Khan, T. A. PramodKumar, B. Uma Shankar, Ashis K. Dhara, Ramachandran Rajalakshmi, Rajiv Raman, Sushmita Mitra

Diabetic Retinopathy (DR) is a serious microvascular complication of diabetes, and one of the leading causes of vision loss worldwide. Although automated detection and grading with Deep Learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories often remain geographically narrow, contain limited samples, and exhibit inconsistent annotations or variable image quality, thereby restricting their clinical reliability. This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of DR. The study evaluates their usability across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study to illustrate broader challenges in dataset curation and usage. The review consolidates current knowledge while highlighting persistent gaps such as the lack of standardized lesion-level annotations and longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.

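Summary: Diabetic retinopathy (DR) is a serious microvascular complication of diabetes and one of the leading causes of vision loss worldwide. Although automated detection and grading with deep learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories are often geographically narrow, contain few samples, and show inconsistent annotations or variable image quality, which limits their clinical reliability. This paper provides a comprehensive review and comparative analysis of the fundus image datasets used in the management of DR. The study evaluates the usability of these datasets across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study that illustrates broader challenges in dataset curation and use. The review consolidates current knowledge while highlighting persistent gaps, such as the lack of standardized lesion-level annotations and of longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.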

Self-Directed Task Identification

2604.02430v1 by Timothy Gould, Sidike Paheding

In this work, we present a novel machine learning framework called Self-Directed Task Identification (SDTI), which enables models to autonomously identify the correct target variable for each dataset in a zero-shot setting without pre-training. SDTI is a minimal, interpretable framework demonstrating the feasibility of repurposing core machine learning concepts for a novel task structure. To our knowledge, no existing architectures have demonstrated this ability. Traditional approaches lack this capability, leaving data annotation as a time-consuming process that relies heavily on human effort. Using only standard neural network components, we show that SDTI can be achieved through appropriate problem formulation and architectural design. We evaluate the proposed framework on a range of benchmark tasks and demonstrate its effectiveness in reliably identifying the ground truth out of a set of potential target variables. SDTI outperformed baseline architectures by 14% in F1 score on synthetic task identification benchmarks. These proof-of-concept experiments highlight the future potential of SDTI to reduce dependence on manual annotation and to enhance the scalability of autonomous learning systems in real-world applications.

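Summary: In this work, we present a novel machine learning framework called Self-Directed Task Identification (SDTI), which enables models to autonomously identify the correct target variable for each dataset in a zero-shot setting, without pre-training. SDTI is a minimal, interpretable framework that demonstrates the feasibility of repurposing core machine learning concepts for a new task structure. To our knowledge, no existing architecture has demonstrated this ability. Traditional approaches lack this capability, which leaves data annotation a time-consuming process that depends heavily on human effort. Using only standard neural network components, we show that SDTI can be achieved through appropriate problem formulation and architectural design. We evaluate the proposed framework on a range of benchmark tasks and demonstrate its effectiveness in reliably identifying the ground truth among a set of potential target variables. On synthetic task identification benchmarks, SDTI outperformed baseline architectures by 14% in F1 score. These proof-of-concept experiments highlight SDTI's future potential to reduce dependence on manual annotation and to improve the scalability of autonomous learning systems in real-world applications.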

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

2604.02324v1 by Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

Summary: Mean initialization of new vocabulary tokens (for example, Semantic-ID tokens in generative recommendation) collapses them into a degenerate subspace that fine-tuning struggles to fully recover. GTI adds a lightweight grounding stage before fine-tuning that uses only paired linguistic supervision to map new tokens to distinct, semantically meaningful locations in the pretrained embedding space. It outperforms mean initialization and auxiliary-task adaptation methods in the majority of evaluation settings.
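
The collapse described in the abstract can be illustrated with a toy sketch. This is not the paper's spectral or geometric diagnostic, and it is not GTI itself; the random embedding table, its size, and the helper names `mean_init` and `max_pairwise_distance` are assumptions made purely for illustration. The only point is that mean initialization hands every new token the same vector, so fine-tuning starts with no inter-token distinction at all.

```python
import random


def mean_init(existing, n_new):
    """Initialize every new token as the mean of the existing embeddings."""
    dim = len(existing[0])
    mean = [sum(row[d] for row in existing) / len(existing) for d in range(dim)]
    return [list(mean) for _ in range(n_new)]


def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two vectors in the list."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
            best = max(best, dist)
    return best


# Hypothetical pretrained table: 100 tokens with 8-dimensional embeddings.
rng = random.Random(0)
existing = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]

new_tokens = mean_init(existing, n_new=5)
spread_new = max_pairwise_distance(new_tokens)    # identical vectors
spread_old = max_pairwise_distance(existing[:5])  # pretrained tokens differ
```

Any initialization that places new tokens at distinct locations would give a non-zero spread here; how GTI chooses those locations is described only at the level of the abstract above.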

LumiVideo: An Intelligent Agentic System for Video Color Grading

2604.02409v1 by Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

Video color grading is a critical post-production process that transforms flat, log-encoded raw footage into emotionally resonant cinematic visuals. Existing automated methods act as static, black-box executors that directly output edited pixels, lacking both interpretability and the iterative control required by professionals. We introduce LumiVideo, an agentic system that mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, LumiVideo autonomously produces a cinematic base grade by analyzing the scene's physical lighting and semantic content. Its Reasoning engine synergizes an LLM's internalized cinematic knowledge with a Retrieval-Augmented Generation (RAG) framework via a Tree of Thoughts (ToT) search to navigate the non-linear color parameter space. Rather than generating pixels, the system compiles the deduced parameters into industry-standard ASC-CDL configurations and a globally consistent 3D LUT, analytically guaranteeing temporal consistency. An optional Reflection loop then allows creators to refine the result via natural language feedback. We further introduce LumiGrade, the first log-encoded video benchmark for evaluating automated grading. Experiments show that LumiVideo approaches human expert quality in fully automatic mode while enabling precise iterative control when directed.

Summary: LumiVideo is an agentic system for video color grading with four stages: Perception, Reasoning, Execution, and Reflection. It combines an LLM's cinematic knowledge with RAG through a Tree of Thoughts search, outputs ASC-CDL configurations and a globally consistent 3D LUT rather than pixels, and supports refinement through natural-language feedback. The authors also introduce LumiGrade, a log-encoded video benchmark for evaluating automated grading.
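
To make the "parameters rather than pixels" idea concrete, here is a minimal sketch of applying one ASC-CDL correction to a single RGB value. It follows the commonly published form of the ASC CDL transfer (slope, offset, power, then saturation with Rec. 709 luma weights) as an assumption; it is not LumiVideo's code, and the parameter values in the usage lines are hypothetical. Because the same small parameter set is applied to every frame, the transform is identical across time by construction.

```python
def apply_cdl(rgb, slope, offset, power, saturation=1.0):
    """Apply slope/offset/power per channel, then a saturation adjustment."""
    sop = []
    for value, s, o, p in zip(rgb, slope, offset, power):
        # Clamp before the power step so the base is never negative.
        v = min(max(value * s + o, 0.0), 1.0)
        sop.append(v ** p)
    # Rec. 709 luma weights (assumed here, as in the usual CDL description).
    luma = 0.2126 * sop[0] + 0.7152 * sop[1] + 0.0722 * sop[2]
    return tuple(min(max(luma + saturation * (c - luma), 0.0), 1.0) for c in sop)


pixel = (0.2, 0.5, 0.8)
identity = apply_cdl(pixel, (1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
# Hypothetical warm grade: lift red slightly, lower blue, mild contrast.
graded = apply_cdl(pixel, (1.1, 1.0, 0.9), (0.02, 0.0, -0.02), (1.2, 1.2, 1.2))
gray = apply_cdl(pixel, (1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0),
                 saturation=0.0)
```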

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

2604.02270v1 by Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent

Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.

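Summary: Crystalite is a lightweight diffusion Transformer for crystal modeling. It combines Subatomic Tokenization, a compact chemically structured atom representation, with the Geometry Enhancement Module (GEM), which adds periodic minimum-image pair geometry to attention as additive biases. It reports state-of-the-art results on crystal structure prediction and the best S.U.N. discovery score among the evaluated baselines, while sampling substantially faster than geometry-heavy alternatives.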
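
The "periodic minimum-image pair geometry" mentioned above can be sketched as follows. This is a simplified illustration that assumes fractional coordinates and an orthorhombic cell (axis-aligned edges), where wrapping each fractional difference into [-0.5, 0.5] gives the nearest periodic image. The function name and the cell used in the example are assumptions, and how GEM turns such distances into attention biases is not shown here.

```python
import math


def minimum_image_distance(frac_a, frac_b, cell_lengths):
    """Distance between two atoms under periodic boundary conditions.

    frac_a, frac_b: fractional coordinates in [0, 1).
    cell_lengths:   edge lengths of an orthorhombic cell (assumption).
    """
    total = 0.0
    for a, b, length in zip(frac_a, frac_b, cell_lengths):
        delta = b - a
        delta -= round(delta)  # wrap the difference into [-0.5, 0.5]
        total += (delta * length) ** 2
    return math.sqrt(total)


# Two atoms near opposite faces of a 10 x 10 x 10 cell are neighbors
# through the periodic boundary, not 9 units apart.
d_wrapped = minimum_image_distance((0.05, 0.0, 0.0), (0.95, 0.0, 0.0),
                                   (10.0, 10.0, 10.0))
d_inside = minimum_image_distance((0.2, 0.0, 0.0), (0.4, 0.0, 0.0),
                                  (10.0, 10.0, 10.0))
```

For strongly skewed (non-orthorhombic) cells, rounding fractional differences is not guaranteed to find the true nearest image, so a more careful search over neighboring images would be needed.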

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

2604.02259v1 by Tina. J. Jat, T. Ghosh, Karthik Suresh

To harness the power of language models in answering specialized, domain-specific technical questions, Retrieval-Augmented Generation (RAG) has been widely used. In this work, we have developed a Q&A application inspired by RAG, comprising an in-house database that indexes arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, and an open-source LLaMA model for answer generation. This extends its preceding application, which was built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. The locally deployed RAG system offers a cost-effective alternative, suited to resource-constrained settings, for building a RAG-assisted Q&A application that answers domain-specific queries in experimental nuclear physics. This setup supports data privacy by avoiding sending any pre-publication scientific data or information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application's pipeline orchestration to the LangGraph framework.

Summary: A locally deployed RAG question-answering application for the Electron-Ion Collider (EIC) experiment, built on an in-house database indexing EIC-related arXiv articles and an open-source LLaMA model. It extends an earlier application that relied on a proprietary model and a cloud-hosted knowledge base, and it keeps pre-publication data out of the public domain. Planned work includes a broader knowledge base and migration of pipeline orchestration to LangGraph.

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.

Summary: ASK (Adaptive Safety through Knowledge) pairs smaller LMs with trained RL policies to improve OOD generalization without retraining. It estimates uncertainty with Monte Carlo Dropout and queries the LM for action suggestions only when uncertainty exceeds a set threshold. On FrozenLake, ASK shows no in-domain improvement but reaches a reward of 0.95 on transfer tasks.
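
The gating rule described above (run several dropout passes, measure uncertainty, ask the LM only above a threshold) might look roughly like the sketch below. Everything concrete in it is an assumption: the tiny linear policy, the use of predictive entropy as the uncertainty measure, the default of 30 samples, and the `ask_lm` callable standing in for a language-model query. The abstract does not specify these details.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_probs(weights, state, n_samples, drop_prob, rng):
    """Average action probabilities over several stochastic dropout passes."""
    mean = [0.0] * len(weights)
    for _ in range(n_samples):
        # Inverted dropout mask over the input features.
        keep = [0.0 if rng.random() < drop_prob else 1.0 / (1.0 - drop_prob)
                for _ in state]
        logits = [sum(w * x * k for w, x, k in zip(row, state, keep))
                  for row in weights]
        probs = softmax(logits)
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    return mean


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(weights, state, ask_lm, threshold,
                  n_samples=30, drop_prob=0.5, seed=0):
    """Use the policy when it is confident; otherwise defer to the LM."""
    rng = random.Random(seed)
    probs = mc_dropout_probs(weights, state, n_samples, drop_prob, rng)
    if entropy(probs) > threshold:
        return ask_lm(state), "lm"
    best = max(range(len(probs)), key=lambda a: probs[a])
    return best, "policy"


# Hypothetical two-action policy and a stub in place of a real LM call.
weights = [[10.0, 10.0], [-10.0, -10.0]]
state = [1.0, 1.0]
lm_stub = lambda s: 1

# Thresholds chosen only to force each branch: entropy of two actions
# never exceeds ln(2) < 1.0 and is never below 0.
confident = choose_action(weights, state, lm_stub, threshold=1.0)
uncertain = choose_action(weights, state, lm_stub, threshold=-1.0)
```

In practice the threshold would be tuned so that the LM is consulted rarely in-distribution, which is what keeps the policy's efficiency intact.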

Universal Hypernetworks for Arbitrary Models

2604.02215v1 by Xuanfeng Zhou

Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.

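Summary: The Universal Hypernetwork (UHN) is a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors, decoupling the generator from the target network's parameterization. The authors report that one UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks, supports multi-model generalization and multi-task learning, and enables stable recursive generation with up to three intermediate generated UHNs. Code: https://github.com/Xuanfeng-Zhou/UHN.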

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications

2604.02206v1 by Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss

Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.

Summary: LEO (Learned Extension of Objects) is a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a parallelogram ground-truth formulation, it models complex geometries such as articulated trucks and trailers. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset show real-time efficiency, and validation on View of Delft (VoD) supports cross-dataset generalization.

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

2604.02194v1 by Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park

Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.

Summary: Neuro-RIT (Neuron-guided Robust Instruction Tuning) targets the vulnerability of retrieval-augmented language models to irrelevant or noisy retrieved contexts. It uses attribution-based neuron mining to separate neurons that process relevant versus irrelevant contexts, then applies two-stage instruction tuning that deactivates neurons exclusive to irrelevant contexts while optimizing targeted layers for evidence distillation. It consistently outperforms strong baselines across diverse QA benchmarks.

TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning

2604.02183v1 by Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Zhanting Zhou

Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.

Summary: Targeted reverse update (TRU) is a plug-and-play unlearning framework for multimodal recommendation systems (MRS). Observing that the influence of deleted data is concentrated unevenly across ranking behavior, modality branches, and network layers, TRU combines a ranking fusion gate, branch-wise modality scaling, and capacity-aware layer isolation. Across two backbones, three datasets, and three unlearning regimes, it achieves a better retain-forget trade-off than prior approximate baselines.

Adam's Law: Textual Frequency Law on Large Language Models

2604.02176v1 by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.

Summary: The paper proposes the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Sentence-level frequency is estimated from online resources, refined through Textual Frequency Distillation (TFD), and used in Curriculum Textual Frequency Training (CTFT), which fine-tunes in increasing order of frequency. Experiments on the Textual Frequency Paired Dataset (TFPD) cover math reasoning, machine translation, commonsense reasoning, and agentic tool calling.

GaelEval: Benchmarking LLM Performance for Scottish Gaelic

2604.02135v1 by Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne

Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline (n = 30), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline (78.1%). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+2.4%). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.

Summary: GaelEval is a multi-dimensional benchmark for Scottish Gaelic comprising an expert-authored morphosyntactic MCQA task, a culturally grounded translation benchmark, and a large-scale cultural knowledge Q&A task. Among 19 LLMs, Gemini 3 Pro Preview reaches 83.3% accuracy on the linguistic task, above the fluent-speaker human baseline of 78.1% (n = 30). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting gives a small but stable advantage (+2.4%); on the cultural task, leading models exceed 90% accuracy, although most systems perform worse under Gaelic prompting.

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

2604.02091v1 by Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia

Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

Summary: ReRanking Preference Optimization (RRPO) is a reinforcement learning framework that aligns reranking directly with the LLM's generation quality. It formulates reranking as a sequential decision-making process, optimizes for context utility using LLM feedback instead of human annotations, and stabilizes training with a reference-anchored deterministic baseline. It significantly outperforms strong baselines, including RankZephyr, and generalizes to other readers such as GPT-4o.

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

2604.02071v1 by Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

Summary: InCoM-Net (Instance-centric Context Mining Network) integrates semantic knowledge from VLMs with instance-specific features from an object detector for HOI detection. Instance-centric Context Refinement (ICR) extracts intra-instance, inter-instance, and global cues from VLM-derived features, and Progressive Context Aggregation (ProCA) iteratively fuses them with detector features. It achieves state-of-the-art performance on HICO-DET and V-COCO. Code: https://github.com/nowuss/InCoM-Net.

Improving MPI Error Detection and Repair with Large Language Models and Bug References

2604.02398v1 by Scott Piersall, Yang Gao, Shenyang Liu, Liqiang Wang

Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging due to their complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying large language models (LLMs) yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique alongside Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) techniques in LLMs to enhance the large language model's ability to detect and repair errors. Surprisingly, such enhancements lead to a significant improvement, from 44% to 77%, in error detection accuracy compared to baseline methods that use ChatGPT directly. Additionally, our experiments demonstrate our bug referencing technique generalizes well to other large language models.

Summary: Directly applying LLMs to MPI error detection and repair yields suboptimal results because the models lack knowledge of correct and incorrect MPI usage. The authors combine a bug-referencing technique with Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG), raising error detection accuracy from 44% to 77% relative to using ChatGPT directly. The technique also generalizes to other LLMs.

Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions

2604.02061v1 by Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo

Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real-world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff-KD, a framework that integrates diffusion-based generative refinement into teacher-student knowledge distillation for robust collaborative perception. Diff-KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR-V2X under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.

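Summary: Diff-KD integrates diffusion-based generative refinement into teacher-student knowledge distillation for collaborative perception under sensor and communication corruptions. Progressive Knowledge Distillation (PKD) treats local feature restoration as a conditional diffusion process, and Adaptive Gated Fusion (AGF) weights neighbors dynamically based on ego reliability. On OPV2V and DAIR-V2X under seven corruption types, it achieves state-of-the-art detection accuracy and calibration robustness.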

SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

2604.01993v1 by Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.

Summary: SAFE is a dynamic benchmarking framework that replaces ungrounded Chain-of-Thought with a strictly verifiable sequence of grounded entities. Train-time verification uses an atomic error taxonomy and a KG-grounded pipeline, identifying up to 14% of instances as unanswerable; inference-time verification uses a feedback model trained on the verified data to detect ungrounded steps in real time. SAFE achieves an average accuracy gain of 8.4 pp over standard baselines.

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

2604.01965v1 by Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

Summary: The paper asks whether scientific applications need bigger models and studies how far retrieval design can compensate for reduced model scale. Its lightweight retrieval-augmented framework uses task-aware routing to select retrieval strategies, combines full-text papers with structured scholarly metadata, and generates cited responses with compact instruction-tuned models. The findings indicate that retrieval and model scale are complementary rather than interchangeable: retrieval partially compensates for smaller models, but capacity still matters for complex reasoning.

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

摘要：異常頭部運動（AHMs）在廣泛的神經疾病中表現出來；然而，缺乏一個整合運動學測量、臨床嚴重程度評分和患者人口統計的多條件資源，構成了開發基於人工智慧的診斷工具的持續障礙。為了解決這一問題，本研究介紹了NeuroPose-AHM，這是一個基於知識的神經誘發AHMs數據集，通過應用於1,430篇經過同行評審的出版物的多LLM提取框架構建而成。該數據集包含2,756個患者群體級別的記錄，涵蓋57種神經疾病，來源於846篇與AHM相關的論文。跨LLM可靠性分析確認了穩健的提取性能，研究級別的分類達到強一致性（kappa = 0.822）。為了展示該數據集的分析效用，將四任務框架應用於頸部肌張力障礙（CD），這是由病理性頭部運動最直接定義的疾病。首先，任務1執行多標籤AHM類型分類（F1 = 0.856）。任務2構建頭頸嚴重程度指數（HNSI），這是一個統一的指標，將異質的臨床評分標準進行標準化。然後在任務3中評估該指數的臨床相關性，其中HNSI與現實世界的CD患者數據進行驗證，對應的重度比例（6.7%）為指數在高嚴重程度範圍內的校準提供了初步的合理性指示。最後，任務4在運動類型概率和HNSI分數之間進行橋接分析，產生了顯著的相關性（p小於0.001）。這些結果展示了NeuroPose-AHM作為一個結構化的、基於知識的神經AHM研究資源的分析效用。NeuroPose-AHM數據集在Zenodo上公開可用（https://doi.org/10.5281/zenodo.19386862）。

How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

2604.01938v1 by Ramon Ferrer-i-Cancho

The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least 77% optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.

摘要：所有序列的排列結構可以表示為一個排列多面體（permutohedron），這是一個圖，其中頂點是排列，兩個頂點相連如果在其中一個頂點的排列中相鄰元素的交換產生了另一個頂點的排列。有人假設語言中的詞序最小化排列多面體中的交換距離：給定一個源詞序，排列多面體中更接近的詞序應該成本較低，因此更有可能。在這裡，我們解釋如何測量詞序變化的最佳性程度，以最小化交換距離。我們通過顯示跨語言手勢至少達到 $77\%$ 的最佳性來說明我們新數學框架的威力。跨語言手勢多次達到最佳性不太可能是偶然的。我們為關於詞序或手勢序在通訊系統中最小化交換距離的最佳性研究建立了理論基礎。最後，我們將二次分配問題（QAP）引入語言研究，作為多個優化問題的總稱，並因此假設一個統一各種語言原則的最佳分配的一般原則，包括交換距離最小化。
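
The swap distance described in this abstract can be illustrated with a short sketch. The distance between two orders in the permutohedron is the minimum number of adjacent swaps needed to turn one into the other, which equals the number of pairs that appear in opposite relative order (the Kendall tau distance). The labels S, O, V and the function name `swap_distance` are illustrative assumptions, and the code does not reproduce the paper's optimality score.

```python
from itertools import permutations


def swap_distance(source, target):
    """Minimum number of adjacent swaps that turn `source` into `target`."""
    # Rank of every element in the target order.
    rank = {item: i for i, item in enumerate(target)}
    seq = [rank[item] for item in source]
    # Count inversions: pairs whose relative order differs between the two.
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


# Distance from a hypothetical source order "SOV" to every order of three elements.
orders = ["".join(p) for p in permutations("SOV")]
distances = {order: swap_distance("SOV", order) for order in orders}
```

Under this sketch, orders one swap away from the source (such as SVO) would count as less costly than the fully reversed order VOS, which sits at the maximum distance of n(n-1)/2 = 3 for n = 3.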

CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift

2604.01845v1 by HyunGi Kim, Jisoo Mok, Hyungyu Lee, Juhyeon Shin, Sungroh Yoon

Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.

摘要：多變量時間序列異常檢測（MTSAD）旨在識別多變量時間序列中的正常性偏差，並在現實應用中至關重要。然而，在現實部署中，分佈變化無處不在，並導致預訓練異常檢測器的性能嚴重下降。測試時適應（TTA）僅使用未標記的測試數據即時更新預訓練模型，使其在應對這一挑戰時顯得前景可期。在本研究中，我們提出了CANDI（針對分佈變化的多變量時間序列異常檢測的精選測試時適應），這是一個新穎的TTA框架，能夠選擇性地識別和適應潛在的假陽性，同時保留預訓練知識。CANDI引入了一種假陽性挖掘（FPM）策略，根據異常分數和潛在相似性來篩選適應樣本，並結合了一個即插即用的時空感知正常性適應（SANA）模塊，以進行結構性知識更新。大量實驗表明，CANDI在分佈變化下顯著提高了MTSAD的性能，AUROC提高了多達14%，同時使用了更少的適應樣本。

2604.01770v1 by Chao Li, Yuru Wang, Chunyi Zhao

Knowledge graphs store large numbers of relations efficiently, but they remain weak at representing a quieter difficulty: the meaning of a concept often shifts with the domain in which it is used. A triple such as Apple, instance-of, Company may be acceptable in one setting while being misleading or unusable in another. In most current systems, domain information is attached as metadata, qualifiers, or graph-level organization. These mechanisms help with filtering and provenance, but they usually do not alter the formal status of the assertion itself. This paper argues that domain should be treated as part of knowledge representation rather than as supplementary annotation. It introduces the Domain-Contextualized Concept Graph (DCG), a framework in which domain is written into the relation and interpreted as a modal world constraint. In the DCG form (C, R at D, C'), the marker at D identifies the world in which the relation holds. Formally, the relation is interpreted through a domain-indexed necessity operator, so that truth, inference, and conflict checking are all scoped to the relevant world. This move has three consequences: ambiguous concepts can be disambiguated at the point of representation; invalid assertions can be challenged against their domain; cross-domain relations can be connected through explicit predicates. The paper develops this claim through a Kripke-style semantics, a compact predicate system, a Prolog implementation, and mappings to RDF, OWL, and relational databases. The contribution is a representational reinterpretation of domain itself. The central claim is that many practical failures in knowledge systems begin when domain is treated as external to the assertion. DCG addresses that by giving domain a structural and computable role inside the representation.

摘要：知識圖譜有效地儲存大量的關係，但在表達一個更微妙的困難上仍然顯得薄弱：概念的意義往往隨著使用的領域而變化。像 Apple, instance-of, Company 這樣的三元組在一個環境中可能是可接受的，而在另一個環境中則可能會產生誤導或無法使用。在大多數當前系統中，領域信息作為元數據、限定詞或圖層組織附加。這些機制有助於過濾和來源追溯，但通常不會改變斷言本身的正式狀態。本文主張，領域應被視為知識表達的一部分，而非補充註釋。它引入了領域情境化概念圖（DCG），這是一個將領域寫入關係並解釋為模態世界約束的框架。在 DCG 形式 (C, R at D, C') 中，D 的標記標識了關係成立的世界。正式地，該關係通過一個領域索引的必要運算符來解釋，因此真理、推理和衝突檢查都被限制在相關的世界範疇內。這一舉措有三個後果：模糊的概念可以在表達的時候進行消歧；無效的斷言可以根據其領域受到挑戰；跨領域的關係可以通過明確的謂詞連接。本文通過克里普克風格的語義學、一個緊湊的謂詞系統、一個 Prolog 實現，以及對 RDF、OWL 和關係數據庫的映射來發展這一主張。其貢獻在於對領域本身的表達重新詮釋。核心主張是，許多知識系統中的實際失敗始於將領域視為斷言的外部。DCG 通過在表達內部賦予領域結構性和可計算的角色來解決這一問題。
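
The DCG form (C, R at D, C') can be pictured with a minimal sketch in which the domain is part of the stored assertion, so that lookups are always scoped to one domain. The paper describes a Prolog implementation with Kripke-style semantics; the Python classes below, the method names, and the domain labels "business" and "botany" are assumptions made only for illustration.

```python
from typing import NamedTuple


class DomainTriple(NamedTuple):
    subject: str
    relation: str
    domain: str  # the "at D" part of (C, R at D, C')
    obj: str


class DomainGraph:
    """Toy store in which every assertion carries its own domain."""

    def __init__(self):
        self._triples = set()

    def assert_triple(self, subject, relation, domain, obj):
        self._triples.add(DomainTriple(subject, relation, domain, obj))

    def holds(self, subject, relation, domain, obj):
        # Truth is checked only inside the named domain.
        return DomainTriple(subject, relation, domain, obj) in self._triples

    def objects(self, subject, relation, domain):
        # All objects related to `subject` by `relation` within `domain`.
        return sorted(
            t.obj
            for t in self._triples
            if (t.subject, t.relation, t.domain) == (subject, relation, domain)
        )


graph = DomainGraph()
graph.assert_triple("Apple", "instance-of", "business", "Company")
graph.assert_triple("Apple", "instance-of", "botany", "Fruit")
```

In this sketch, asking whether Apple is a Company succeeds in the hypothetical "business" domain and fails in "botany", which is the kind of disambiguation at the point of representation that the abstract describes.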

FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation

2604.01766v1 by Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar

Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km² of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, R² = 0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29-46% in MAE (5.81 m vs. 8.14-10.84 m) with stronger correlation coefficients (0.713 vs. 0.166-0.652). Ablations show that multi-modal fusion improves performance by 10-26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.

摘要：非常高解析度 (VHR) 的森林結構數據在單棵樹的尺度上對於碳、生物多樣性和生態系統監測至關重要。儘管空中LiDAR仍然是森林結構指標（如樹冠高度模型 (CHM)、植物面積指數 (PAI) 和葉片高度多樣性 (FHD)）的參考，但其成本高昂且使用頻率不高。我們提出了FSKD：一個LiDAR到RGB-紅外 (RGBI) 知識蒸餾 (KD) 框架，其中一個多模態教師通過交叉注意力將RGBI影像與LiDAR衍生的平面指標和垂直剖面融合，而僅使用RGBI的SegFormer學生則學習重現這些輸出。在德國薩克森州的384 $km^2$ 森林上進行訓練（地面取樣距離 (GSD) 為20厘米），並在八個地理上不同的測試區塊上進行評估，該學生在零樣本CHM性能上達到了最先進的 (SOTA) 表現（MedAE 4.17 m，$R^2$=0.51，IoU 0.87），在MAE方面超越了HRCHM/DAC基準29--46%（5.81 m對比8.14--10.84 m），並且具有更強的相關係數（0.713對比0.166--0.652）。消融實驗顯示，多模態融合在性能上比僅RGBI訓練提高了10--26%，而且具備適當模型容量的非對稱蒸餾是關鍵。該方法共同預測CHM、PAI和FHD，這是一種當前單目CHM估計器所不具備的多指標能力，儘管PAI/FHD的轉移仍然依賴於區域，並受益於本地校準。該框架在時間不匹配（冬季LiDAR，夏季RGBI）下仍然有效，消除了嚴格的共同獲取限制，並為數位雙胞胎德國和國家數位正射影像計畫等工作流程實現可擴展的20厘米操作監測。

AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows

2604.01738v1 by Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang

Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.

摘要：將大型語言模型 (LLMs) 整合到超音速熱保護系統 (TPS) 設計中，因生成可執行的模擬工件時出現級聯約束違反而受到瓶頸。通用 LLMs 將生成視為單次文本完成，無法滿足安全關鍵工程工作流程中固有的序列多閘約束。為了解決這個問題，我們提出了 AeroTherm-GPT，第一個專門針對 TPS 的 LLM 代理，通過約束閉環生成 (CCLG) 框架實現。CCLG 將 TPS 工件生成組織為一個迭代工作流程，包括生成、驗證、CDG 引導的修復、執行和審核。約束依賴圖 (CDG) 編碼了約束類別之間的實證共同解決結構，根據生命週期排序的先驗和實證共同解決概率，將修復指向上游故障候選。這一上游優先機制每個行動解決多個下游違規，實現了 4.16 的根本原因修復效率，相較於 1.76 的平面清單修復。經過在 HyTPS-Bench 上評估並與外部基準驗證，AeroTherm-GPT 實現了 88.7% 的端到端成功率 (95% CI: 87.5-89.9)，相比匹配的非 CDG 消融基線提高了 +12.5 個百分點，且在科學推理和代碼生成任務上沒有出現災難性遺忘。

The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs

2604.01728v1 by Wilf Morlidge, Elliott Watkiss-Leek, George Hannah, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis

Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.

摘要：實現異質實驗數據系統之間的語義互操作性仍然是數據驅動科學發現的一大障礙。分析信息標記語言（AnIML）是一種靈活的基於XML的標準，用於分析化學和生物學，越來越多地被工業研發實驗室用於管理和交換實驗數據。然而，XML架構的表達能力允許利益相關者之間存在不同的解釋，這引入了不一致性，削弱了AnIML架構所設計支持的互操作性。在本文中，我們提出了AnIML本體，一種OWL 2本體，正式化AnIML的語義並將其與Allotrope數據格式對齊，以支持未來的跨系統和跨實驗室互操作性。該本體是通過專家參與的方式開發的，結合了LLM輔助的需求引導和協作本體工程。我們通過多層次的方法驗證該本體：將現實世界的AnIML文件數據驅動地轉換為知識圖譜，通過SPARQL進行能力問題驗證，以及基於對抗性負能力問題的創新驗證協議，這些問題映射到已建立的本體反模式並通過SHACL約束強制執行。

LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis

2604.01725v1 by Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, Xuewen Miao, Guijiang Li

General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor x which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.

摘要：一般航空故障診斷和高效維護對於飛行安全至關重要；然而，在資源受限的邊緣設備上部署深度學習模型面臨計算能力和可解釋性兩方面的挑戰。本文提出了LiteInception——一種設計用於邊緣部署的輕量級可解釋故障診斷框架。該框架採用與標準維護工作流程對齊的兩階段級聯架構：第一階段執行高召回率的故障檢測，第二階段對異常樣本進行細粒度故障分類，從而解耦優化目標並實現計算資源的按需分配。為了進行模型壓縮，提出了一種基於互信息、梯度分析和SE注意權重的多方法融合策略，將輸入傳感器通道從23個減少到15個，並引入了一種1+1分支的LiteInception架構，將InceptionTime參數壓縮70%，使CPU推理加速超過8倍，F1損失低於3%。此外，引入知識蒸餾作為精度-召回調節機制，使得相同的輕量級模型能夠通過切換訓練策略適應不同場景——例如安全關鍵和輔助診斷。最後，構建了一個整合四種歸因方法的雙層可解釋性框架，提供“哪個傳感器 x 哪個時間段”的可追溯證據鏈。在NGAFID數據集上的實驗顯示，故障檢測準確率為81.92%，召回率為83.24%，故障識別準確率為77.00%，驗證了該框架在效率、準確性和可解釋性之間的良好平衡。

Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition

2604.01711v1 by Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H

Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.

摘要：越南語語音情感識別（SER）由於模糊的聲學模式和缺乏可靠的標註數據，仍然面臨挑戰，特別是在情感邊界不明確的現實條件下。為了解決這個問題，本文提出了一個人機協作框架，將人類知識整合到學習過程中，而不是僅僅依賴數據驅動的模型。所提出的框架圍繞基於大語言模型（LLM）的推理展開，其中使用基於聲學特徵的模型提供輔助信號，例如信心和特徵級證據。引入了一種基於信心的路由機制，以區分簡單和模糊的樣本，允許不確定的案例委託給LLM進行更深入的推理，這些推理受到來自人類標註行為的結構化規則的指導。此外，採用了一種迭代精煉策略，通過錯誤分析和規則更新不斷提高系統性能。在一個包含2,764個樣本的越南語語音數據集上進行了實驗，涵蓋三個情感類別（平靜、憤怒、驚慌），具有高的標註者間一致性（Fleiss Kappa = 0.8574），確保了可靠的真實標準。所提出的方法達到了強大的性能，準確率高達86.59%，宏觀F1約為0.85-0.86，顯示出其在處理模糊和難以分類的案例中的有效性。總體而言，這項工作突顯了將數據驅動模型與人類推理相結合的重要性，提供了一種強健且與模型無關的語音情感識別方法，適用於資源匱乏的環境。
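
The confidence-based routing in this abstract can be sketched as a simple threshold rule: confident acoustic predictions are accepted directly, and uncertain ones are delegated to an LLM. The threshold value of 0.8, the function names, and the stub standing in for the LLM call are all assumptions; the abstract does not state how the routing is configured.

```python
from typing import Callable, Dict, Tuple


def route_sample(
    probabilities: Dict[str, float],
    llm_fallback: Callable[[Dict[str, float]], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, source), where source is 'acoustic' or 'llm'."""
    # Top class and its probability from the acoustic model.
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "acoustic"  # easy sample: keep the acoustic prediction
    # Ambiguous sample: delegate to the (hypothetical) LLM reasoning step.
    return llm_fallback(probabilities), "llm"


def stub_llm(probabilities: Dict[str, float]) -> str:
    # Placeholder for rule-guided LLM reasoning; always answers "panic" here.
    return "panic"


easy = route_sample({"calm": 0.92, "angry": 0.05, "panic": 0.03}, stub_llm)
hard = route_sample({"calm": 0.40, "angry": 0.35, "panic": 0.25}, stub_llm)
```

Passing the fallback as a callable keeps the router independent of any particular LLM, in line with the model-agnostic framing in the abstract.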

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

2604.01707v1 by Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang

Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

摘要：記憶在基於大型語言模型 (LLM) 的代理中成為長期複雜任務（例如，多輪對話、遊戲、科學發現）的核心模塊，記憶可以促進知識積累、迭代推理和自我演變。文獻中提出了多種記憶方法。然而，這些方法在相同的實驗設置下尚未進行系統性和全面的比較。在本文中，我們首先從高層次的角度總結了一個統一框架，該框架整合了所有現有的代理記憶方法。然後，我們在兩個知名基準上廣泛比較了代表性的代理記憶方法，並檢查了所有方法的有效性，提供了對這些方法的徹底分析。作為我們實驗分析的副產品，我們還通過利用現有方法中的模塊設計了一種新的記憶方法，該方法超越了最先進的技術。最後，基於這些發現，我們提供了有前景的未來研究機會。我們相信，對現有方法行為的更深入理解可以為未來的研究提供有價值的新見解。

MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning

2604.01694v1 by Sten Rüdiger, Sebastian Raschka

Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.

摘要：次要組件適應（MiCA）是一種針對大型語言模型的新型參數高效微調方法，專注於適應模型表示中的未充分利用的子空間。與傳統方法如低秩適應（LoRA）不同，後者針對主導子空間，MiCA利用奇異值分解來識別與最不重要的奇異值相關的次要奇異向量的子空間，並在微調過程中將參數更新限制在這些方向上。這一策略在優化的訓練超參數下，知識獲取的提升可達5.9倍，並且與LoRA相比，參數佔用最小為6-60%。這些結果表明，將適應限制在次要奇異方向上提供了一種更高效且穩定的機制，以將新知識整合到預訓練的語言模型中。
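
The idea of restricting updates to minor singular directions can be pictured with NumPy. The sketch below takes the right singular vectors with the smallest singular values and projects a candidate update onto their span. It is an assumption about one way such a constraint could be expressed, not the MiCA parameterization itself, which the abstract does not specify; the function names and the 8 x 8 matrix are illustrative.

```python
import numpy as np


def minor_subspace(weight, rank):
    """Right singular vectors for the `rank` smallest singular values."""
    # numpy returns singular values in descending order, so the minor
    # directions are the last rows of vt.
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    return vt[-rank:]  # shape: (rank, in_features)


def project_update(update, basis):
    """Keep only the part of `update` that acts on the minor directions."""
    return update @ basis.T @ basis


rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 8))      # stand-in for a pretrained weight matrix
basis = minor_subspace(weight, rank=2)
raw_update = rng.normal(size=(8, 8))  # stand-in for an unconstrained update
constrained = project_update(raw_update, basis)
```

Because the rows of `basis` are orthonormal, projecting a second time leaves `constrained` unchanged, and the constrained update has rank at most 2.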

PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment

2604.01682v1 by Chenning Xu, Mao Zheng, Mingyang Song

Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.

摘要：監督式微調（SFT）使用標記級的硬標籤可能會放大對事實上不支持目標的過度自信模仿，導致在多句生成中出現的幻覺。我們研究了一種增強的 SFT 設定，其中訓練實例包括粗略的句子級事實風險標籤和句子間依賴性註釋，提供有關事實承諾薄弱支持的結構性信號。我們提出了 \textbf{PRISM}，這是一個可微分的風險閘框架，僅在事實關鍵位置修改學習。PRISM 通過一個輕量級、模型感知的概率重新分配目標來增強標準 SFT，該目標對於高置信度預測在風險目標標記上進行懲罰，其範圍由跨度級風險權重和模型感知閘控製。對於對幻覺敏感的事實基準和一般評估的實驗顯示，PRISM 在保持競爭性的整體能力配置文件的同時，改善了各個骨幹的事實聚合。消融實驗進一步顯示，輔助信號在保守使用時最為有效，並且知識遮蔽和模型感知重新分配在平衡事實修正和能力保留方面扮演互補角色。

Can Heterogeneous Language Models Be Fused?

2604.01674v1 by Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.

摘要：模型合併旨在將多個專家模型整合為一個單一模型，該模型繼承它們的互補優勢，而不會產生集成時的推理成本。最近的進展顯示，當所有源模型是\emph{同質}時，即源自相同的預訓練骨幹，因此共享對齊的參數坐標或兼容的任務向量，合併可以非常有效。然而，在開放模型生態系統中，這一假設越來越不現實，因為有用的專家通常建立在不同的家族上，例如Llama、Qwen和Mistral。在這種\emph{異質}環境中，由於架構不匹配、潛在基礎不對齊以及放大的跨源衝突，直接的權重空間融合變得不適定。我們通過\texttt{HeteroFusion}來解決這個問題，這是一種異質語言模型融合方法，包含兩個關鍵組件：基於拓撲的對齊，通過匹配功能模塊結構而非原始張量坐標來在異質骨幹之間轉移知識，以及衝突感知去噪，在融合過程中抑制不兼容或噪聲的轉移信號。我們進一步提供分析證明，顯示在預測結構更新的同時保持目標適配器基礎會導致穩定且良好條件的轉移過程。在異質轉移、多源融合、噪聲源穩健性和跨家族泛化設置中，\texttt{HeteroFusion}始終超越強大的合併、融合和集成基準。

PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation

2604.01671v1 by Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han

Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.

摘要：情感支持對話（ESC）旨在通過生成同理心反應來減輕個體的情感困擾。然而，現有方法在有效支持深層次上下文理解方面面臨挑戰。為了解決這個問題，我們提出了PRCCF，一個以角色為導向的檢索和因果關聯認知過濾框架。具體而言，該框架包含一個角色導向的檢索機制，該機制共同建模語義相容性和角色對齊，以增強反應生成。此外，它還採用一個因果關聯認知過濾模塊，以優先考慮因果相關的外部知識，從而改善情感推理的上下文認知理解。在ESConv數據集上進行的大量實驗表明，PRCCF在自動指標和人類評估上均優於最先進的基準。我們的代碼已公開可用，網址為：https://github.com/YancyLyx/PRCCF。

Hierarchical Memory Orchestration for Personalized Persistent Agents

2604.01670v1 by Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yuqi Li, Yirong Chen, Ding Wang

While long-term memory is essential for intelligent agents to maintain consistent historical awareness, the accumulation of extensive interaction data often leads to performance bottlenecks. Naive storage expansion increases retrieval noise and computational latency, overwhelming the reasoning capacity of models deployed on constrained personal devices. To address this, we propose Hierarchical Memory Orchestration (HMO), a framework that organizes interaction history into a three-tiered directory driven by user-centric contextual relevance. Our system maintains a compact primary cache, coupling recent and pivotal memories with an evolving user profile to ensure agent reasoning remains aligned with individual behavioral traits. This primary cache is complemented by a high-priority secondary layer, both of which are managed within a global archive of the full interaction history. Crucially, the user persona dictates memory redistribution across this hierarchy, promoting records mapped to long-term patterns toward more active tiers while relegating less relevant information. This targeted orchestration surfaces historical knowledge precisely when needed while maintaining a lean and efficient active search space. Evaluations on multiple benchmarks achieve state-of-the-art performance. Real-world deployments in ecosystems like OpenClaw demonstrate that HMO significantly enhances agent fluidity and personalization.

摘要：長期記憶對於智能代理保持一致的歷史意識至關重要，但大量互動數據的積累往往導致性能瓶頸。簡單的存儲擴展會增加檢索噪聲和計算延遲，超出受限個人設備上模型的推理能力。為了解決這個問題，我們提出了分層記憶協調（HMO），這是一個將互動歷史組織成三層目錄的框架，驅動因素是以用戶為中心的上下文相關性。我們的系統維持一個緊湊的主緩存，將近期和關鍵的記憶與不斷演變的用戶檔案結合，以確保代理的推理與個體行為特徵保持一致。這個主緩存由一個高優先級的次級層補充，這兩者都在完整互動歷史的全球檔案中進行管理。關鍵在於，用戶角色決定了這個層級中的記憶重新分配，促進與長期模式映射的記錄向更活躍的層級移動，同時將不太相關的信息降級。這種有針對性的協調在需要時準確地顯現歷史知識，同時保持精簡且高效的主動搜索空間。在多個基準上的評估達到了最先進的性能。在像OpenClaw這樣的生態系統中的實際部署顯示，HMO顯著增強了代理的流暢性和個性化。
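
The three-tier organization in this abstract can be sketched as a ranking step: records that score highest on user relevance go to a small primary cache, the next group to a secondary layer, and the full history stays in the archive. The `relevance` score, the tier sizes, and the record texts are hypothetical; the abstract does not say how HMO scores or redistributes memories.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    text: str
    relevance: float  # hypothetical persona-alignment score in [0, 1]


def assign_tiers(records, primary_size=2, secondary_size=3):
    """Split records into primary and secondary tiers; archive keeps all."""
    ranked = sorted(records, key=lambda r: r.relevance, reverse=True)
    return {
        "primary": ranked[:primary_size],
        "secondary": ranked[primary_size:primary_size + secondary_size],
        "archive": ranked,  # the archive holds the full interaction history
    }


records = [
    MemoryRecord("likes hiking", 0.9),
    MemoryRecord("asked about weather", 0.1),
    MemoryRecord("prefers short answers", 0.8),
    MemoryRecord("booked dentist", 0.4),
    MemoryRecord("works night shifts", 0.7),
    MemoryRecord("mentioned a typo", 0.2),
]
tiers = assign_tiers(records)
```

Re-running `assign_tiers` after relevance scores change would promote or demote records, which loosely mirrors the persona-driven redistribution the abstract mentions.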

Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion

2604.01669v1 by Juncen Guo, Xiaoguang Zhu, Jingyi Wu, Jingyu Zhang, Jingnan Cai, Zhenghao Niu, Liang Song

Embodied perception systems face severe distribution drift when they interact continuously in open, dynamic physical environments. However, existing domain-incremental awareness methods often rely on a domain ID being available at test time, which limits their practicality in unknown interaction scenarios. At the same time, models often overfit to context-specific perceptual noise, which leads to poor generalization and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems that aims to achieve robust continual environment adaptation. The method uses a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model toward extracting intrinsic semantic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discriminative ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-ID-free setting, and that its accuracy exceeds that of existing state-of-the-art methods.

摘要：具身感知系統在持續與開放物理空間互動時，面臨動態環境分佈漂移的嚴重挑戰。然而，現有的領域增量感知方法通常依賴於在測試階段事先獲得的領域 ID，這限制了它們在未知互動場景中的實用性。同時，模型往往過度擬合於特定上下文的感知噪聲，導致泛化能力不足和災難性遺忘。為了解決這些限制，我們提出了一個無領域 ID 和範例的增量學習框架，旨在實現穩健的持續環境適應。該方法設計了一個解耦表示機制，以去除非必要的環境風格干擾，並引導模型專注於提取跨場景共享的語義內在特徵，從而消除感知不確定性並改善泛化。我們進一步使用權重融合策略，在參數空間中動態整合舊環境和新環境的知識，以確保模型能夠適應新分佈而不需存儲歷史數據，並最大限度地保留舊環境的區分能力。在多個標準基準數據集上的廣泛實驗表明，所提出的方法在完全無範例和無領域 ID 的設置下顯著減少了災難性遺忘，其準確性優於現有的最先進方法。
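
Weight fusion in parameter space can be pictured as an interpolation between the parameters learned on old environments and those learned on the new one. The sketch below uses a single fixed coefficient `alpha`, which is a simplifying assumption: the abstract describes the integration as dynamic and does not give the fusion rule. The parameter names and values are illustrative.

```python
def fuse_weights(old, new, alpha=0.5):
    """Interpolate two parameter sets: alpha * old + (1 - alpha) * new."""
    if old.keys() != new.keys():
        raise ValueError("parameter names must match")
    fused = {}
    for name in old:
        if len(old[name]) != len(new[name]):
            raise ValueError(f"shape mismatch for {name}")
        fused[name] = [
            alpha * o + (1.0 - alpha) * n
            for o, n in zip(old[name], new[name])
        ]
    return fused


# Toy parameters stored as flat lists of floats.
old_params = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
new_params = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
fused = fuse_weights(old_params, new_params, alpha=0.5)
```

Setting `alpha` closer to 1 favors old-environment knowledge, while values closer to 0 favor the new distribution; no historical data is needed, only the two parameter sets.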

2604.01667v1 by Rui Dong, Xiaotong Zhang, Jiaxing Li, Yueying Li, Jiayin Wei, Youyong Kong

Multi-modal fusion is of great significance in neuroscience: it integrates information from different modalities and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods for brain networks, which mainly focus on the structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring the inherent differences between input samples. This lack of sample adaptation limits model performance. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike static fusion methods, we design different mixtures of experts (MoEs) for uni- and multi-modal representations, in which modules adapt to each input sample during inference. To alleviate the MoE issue in which expert training may collapse, we divide our method into three stages: we first train the uni-modal encoders separately, then pretrain the individual experts of the MoEs, and finally fine-tune the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

2604.01658v1 by Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models

2604.01650v1 by Yunge Wen, Awu Chen, Jianing Yu, Jas Brooks, Hiroshi Ishii, Paul Pu Liang

Smell's deep connection with food, memory, and social experience has long motivated researchers to bring olfaction into interactive systems. Yet most olfactory interfaces remain limited to fixed scent cartridges and pre-defined generation patterns, and the scarcity of large-scale olfactory datasets has further constrained AI-based approaches. We present AromaGen, an AI-powered wearable interface capable of real-time, general-purpose aroma generation from free-form text or visual inputs. AromaGen is powered by a multimodal LLM that leverages latent olfactory knowledge to map semantic inputs to structured mixtures of 12 carefully selected base odorants, released through a neck-worn dispenser. Users can iteratively refine generated aromas through natural language feedback via in-context learning. Through a controlled user study ($N = 26$), AromaGen matches human-composed mixtures in zero-shot generation and significantly surpasses them after iterative refinement, achieving a median similarity of 8/10 to real food aromas and reducing perceived artificiality to levels comparable to real food. AromaGen is a step towards real-world interactive aroma generation, opening new possibilities for communication, wellbeing, and immersive technologies.

Exploring Robust Multi-Agent Workflows for Environmental Data Management

2604.01647v1 by Boyuan Guan, Jason Liu, Yanzhao Wu, Kiavash Bahreini

Embedding LLM-driven agents into environmental FAIR data management is compelling: they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University's GIS Center Ecological Archive (849 curated datasets) serves as a single-agent baseline. SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow. The multi-agent approach improved both efficiency (the deployment was completed by a single operator in two days, with repeated artifact reuse across deployments) and reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication. A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution. This paper has been accepted at PEARC 2026.
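
The fail-stop idea described above can be sketched as a deterministic validator that sits in front of an irreversible step. This is not EnviSmart's code: the `Station` record, the bounding box, and the `mint_doi` callback are hypothetical stand-ins, and the check shown (coordinates must fall inside an expected extent) is only one plausible way to catch a coordinate transformation error.

```python
from dataclasses import dataclass

class ValidationError(Exception):
    """Raised to stop the pipeline before an irreversible action."""

@dataclass(frozen=True)
class Station:
    station_id: str
    lat: float
    lon: float

def validate_stations(stations, bbox):
    """Deterministic check: each station must be a valid lat/lon inside the expected extent."""
    min_lat, max_lat, min_lon, max_lon = bbox
    bad = [
        s.station_id
        for s in stations
        if not (
            -90.0 <= s.lat <= 90.0
            and -180.0 <= s.lon <= 180.0
            and min_lat <= s.lat <= max_lat
            and min_lon <= s.lon <= max_lon
        )
    ]
    if bad:
        raise ValidationError(f"{len(bad)} station(s) outside expected extent: {bad[:5]}")

def publish(stations, bbox, mint_doi):
    """Run the validator at the trust boundary, then perform the irreversible step."""
    validate_stations(stations, bbox)  # fail-stop: nothing below runs on failure
    return mint_doi(stations)

# Hypothetical extent and records; the second list has latitude and longitude swapped.
bbox = (24.0, 32.0, -88.0, -79.0)
good = [Station("A", 25.8, -80.2)]
swapped = [Station("A", -80.2, 25.8)]
```

Calling `publish(swapped, bbox, mint_doi)` raises `ValidationError` before `mint_doi` is ever invoked, which is the property that matters when the downstream action cannot be undone.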

2604.01634v1 by Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces that are poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

2604.01610v1 by Taraneh Ghandi, Hamidreza Mahyar, Shachar Klaiman

The use of knowledge graphs for grounding agents in real-world Q&A applications has become increasingly common. Answering complex queries often requires multi-hop reasoning and the ability to navigate vast relational structures. Standard approaches rely on prompting techniques that steer large language models to reason over raw graph context, or retrieval-augmented generation pipelines where relevant subgraphs are injected into the context. These, however, face severe limitations with enterprise-scale KGs that cannot fit in even the largest context windows available today. We present GraphWalk, a problem-agnostic, training-free, tool-based framework that allows off-the-shelf LLMs to reason through sequential graph navigation, dramatically increasing performance across different tasks. Unlike task-specific agent frameworks that encode domain knowledge into specialized tools, GraphWalk equips the LLM with a minimal set of orthogonal graph operations sufficient to traverse any graph structure. We evaluate whether models equipped with GraphWalk can compose these operations into correct multi-step reasoning chains, where each tool call represents a verifiable step creating a transparent execution trace. We first demonstrate our approach on maze traversal, a problem non-reasoning models are completely unable to solve, then present results on graphs resembling real-world enterprise knowledge graphs. To isolate structural reasoning from world knowledge, we evaluate on entirely synthetic graphs with random, non-semantic labels. Our benchmark spans 12 query templates from basic retrieval to compound first-order logic queries. Results show that tool-based traversal yields substantial and consistent gains over in-context baselines across all model families tested, with gains becoming more pronounced as scale increases, precisely where in-context approaches fail catastrophically.
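
As a rough illustration of tool-based traversal, the sketch below exposes two small graph operations and records every call in a trace. The abstract does not list GraphWalk's actual operations, so `relations`, `neighbors`, and the `GraphTools` class are assumptions; `follow_path` is a scripted stand-in for the sequence of tool calls an LLM would choose.

```python
class GraphTools:
    """Two hypothetical graph operations over (head, relation, tail) triples."""

    def __init__(self, triples):
        self._out = {}
        for head, rel, tail in triples:
            self._out.setdefault(head, {}).setdefault(rel, []).append(tail)
        self.trace = []  # every tool call is logged, giving an inspectable execution trace

    def relations(self, node):
        """List the outgoing relation labels of a node."""
        result = sorted(self._out.get(node, {}))
        self.trace.append(("relations", node, result))
        return result

    def neighbors(self, node, relation):
        """List the nodes reached from `node` via `relation`."""
        result = list(self._out.get(node, {}).get(relation, []))
        self.trace.append(("neighbors", node, relation, result))
        return result

def follow_path(tools, start, relation_path):
    """Scripted multi-hop walk: apply one relation per hop, one tool call per frontier node."""
    frontier = [start]
    for rel in relation_path:
        next_frontier = []
        for node in frontier:
            for reached in tools.neighbors(node, rel):
                if reached not in next_frontier:
                    next_frontier.append(reached)
        frontier = next_frontier
    return frontier

# Synthetic graph with non-semantic labels, in the spirit of the evaluation setup.
triples = [("n1", "r7", "n2"), ("n2", "r3", "n4"), ("n2", "r3", "n5"), ("n1", "r9", "n3")]
tools = GraphTools(triples)
answer = follow_path(tools, "n1", ["r7", "r3"])
```

Only the nodes actually visited enter the trace, so the amount of graph the caller sees grows with the length of the walk rather than with the size of the graph.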

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

2604.01608v1 by Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang

Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom ($F$), the first a priori predictor of skill utility. $F$ measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by $F$, we propose a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 targets computationally intensive iterative refinement exclusively toward "rigid" metrics ($F \lesssim 0.6$) to eliminate trajectory-local overfitting. Evaluating across 4 tasks, 11 datasets, and 6 metrics, $F$ strongly predicts skill utility ($\rho = -0.62$, $p < 0.05$). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, our adaptive agent matches or exceeds the original MAS while reducing cost by up to $8\times$ and latency by up to $15\times$.
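
The Mantel test mentioned above correlates two distance matrices and assesses significance by permuting the rows and columns of one of them. The sketch below implements that generic test with the standard library only. It does not reproduce how the paper turns the result into $F$, which the abstract does not specify, and the toy outputs and scores are invented.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def upper(matrix, order):
    """Upper-triangular entries of `matrix` after relabeling items by `order`."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (correlation, two-sided permutation p-value) for two distance matrices."""
    identity = list(range(len(dist_a)))
    a = upper(dist_a, identity)
    observed = pearson(a, upper(dist_b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of dist_b together
        if abs(pearson(a, upper(dist_b, perm))) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Invented example: outputs are points on a line, and the score is a linear function of them,
# so pairwise output distance and pairwise score distance are tightly coupled.
points = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
scores = [2.0 * p + 1.0 for p in points]
output_dist = [[abs(a - b) for b in points] for a in points]
score_dist = [[abs(a - b) for b in scores] for a in scores]
r, p_value = mantel(output_dist, score_dist)
```

In this toy case the correlation is close to 1 with a small p-value, meaning that moving the output moves the score in lockstep. A metric whose scores barely change as outputs vary would give a correlation near zero.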

ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context

2604.01599v1 by Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, Toan Nguyen

Memory-Augmented Generation (MAG) extends large language models with external memory to support long-context reasoning, but existing approaches universally treat memory as an external service that agents call into, delegating storage to separate pipelines of chunking, embedding, and graph extraction. This architectural separation means the system that stores knowledge does not understand it, leading to semantic drift between what the agent intended to remember and what the pipeline actually captured, loss of coordination context across agents, and fragile recovery after failures. In this paper, we propose ByteRover, an agent-native memory architecture that inverts the memory pipeline: the same LLM that reasons about a task also curates, structures, and retrieves knowledge. ByteRover represents knowledge in a hierarchical Context Tree, a file-based knowledge graph organized as Domain, Topic, Subtopic, and Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy that resolves most queries at sub-100 ms latency without LLM calls, escalating to agentic reasoning only for novel questions. Experiments on LoCoMo and LongMemEval demonstrate that ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval while requiring zero external infrastructure, no vector database, no graph database, no embedding service, with all knowledge stored as human-readable markdown files on the local filesystem.
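
A small sketch can show how importance scoring, recency decay, and tiered retrieval might fit together. The abstract does not give ByteRover's formulas or describe its five tiers, so everything here is an assumption: the exponential half-life decay, the two cheap tiers (exact path, keyword overlap), and the names `Entry`, `score`, and `retrieve`.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    path: tuple           # (domain, topic, subtopic, entry name)
    text: str
    importance: float     # hypothetical score in [0, 1]
    last_access_day: int

def score(entry, today, half_life_days=30.0):
    """Importance discounted by an assumed exponential recency decay."""
    age = today - entry.last_access_day
    return entry.importance * 0.5 ** (age / half_life_days)

def retrieve(entries, query, today):
    """Try cheap tiers first; return ("escalate", None) when neither resolves the query."""
    exact = [e for e in entries if "/".join(e.path) == query]
    if exact:
        return "exact-path", exact[0]
    words = set(query.lower().split())
    matches = [e for e in entries if words & set(e.text.lower().split())]
    if matches:
        return "keyword", max(matches, key=lambda e: score(e, today))
    return "escalate", None  # the caller would hand this case to LLM-based reasoning

entries = [
    Entry(("infra", "deploy", "rollback", "steps"), "rollback uses the previous release tag", 0.8, 70),
    Entry(("infra", "deploy", "rollback", "notes"), "rollback failed once due to cache", 0.6, 99),
]
tier, hit = retrieve(entries, "rollback cache", today=100)
```

With `today=100`, the older high-importance entry decays to a score of 0.4 while the recently touched one scores roughly 0.59, so the keyword tier returns the second entry without any model call.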

Do Large Language Models Mentalize When They Teach?

2604.01594v1 by Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky

How do LLMs decide what to teach next: by reasoning about a learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.
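
The model comparison step uses the Bayesian information criterion, $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood; lower is better. A minimal sketch follows, with log-likelihoods and parameter counts that are made up for illustration and are not results from the paper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def best_model(fits, n_obs):
    """Pick the model name with the lowest BIC from {name: (log_likelihood, n_params)}."""
    return min(fits, key=lambda name: bic(fits[name][0], fits[name][1], n_obs))

# Hypothetical fits to one teacher's trial-by-trial choices (invented numbers).
fits = {
    "bayes_optimal": (-410.0, 2),
    "reward_heuristic": (-455.0, 1),
    "non_mentalizing_utility": (-405.0, 6),
}
winner = best_model(fits, n_obs=300)
```

In this invented example the utility model has the highest likelihood, but its four extra parameters cost more than the likelihood gain is worth, so the two-parameter model wins.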

A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies

2604.01529v1 by Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su

Current Large Language Model (LLM) approaches to information extraction (IE) in the healthy food policy domain are often hindered by misinformation, specifically the hallucinations, misclassifications, and omissions that result from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.

A Self-Evolving Agentic Framework for Metasurface Inverse Design

2604.01480v1 by Yi Huang, Bowen Zheng, Yunxi Dong, Hong Tang, Huan Zhao, S. M. Rakibul Hasan Shawon, Hualiang Zhang

Metasurface inverse design has become central to realizing complex optical functionality, yet translating target responses into executable, solver-compatible workflows still demands specialized expertise in computational electromagnetics and solver-specific software engineering. Recent large language models (LLMs) offer a complementary route to reducing this workflow-construction burden, but existing language-driven systems remain largely session-bounded and do not preserve reusable workflow knowledge across inverse-design tasks. We present an agentic framework for metasurface inverse design that addresses this limitation through context-level skill evolution. The framework couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation so that solver-specific strategies can be iteratively refined across tasks without modifying model weights or the underlying physics solver. We evaluate the framework on a benchmark spanning multiple metasurface inverse-design task types, with separate training-aligned and held-out task families. Evolved skills raise in-distribution task success from 38% to 74%, increase criteria pass fraction from 0.510 to 0.870, and reduce average attempts from 4.10 to 2.30. On held-out task families, binary success changes only marginally, but improvements in best margin together with shifts in error composition and agent behavior indicate partial transfer of workflow knowledge. These results suggest that the main value of skill evolution lies in accumulating reusable solver-specific expertise around reliable computational engines, thereby offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.

Can LLMs Predict Academic Collaboration? Topology Heuristics vs. LLM-Based Link Prediction on Real Co-authorship Networks

2604.01379v1 by Fan Huang, Munjung Kim

Can large language models (LLMs) predict which researchers will collaborate? We study this question through link prediction on real-world co-authorship networks from OpenAlex (9.96M authors, 108.7M edges), evaluating whether LLMs can predict future scientific collaborations using only author profiles, without access to graph structure. Using Qwen2.5-72B-Instruct across three historical eras of AI research, we find that LLMs and topology heuristics capture distinct signals and are strongest in complementary settings. On new-edge prediction under natural class imbalance, the LLM achieves AUROC 0.714--0.789, outperforming Common Neighbors, Jaccard, and Preferential Attachment, with recall up to 92.9%; under balanced evaluation, the LLM outperforms all topology heuristics in every era (AUROC 0.601--0.658 vs. best-heuristic 0.525--0.538); on continued edges, the LLM (0.687) is competitive with Adamic-Adar (0.684). Critically, 78.6--82.7% of new collaborations occur between authors with no common neighbor -- a blind spot where all topology heuristics score zero but the LLM still achieves AUROC 0.652 by reasoning from author metadata alone. A temporal metadata ablation reveals that research concepts are the dominant signal (removing concepts drops AUROC by 0.047--0.084). Providing pre-computed graph features to the LLM degrades performance due to anchoring effects, confirming that LLMs and topology methods should operate as separate, complementary channels. A socio-cultural ablation finds that name-inferred ethnicity and institutional country do not predict collaboration beyond topology, reflecting the demographic homogeneity of AI research. A node2vec baseline achieves AUROC comparable to Adamic-Adar, establishing that LLMs access a fundamentally different information channel -- author metadata -- rather than encoding the same structural signal differently.
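
For reference, the four topology heuristics named above have standard textbook definitions, sketched here on a tiny invented co-authorship graph. The helper names are assumptions. In these standard definitions, the overlap-based scores (Common Neighbors, Jaccard, Adamic-Adar) are exactly zero for a pair with no shared neighbor, while Preferential Attachment depends only on the two degrees.

```python
import math

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # A shared neighbor is adjacent to both u and v, so its degree is at least 2;
    # the guard only protects against a zero logarithm.
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v] if len(adj[w]) > 1)

def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

# Invented graph: a, b, c, d form one component; e and f form another.
adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")])

pair_with_overlap = ("a", "d")     # share the neighbor c
pair_without_overlap = ("a", "e")  # no shared neighbor
for u, v in (pair_with_overlap, pair_without_overlap):
    print(u, v, common_neighbors(adj, u, v), jaccard(adj, u, v),
          adamic_adar(adj, u, v), preferential_attachment(adj, u, v))
```

For the pair `("a", "e")` the overlap scores carry no information at all, which is the regime where a predictor based on author metadata has something to add.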

Ambig-IaC: Multi-level Disambiguation for Interactive Cloud Infrastructure-as-Code Synthesis

2604.02382v1 by Zhenning Yang, Kaden Gruizenga, Tongyuan Miao, Patrick Tser Jern Kon, Hui Guan, Ang Chen

The scale and complexity of modern cloud infrastructure have made Infrastructure-as-Code (IaC) essential for managing deployments. While large language models (LLMs) are increasingly being used to generate IaC configurations from natural language, user requests are often underspecified. Unlike traditional code generation, IaC configurations cannot be executed cheaply or iteratively repaired, forcing the LLMs into an almost one-shot regime. We observe that ambiguity in IaC exhibits a tractable compositional structure: configurations decompose into three hierarchical axes (resources, topology, attributes) where higher-level decisions constrain lower-level ones. We propose a training-free, disagreement-driven framework that generates diverse candidate specifications, identifies structural disagreements across these axes, ranks them by informativeness, and produces targeted clarification questions that progressively narrow the configuration space. We introduce Ambig-IaC, a benchmark of 300 validated IaC tasks with ambiguous prompts, and an evaluation framework based on graph edit distance and embedding similarity. Our method outperforms the strongest baseline, achieving relative improvements of +18.4% and +25.4% on structure and attribute evaluations, respectively.
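
One simple way to rank disagreements among candidate specifications is by the entropy of the values the candidates propose for each field. The sketch below does exactly that. Entropy is used here as an assumed stand-in for the paper's notion of informativeness, the candidate dictionaries and key names are invented, and the hierarchy between axes is ignored for brevity.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution over `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_disagreements(candidates):
    """Return the fields the candidates disagree on, most contested first."""
    keys = sorted({k for c in candidates for k in c})
    scored = [(entropy([repr(c.get(k)) for c in candidates]), k) for k in keys]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [k for h, k in ranked if h > 0]  # drop fields on which all candidates agree

# Four hypothetical candidate specifications sampled for one ambiguous request.
candidates = [
    {"resource.database": "rds", "topology.subnet": "private", "attr.instance_size": "small"},
    {"resource.database": "rds", "topology.subnet": "public", "attr.instance_size": "medium"},
    {"resource.database": "rds", "topology.subnet": "private", "attr.instance_size": "large"},
    {"resource.database": "rds", "topology.subnet": "private", "attr.instance_size": "small"},
]
questions = rank_disagreements(candidates)
```

Each returned field is a candidate clarification question. A hierarchy-aware variant would ask about resources before topology and topology before attributes, since an answer at a higher level can make lower-level disagreements moot.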

No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents

2604.01350v1 by Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao

LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57--71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
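
The failure mode can be reproduced in a few lines: a note that is valid for one user is returned to another because the shared store carries no scope. The sketch below contrasts a raw read with a scope-aware read. Scope tagging is an illustrative assumption, not the write-time sanitization the paper evaluates, and the `SharedMemory` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """A toy shared knowledge layer holding (scope, author, text) notes."""
    notes: list = field(default_factory=list)

    def write(self, author, text, scope="user"):
        self.notes.append((scope, author, text))

    def read_raw(self, reader):
        """Raw shared state: every note is visible to every user."""
        return [text for _, _, text in self.notes]

    def read_scoped(self, reader):
        """Only team-wide notes and the reader's own notes are visible."""
        return [text for scope, author, text in self.notes
                if scope == "team" or author == reader]

memory = SharedMemory()
memory.write("alice", "report all amounts in EUR")            # valid for alice only
memory.write("alice", "fiscal year starts in April", scope="team")

raw_for_bob = memory.read_raw("bob")
scoped_for_bob = memory.read_scoped("bob")
```

With the raw read, Bob's next report would silently switch currency even though nobody attacked the system. The scoped read keeps the team-wide fact and drops the personal preference.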

Procedural Knowledge at Scale Improves Reasoning

2604.01348v1 by Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
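
The retrieval step can be pictured as a lookup from a verbalized subquestion to stored subroutines. The sketch below uses Jaccard token overlap as the similarity function, which is an assumption chosen to keep the example dependency-free; the abstract does not say which retriever the paper uses, and the three datastore entries are invented.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def retrieve(datastore, subquestion, k=2):
    """Return the subroutines of the k stored subquestions most similar to the query."""
    query = tokens(subquestion)

    def similarity(pair):
        stored = tokens(pair[0])
        return len(query & stored) / len(query | stored)

    ranked = sorted(datastore, key=similarity, reverse=True)
    return [subroutine for _, subroutine in ranked[:k]]

# Hypothetical (subquestion, subroutine) pairs decomposed from earlier reasoning traces.
datastore = [
    ("how to count lattice paths in a grid",
     "Use binomial coefficients: choose which steps go right."),
    ("how to check whether a number is prime",
     "Trial-divide by integers up to the square root."),
    ("how to find the roots of a quadratic",
     "Apply the quadratic formula and check the discriminant."),
]
hints = retrieve(datastore, "check whether 91 is prime", k=1)
```

The retrieved subroutine would then be placed in the model's reasoning context as a procedural hint rather than as a factual passage.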

IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering

2604.01344v1 by Elliott Watkiss-Leek, Reham Alharbi, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis

Competency question (CQ) elicitation represents a critical but resource-intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who possess the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel, semi-automated workflow that integrates Large Language Models (LLMs) within a collaborative, expert-in-the-loop process to address this challenge. The methodology is characterised by a core iterative loop: an initial LLM-based extraction of CQs from requirement documents, a co-creational review and feedback phase by domain experts on an accessible collaborative platform, and an iterative, feedback-driven reformulation of rejected CQs by an LLM until consensus is achieved. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked using a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in 2 real-world scenarios (scientific data, cultural heritage), demonstrating that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and exhibit high usability and effectiveness among domain experts. We release all code and experiments at https://github.com/KE-UniLiv/IDEA2

Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

2604.01280v1 by Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
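
The highlighting step described above can be sketched as follows. This is a minimal illustration, not the LoT implementation: the function name `highlight_evidence`, the `<evidence>` marker strings, and the use of one scalar attention score per retrieved passage are all assumptions made for the example.

```python
from typing import Sequence


def highlight_evidence(
    passages: Sequence[str],
    attention: Sequence[float],
    k: int = 2,
    open_tag: str = "<evidence>",    # hypothetical marker, not from the paper
    close_tag: str = "</evidence>",  # hypothetical marker, not from the paper
) -> str:
    """Wrap the k passages with the highest attention score in marker tags."""
    if len(passages) != len(attention):
        raise ValueError("need exactly one attention score per passage")
    # Indices of the k highest-scoring passages.
    ranked = sorted(range(len(passages)), key=lambda i: attention[i], reverse=True)
    top = set(ranked[:k])
    # Keep the original passage order; only the selected ones get markers.
    parts = [
        f"{open_tag}{p}{close_tag}" if i in top else p
        for i, p in enumerate(passages)
    ]
    return "\n".join(parts)
```

The marked-up context would then be passed back to the model for a second generation pass, which is where the "look twice" behaviour comes from.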

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

2604.01152v1 by Mohammad R. Abu Ayyash

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.
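
For readers unfamiliar with noisy top-2 routing, the sketch below shows the gating rule in the style of Shazeer et al.: add input-dependent Gaussian noise to the router logits, keep the two largest, and softmax over the survivors. It is a pure-Python illustration under stated assumptions. The logits are passed in directly (a real router would compute them from the token representation), and nothing here reflects the Brainstacks code itself.

```python
import math
import random
from typing import List, Optional, Sequence


def softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gates(
    clean_logits: Sequence[float],
    noise_logits: Sequence[float],
    k: int = 2,
    rng: Optional[random.Random] = None,
    train: bool = True,
) -> List[float]:
    """Return one gate weight per expert; all but the top k are exactly zero."""
    rng = rng or random.Random()
    noisy = []
    for c, n in zip(clean_logits, noise_logits):
        eps = rng.gauss(0.0, 1.0) if train else 0.0  # no noise at inference
        noisy.append(c + eps * softplus(n))
    # Keep only the k largest noisy logits.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    # Softmax over the selected experts, zero elsewhere.
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]
```

With the gates in hand, the layer output is the gate-weighted sum of the selected experts' outputs, so only two experts are evaluated per token.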

Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach

2604.01141v1 by Maofeng Tang, Hairong Qi

Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images of the same distribution can be generated as that of the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.

Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

2604.01118v1 by Reyhaneh Ahani Manghotay, Jie Liang

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
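
The two reported metrics follow the usual depth-estimation definitions: the threshold accuracy counts the fraction of pixels whose predicted-to-true depth ratio (in either direction) is below 1.25, and RMSE is the root mean squared error in depth units. A small sketch over flat lists of per-pixel depths, assuming all depths are strictly positive (the function names are illustrative):

```python
import math
from typing import Sequence


def delta1_accuracy(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold (depths must be > 0)."""
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return sum(r < threshold for r in ratios) / len(ratios)


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))
```

Higher is better for the threshold accuracy and lower is better for RMSE, which is why the abstract reports one going up (0.390 to 0.745) and the other going down (1.176 to 0.520).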

Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

2604.01029v1 by Jingjie Ning, Xueqi Li, Chengyu Yu

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
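
The abstract does not spell out the four matched conditions, so the sketch below assumes one plausible reading purely for illustration: (1) the draft model answering alone, (2) the reviewer re-solving without any draft, (3) the reviewer given a semantically null draft, and (4) the reviewer given the real draft. Under that assumption the three components are successive differences that telescope to the total second-pass gain. The condition names and the accuracy values in the usage note are hypothetical.

```python
from typing import Dict


def decompose_gain(
    draft_only: float,     # assumed condition 1: first model alone
    resolve_only: float,   # assumed condition 2: reviewer solves from scratch
    null_draft: float,     # assumed condition 3: reviewer sees a content-free draft
    full_draft: float,     # assumed condition 4: reviewer sees the real draft
) -> Dict[str, float]:
    """Split the total second-pass gain into three additive components."""
    return {
        "re_solving": resolve_only - draft_only,
        "scaffold": null_draft - resolve_only,
        "content": full_draft - null_draft,
        "total": full_draft - draft_only,
    }
```

For example, with made-up accuracies `decompose_gain(0.60, 0.72, 0.75, 0.73)` attributes most of the gain to re-solving and gives a negative content term, the pattern the abstract describes for weak drafts.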

Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

2604.00987v1 by Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi

We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.
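
One way to read the joint estimation problem described above is as a penalized objective with a data-fit term and a structural-consistency term evaluated on collocation points. The formula below is a hedged sketch, not the paper's exact formulation: the symbols $f_\theta$ (network), $g_\phi$ (structural model with parameters $\phi$), $\tilde{x}_j$ (collocation points), $\ell$ (data loss), and $\lambda$ (weighting parameter) are notation assumed here for illustration.

```latex
\min_{\theta,\,\phi}\;
\frac{1}{N}\sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr)
\;+\;
\lambda\,\frac{1}{M}\sum_{j=1}^{M}
\bigl(f_\theta(\tilde{x}_j) - g_\phi(\tilde{x}_j)\bigr)^{2}
```

In this reading, $\lambda = 0$ gives a purely data-driven network, while large $\lambda$ pushes the network toward the structural model everywhere the collocation points cover, which is the bias-variance tradeoff the abstract says the weighting parameter governs.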

Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

2604.00901v2 by Sha Li, Naren Ramakrishnan

Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, where sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.

Transforming OPACs into Intelligent Discovery Systems: An AI-Powered, Knowledge Graph-Driven Smart OPAC for Digital Libraries

2604.01262v1 by M. S. Rajeevan, B. Mini Devi

Traditional Online Public Access Catalogues (OPACs) are becoming less effective due to the rapid growth of scholarly literature. Conventional search methods, such as keyword indexing and Boolean queries, often fail to support efficient knowledge discovery. This paper proposes a Smart OPAC framework that transforms traditional OPACs into intelligent discovery systems using artificial intelligence and knowledge graph techniques. The framework enables semantic search, thematic filtering, and knowledge graph-based visualization to enhance user interaction and exploration. It integrates multiple open scholarly data sources and applies semantic embeddings to improve relevance and contextual understanding. The system supports exploratory search, semantic navigation, and refined result filtering based on user-defined themes. Quantitative evaluation demonstrates improvements in retrieval efficiency, relevance, and reduction of information overload. The proposed approach offers practical implications for modernizing digital library services and supports next-generation research workflows. Future work includes user-centric evaluation, personalization, and dynamic knowledge graph updates.

2604.00829v1 by Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers roughly 10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

2604.00778v1 by Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi

Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is encoded and reliably used.

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

2604.00739v1 by Sayed Hashim, Frank Soboczenski, Paul Cairns

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models along with self-supervised learning show better generalisation performance than threshold-based biomarkers, but this performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can help in generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

2604.00715v1 by Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures

2604.00694v1 by Lewis Tham, Nicholas Mac Gregor Garcia, Jungpil Hahn

Autonomous agents increasingly interact with the web, yet most websites remain designed for human browsers -- a fundamental mismatch that the emerging "Agentic Web" must resolve. Agents must repeatedly browse pages, inspect DOMs, and reverse-engineer callable routes -- a process that is slow, brittle, and redundantly repeated across agents. We observe that every modern website already exposes internal APIs (sometimes called shadow APIs) behind its user interface -- first-party endpoints that power the site's own functionality. We present Unbrowse, a shared route graph that transforms browser-based route discovery into a collectively maintained index of these callable first-party interfaces. The system passively learns routes from real browsing traffic and serves cached routes via direct API calls. In a single-host live-web benchmark of equivalent information-retrieval tasks across 94 domains, fully warmed cached execution averaged 950 ms versus 3,404 ms for Playwright browser automation (3.6x mean speedup, 5.4x median), with well-cached routes completing in under 100 ms. A three-path execution model -- local cache, shared graph, or browser fallback -- ensures the system is voluntary and self-correcting. A three-tier micropayment model via the x402 protocol charges per-query search fees for graph lookups (Tier 3), a one-time install fee for discovery documentation (Tier 1), and optional per-execution fees for site owners who opt in (Tier 2). All tiers are grounded in a necessary condition for rational adoption: an agent uses the shared graph only when the total fee is lower than the expected cost of browser rediscovery.
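
The three-path execution model and the adoption condition can be sketched together as a simple dispatch rule. This is an illustration only: the names `Decision` and `choose_path`, the dictionary-based caches, and the scalar fee and cost inputs are assumptions for the example and do not describe the Unbrowse API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    path: str              # "local_cache", "shared_graph", or "browser_fallback"
    route: Optional[str]   # the callable route, if one is already known


def choose_path(
    task: str,
    local_cache: Dict[str, str],
    shared_graph: Dict[str, str],
    total_fee: float,
    expected_rediscovery_cost: float,
) -> Decision:
    """Pick an execution path for a task (hypothetical sketch)."""
    # 1. A locally cached route costs nothing extra, so use it first.
    if task in local_cache:
        return Decision("local_cache", local_cache[task])
    # 2. Use the shared graph only when paying the fee beats rediscovery.
    if task in shared_graph and total_fee < expected_rediscovery_cost:
        return Decision("shared_graph", shared_graph[task])
    # 3. Otherwise fall back to driving a browser and rediscovering the route.
    return Decision("browser_fallback", None)
```

As a sanity check on the headline numbers, the ratio of the two reported averages is 3,404 / 950, roughly 3.58, which is consistent with the stated 3.6x mean speedup.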

A Survey of On-Policy Distillation for Large Language Models

2604.00626v1 by Mingyang Song, Mao Zheng

Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
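
To make the on-policy idea concrete, the toy sketch below samples a trajectory from the student and scores each visited state with a token-level reverse KL divergence, KL(student || teacher), which is one member of the $f$-divergence family. Everything here is an assumption for illustration: the "models" are bigram lookup tables rather than LLMs, and the function names are made up.

```python
import math
import random
from typing import Dict

Dist = Dict[str, float]        # next-token distribution
Model = Dict[str, Dist]        # toy model: previous token -> next-token distribution


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q) for categorical distributions over the same tokens."""
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0.0)


def sample(dist: Dist, rng: random.Random) -> str:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def on_policy_reverse_kl(student: Model, teacher: Model, start: str,
                         steps: int, rng: random.Random) -> float:
    """Average token-level KL(student || teacher) along a student-sampled trajectory."""
    prev, total = start, 0.0
    for _ in range(steps):
        # The teacher is queried on the state the student actually reached.
        total += kl(student[prev], teacher[prev])
        prev = sample(student[prev], rng)  # on-policy: the student picks the next token
    return total / steps
```

The contrast with off-policy distillation is the sampling line: replacing `student[prev]` with `teacher[prev]` there would evaluate the loss only on states the teacher visits, which is the exposure-bias setting the survey describes.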

Speech LLMs are Contextual Reasoning Transcribers

2604.00610v1 by Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
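
The abstract says the adapter uses CTC non-blank token probabilities to weight LLM embeddings but gives no further detail, so the sketch below shows only the simplest reading: for one speech frame, take the CTC posterior over the vocabulary, skip the blank entry, and form the probability-weighted sum of the LLM's token embeddings. The function name, the lack of renormalization, and the choice of blank index are assumptions, not the paper's design.

```python
from typing import List, Sequence


def ctc_weighted_embedding(
    frame_probs: Sequence[float],          # CTC posterior for one frame, incl. blank
    embeddings: Sequence[Sequence[float]],  # LLM embedding table, one row per token id
    blank_id: int = 0,                      # assumed position of the CTC blank token
) -> List[float]:
    """Probability-weighted sum of token embeddings over non-blank tokens."""
    dim = len(embeddings[0])
    out = [0.0] * dim
    for tok, p in enumerate(frame_probs):
        if tok == blank_id:
            continue  # the blank token carries no lexical content
        for d in range(dim):
            out[d] += p * embeddings[tok][d]
    return out
```

Under this reading, frames dominated by the blank token contribute vectors of small norm, while confident frames land close to a single token embedding in the LLM's input space.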

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

2604.00555v1 by Thanh Luong Tuan

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework--Role, Domain, and Interaction ontologies--that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.

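Summary: Enterprise adoption of large language models (LLMs) is limited by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture, implemented on the Foundation AgenticOS (FAOS) platform, that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework (Role, Domain, and Interaction ontologies) that gives LLM-based enterprise agents a formal semantic grounding. We formalize the concept of asymmetric neurosymbolic coupling, in which symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds), and we propose mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture in a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance) and find that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with the largest improvements where LLM parametric knowledge is weakest, particularly in Vietnam-localized domains. Our contributions are: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect, namely that the value of ontological grounding is inversely proportional to the LLM's training-data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.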

Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

2604.00536v1 by Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu

Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.

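Summary: Large language models (LLMs) perform well on downstream tasks largely because abundant supervised fine-tuning (SFT) data is available. In knowledge-intensive domains such as the humanities, social sciences, medicine, law, and finance, however, high-quality SFT data is scarce: expert curation is expensive, privacy constraints are strict, and label consistency is hard to guarantee. Recent work turns to synthetic data, typically by prompting a generator over domain documents and filtering the outputs with handcrafted rubrics. Yet rubric design depends on experts, transfers poorly across domains, and is usually tuned through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback on how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility for the target model and using that signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to the target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Building on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. The influence score serves as the reward for optimizing the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.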

Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

2604.00443v1 by Iyad Ait Hou, Rebecca Hwa

If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).

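Summary: If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work examines how much of that overlap comes from a lexical confound, where neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition shows that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries over into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and harms downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).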

TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

2604.00438v1 by Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu

In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truths during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.

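Summary: In-Context Reinforcement Learning (ICRL) lets large language models (LLMs) learn online from external rewards directly inside the context window. A central challenge in ICRL is reward estimation, because models usually have no access to ground truth at inference time. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a new ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL first retrieves, for a given query, the most relevant instances from an unlabeled evaluation set. In each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance, and a pseudo-label is derived from that set by majority voting. The pseudo-label then acts as a proxy for issuing reward messages and generating formative feedback, guiding the LLM through iterative refinement. Finally, the synthesized contextual information is combined with the original query into a comprehensive prompt, and the answer is determined by a final round of majority voting. Evaluated on mainstream reasoning and knowledge-intensive tasks, TR-ICRL shows significant performance gains: it improves Qwen2.5-7B by 21.23% on average on MedQA and by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of the approach. The code is available at https://github.com/pangpang-xuan/TR_ICRL.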
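
The pseudo-labeling step in the TR-ICRL abstract (majority voting over candidate answers, then using the winner as a reward proxy) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `pseudo_label` and `reward_messages`, the 0/1 reward encoding, and the tie-breaking rule (first answer seen wins) are all assumptions.

```python
from collections import Counter


def pseudo_label(candidates):
    """Pick the most frequent candidate answer and report its vote share.

    Ties are broken in favor of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    label, votes = Counter(candidates).most_common(1)[0]
    return label, votes / len(candidates)


def reward_messages(candidates, label):
    """Assumed reward proxy: 1 if a candidate agrees with the pseudo-label, else 0."""
    return [1 if answer == label else 0 for answer in candidates]


# Hypothetical candidate answers sampled for one retrieved instance.
sampled = ["B", "A", "B", "C", "B"]
label, share = pseudo_label(sampled)
rewards = reward_messages(sampled, label)
print(label, share, rewards)  # B 0.6 [1, 0, 1, 0, 1]
```

The vote share is returned alongside the label because a low share signals a weak pseudo-label; whether TR-ICRL uses such a confidence signal is not stated in the abstract.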

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

2604.00402v1 by Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

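Summary: Developing robust models that accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the distinctive traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy degrades performance when state-of-the-art models trained on Western data are deployed in other geographic contexts. In this work, we study how well Query-Centric Trajectory Prediction (QCNet) adapts when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. The experiments show that leveraging pretrained knowledge significantly improves prediction performance. In particular, selectively fine-tuning the decoder while freezing the encoder gives the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared with training from scratch. The study offers practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.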

RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems

2604.00387v1 by KrishnaSaiReddy Patil

RAG systems deployed across federal agencies for citizen-facing services are vulnerable to knowledge base poisoning attacks, where adversaries inject malicious documents to manipulate outputs. Recent work demonstrates that as few as 10 adversarial passages can achieve 98.2% retrieval success rates. We observe that RAG knowledge base poisoning is structurally analogous to software supply chain attacks, and propose RAGShield, a five-layer defense-in-depth framework applying supply chain provenance verification to the RAG knowledge pipeline. RAGShield introduces: (1) C2PA-inspired cryptographic document attestation blocking unsigned and forged documents at ingestion; (2) trust-weighted retrieval prioritizing provenance-verified sources; (3) a formal taint lattice with cross-source contradiction detection catching insider threats even when provenance is valid; (4) provenance-aware generation with auditable citations; and (5) NIST SP 800-53 compliance mapping across 15 control families. Evaluation on a 500-passage Natural Questions corpus with 63 attack documents and 200 queries against five adversary tiers achieves 0.0% attack success rate including adaptive attacks (95% CI: [0.0%, 1.9%]) with 0.0% false positive rate. We honestly report that insider in-place replacement attacks achieve 17.5% ASR, identifying the fundamental limit of ingestion-time defense. The cross-source contradiction detector catches subtle numerical manipulation attacks that bypass provenance verification entirely.

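Summary: RAG systems deployed across federal agencies for citizen-facing services are vulnerable to knowledge base poisoning attacks, in which adversaries inject malicious documents to manipulate outputs. Recent work shows that as few as 10 adversarial passages can reach a 98.2% retrieval success rate. We observe that RAG knowledge base poisoning is structurally analogous to software supply chain attacks and propose RAGShield, a five-layer defense-in-depth framework that applies supply chain provenance verification to the RAG knowledge pipeline. RAGShield introduces: (1) C2PA-inspired cryptographic document attestation that blocks unsigned and forged documents at ingestion; (2) trust-weighted retrieval that prioritizes provenance-verified sources; (3) a formal taint lattice with cross-source contradiction detection that catches insider threats even when provenance is valid; (4) provenance-aware generation with auditable citations; and (5) NIST SP 800-53 compliance mapping across 15 control families. In an evaluation on a 500-passage Natural Questions corpus with 63 attack documents and 200 queries against five adversary tiers, RAGShield achieves a 0.0% attack success rate, including against adaptive attacks (95% CI: [0.0%, 1.9%]), with a 0.0% false positive rate. The authors report that insider in-place replacement attacks reach 17.5% ASR, which identifies the fundamental limit of ingestion-time defense. The cross-source contradiction detector catches subtle numerical manipulation attacks that bypass provenance verification entirely.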
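
The reported interval (95% CI: [0.0%, 1.9%] around a 0.0% attack success rate) is worth unpacking, since zero observed successes does not mean a zero upper bound. The abstract does not say how the interval was computed; the sketch below assumes a Wilson score interval with n = 200 trials (the number of queries), which happens to reproduce the reported bounds. Both the method and the choice of n are assumptions, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed setting: 0 successful attacks out of 200 queries.
low, high = wilson_interval(0, 200)
print(f"{low:.1%} to {high:.1%}")  # 0.0% to 1.9%
```

With zero successes the upper bound simplifies to z^2 / (n + z^2), about 3.84 / 203.84, so tightening the bound below 1% under this method would need roughly 380 or more attack-free trials.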

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

2604.00344v1 by Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu

Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.

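Summary: Large language models (LLMs) have shown remarkable performance on a wide range of tasks. Solving complex problems, however, often requires coordinating multiple agents, which raises a fundamental question: how to select and interconnect those agents effectively. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that recasts topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. The method learns decentralized communication decisions using QMIX value factorization: each agent chooses from a set of communication actions, and together these choices induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under the Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy against token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy among existing methods while showing superior token efficiency and robustness to agent failure. Notably, on the challenging Humanity's Last Exam (HLE) with Gemini-3.1-Flash-Lite as the backbone, Agent Q-Mix reaches 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.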

2604.00284v1 by Gaurav Rajesh Parikh, Angikar Ghosal

We formally introduce an improvisational wordplay game called Connections to explore the reasoning capabilities of AI agents. Playing Connections combines skills in knowledge retrieval, summarization, and awareness of the cognitive states of other agents. We show how the game serves as a good benchmark for the social intelligence abilities of language-model-based agents, abilities that go beyond the agents' own memory and deductive reasoning and also involve gauging the understanding capabilities of other agents. Finally, we show how, through communication with other agents in a constrained environment, AI agents must demonstrate social awareness and intelligence in games involving collaboration.

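Summary: We formally introduce an improvisational wordplay game called Connections to explore the reasoning capabilities of AI agents. Playing Connections draws on knowledge retrieval, summarization, and awareness of other agents' cognitive states. We show that the game is a good benchmark for the social intelligence of language-model-based agents, which goes beyond an agent's own memory and deductive reasoning and also involves gauging how well other agents understand. Finally, we show that when communicating with other agents in a constrained environment, AI agents must demonstrate social awareness and intelligence in games that involve collaboration.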

A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks

2604.00167v1 by Joseph Townsend, Chandresh Pravin, Kwun Ho Ngan, Matthieu Parizy

Automatic program repair can be a challenging task, especially when resolving complex issues at the repository level, which often involves issue reproduction, fault localization, code repair, testing and validation. Issues of this scale can be commonly found in popular GitHub repositories or datasets that are derived from them. Some repository-level approaches separate localization and repair into distinct phases. Where this is the case, the fault localization approaches vary in terms of the granularity of localization. Although the impact of granularity has been explored to some degree for smaller datasets, not all studies isolate this issue from the separate question of localization accuracy by testing code repair under the assumption of perfect fault localization. To the best of the authors' knowledge, no repository-scale studies have explicitly investigated granularity under this assumption, nor conducted a systematic empirical comparison of granularity levels in isolation. We propose a framework for performing such tests by modifying the localization phase of the Agentless framework to retrieve ground-truth localization data and include this as context in the prompt fed to the repair phase. We show that under this configuration and as a generalization over the SWE-Bench-Mini dataset, function-level granularity yields the highest repair rate compared with line-level and file-level granularity. However, a deeper dive suggests that the ideal granularity may in fact be task dependent. This study is not intended to improve on the state-of-the-art, nor do we intend for results to be compared against any complete agentic frameworks. Rather, we present a proof of concept for investigating how fault localization may impact automatic code repair in repository-scale scenarios. We present preliminary findings to this end and encourage further research into this relationship between the two phases.

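Summary: Automatic program repair can be a challenging task, especially when resolving complex issues at the repository level, which typically involves issue reproduction, fault localization, code repair, testing, and validation. Issues of this scale are common in popular GitHub repositories and in datasets derived from them.
Some repository-level approaches split localization and repair into separate phases. In those cases, fault localization methods differ in the granularity at which they localize. Although the impact of granularity has been explored to some extent on smaller datasets, not all studies separate it from the question of localization accuracy by testing code repair under the assumption of perfect fault localization. To the best of the authors' knowledge, no repository-scale study has explicitly investigated granularity under this assumption or carried out a systematic empirical comparison of granularity levels in isolation.
We propose a framework for running such tests: the localization phase of the Agentless framework is modified to retrieve ground-truth localization data, which is then included as context in the prompt passed to the repair phase. We show that under this configuration, and as a generalization over the SWE-Bench-Mini dataset, function-level granularity yields the highest repair rate compared with line-level and file-level granularity. A closer look, however, suggests that the ideal granularity may in fact depend on the task.
This study is not intended to improve on the state of the art, nor are its results meant to be compared against any complete agentic framework. Instead, we present a proof of concept for investigating how fault localization may affect automatic code repair in repository-scale scenarios. We report preliminary findings to this end and encourage further research into the relationship between the two phases.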
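
To make the three granularity levels concrete, the sketch below builds the repair-prompt context for a known faulty line at file, function, or line level. It is an illustration only and does not reproduce the Agentless framework or the study's code: the function name `localization_context`, the single-line ground truth, and the toy `SOURCE` snippet are all assumptions.

```python
import ast

# Toy file with a seeded bug on line 2 (subtraction instead of addition).
SOURCE = '''def add(a, b):
    return a - b

def mul(a, b):
    return a * b
'''


def localization_context(source, faulty_line, granularity):
    """Return the code shown to the repair step for a 1-indexed faulty line."""
    lines = source.splitlines()
    if granularity == "file":
        return source
    if granularity == "line":
        return lines[faulty_line - 1]
    if granularity == "function":
        enclosing = None
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if node.lineno <= faulty_line <= node.end_lineno:
                    # Keep the innermost function: the one starting latest.
                    if enclosing is None or node.lineno >= enclosing.lineno:
                        enclosing = node
        if enclosing is None:
            return lines[faulty_line - 1]  # module-level code: fall back to the line
        return "\n".join(lines[enclosing.lineno - 1:enclosing.end_lineno])
    raise ValueError(f"unknown granularity: {granularity}")


function_ctx = localization_context(SOURCE, 2, "function")
line_ctx = localization_context(SOURCE, 2, "line")
file_ctx = localization_context(SOURCE, 2, "file")
```

Here the function-level context contains the full body of `add` but nothing from `mul`, while the line-level context is only the faulty `return` statement. The trade-off the abstract describes is visible even in this toy case: finer granularity removes distracting code but also removes the signature and surrounding logic a repair model may need.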

Knowledge Graphs