Knowledge Graphs

Publish Date	Title	Authors	Homepage	Code
2026-06-17	Structured Inference with Large Language Gibbs	Sanghyeok Choi et.al.	2606.19264v1	null
2026-06-17	The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot	Anselm Haak et.al.	2606.19197v1	null
2026-06-17	Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection	Jinhan Li et.al.	2606.19168v1	null
2026-06-17	Essential Subspace Merging for Multi-Task Learning	Longhua Li et.al.	2606.19164v1	null
2026-06-17	IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages	Sakshi Joshi et.al.	2606.19157v1	null
2026-06-17	Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation	Ramza Basharat et.al.	2606.19139v1	null
2026-06-17	Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening	Kasper Helverskov Petersen et.al.	2606.19133v1	null
2026-06-17	Towards an Agent-First Web: Redesigning the Web for AI Agents	Eranga Bandara et.al.	2606.19116v1	null
2026-06-17	Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science	Qiuyu Fang et.al.	2606.19051v1	null
2026-06-17	Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration	Xhevahire Tërnava et.al.	2606.19042v1	null
2026-06-17	Sumi: Open Uniform Diffusion Language Model from Scratch	Mengyu Ye et.al.	2606.19005v1	null
2026-06-17	GraphPO: Graph-based Policy Optimization for Reasoning Models	Yuliang Zhan et.al.	2606.18954v1	null
2026-06-17	SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents	Jingkun Luo et.al.	2606.18946v1	null
2026-06-17	Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking	Pierre Dantas et.al.	2606.18941v1	null
2026-06-17	Efficient Financial Language Understanding via Distillation with Synthetic Data	Wen-Fong et.al.	2606.18875v1	null
2026-06-17	Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness	Zijian Wang et.al.	2606.18874v1	null
2026-06-17	URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification	Xinze Zhang et.al.	2606.18861v1	null
2026-06-17	ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement	Bohou Zhang et.al.	2606.18850v1	null
2026-06-17	Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems	Hehai Lin et.al.	2606.18837v1	null
2026-06-17	Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration	Taewoon Kim et.al.	2606.18836v1	null
2026-06-17	Reinforcement Learning Foundation Models Should Already Be A Thing	Abdelrahman Zighem et.al.	2606.18812v1	null
2026-06-17	Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards	Yingyu Shan et.al.	2606.18810v1	null
2026-06-17	ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch	Tengfei Lyu et.al.	2606.18803v1	null
2026-06-17	Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports	Qingyu Lu et.al.	2606.18797v1	null
2026-06-17	SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction	Quanjiang Guo et.al.	2606.18780v1	null
2026-06-17	PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding	Jihyung Park et.al.	2606.18624v1	null
2026-06-17	Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance	Tianming Du et.al.	2606.18613v1	null
2026-06-17	Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting	Hao-Yuan Ma et.al.	2606.18566v1	null
2026-06-17	DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models	Patrick Cooper et.al.	2606.18557v1	null
2026-06-16	PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning	Bo Su et.al.	2606.18473v1	null
2026-06-16	TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network	Rohit Tewari et.al.	2606.18444v1	null
2026-06-16	RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation	Renzhi Wu et.al.	2606.18379v1	null
2026-06-16	EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation	Qi Chai et.al.	2606.18235v1	null
2026-06-16	Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses	Joy Bose et.al.	2606.18222v1	null
2026-06-16	Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients	Byung-Kwan Lee et.al.	2606.18216v1	null
2026-06-16	Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0	Diaa Fayed et.al.	2606.18205v1	null
2026-06-16	The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data	Nick Bettencourt et.al.	2606.18192v2	null
2026-06-16	Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure	Ziqi Zhou et.al.	2606.18154v1	null
2026-06-16	WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning	Yuwei Zhang et.al.	2606.18147v1	null
2026-06-16	Knowledge Reutilization in Meta-Reinforcement Learning	Yuan Meng et.al.	2606.18132v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models	Ramprasath Ganesaraja et.al.	2606.18114v1	null
2026-06-16	S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices	Marco Deano et.al.	2606.18096v1	null
2026-06-16	EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning	Wanhao Niu et.al.	2606.18092v1	null
2026-06-16	A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation	Haoyang Zhong et.al.	2606.18075v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond	Hobin Kim et.al.	2606.18062v1	null
2026-06-16	C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift	Davide Domini et.al.	2606.18003v1	null
2026-06-16	Dimensionality Controls When Modularity Helps in Continual Learning	Kathrin Korte et.al.	2606.17889v1	null
2026-06-16	AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources	Dalia Ali et.al.	2606.17887v1	null
2026-06-16	FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow	Bihao Zhan et.al.	2606.17856v1	null
2026-06-16	DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL	Esteban Schafir et.al.	2606.17821v1	null
2026-06-16	A Framework for Evaluating Agentic Skills at Scale	Maksim Shaposhnikov et.al.	2606.17819v1	null
2026-06-16	Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems	Xinru Liu et.al.	2606.18310v1	null
2026-06-16	LLMs Infer Cultural Context but Fail to Apply It When Responding	Yisong Miao et.al.	2606.17688v1	null
2026-06-16	SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector	Jingyuan Zhang et.al.	2606.18309v1	null
2026-06-16	Handling Feature Heterogeneity with Learnable Graph Patches	Yifei Sun et.al.	2606.17667v1	null
2026-06-16	SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches	Wencan Zhang et.al.	2606.17646v1	null
2026-06-16	Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification	Yiyue Qian et.al.	2606.17637v1	null
2026-06-16	Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs	Dong Huang et.al.	2606.17634v1	null
2026-06-16	OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation	Guibin Zhang et.al.	2606.17628v1	null
2026-06-16	Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning	Yanwei Cui et.al.	2606.17591v1	null
2026-06-16	Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow	Osamu Ito et.al.	2606.17577v1	null
2026-06-16	An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts	Joseph Walusimbi et.al.	2606.17555v1	null
2026-06-16	FoundCause: Causal Discovery with Latent Confounders from Observational Data	Patrick Blöbaum et.al.	2606.17516v1	null
2026-06-16	Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement	Ramaravind Kommiya Mothilal et.al.	2606.17506v1	null
2026-06-16	AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows	Jiahui Niu et.al.	2606.17474v1	null
2026-06-16	Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation	Yuyang Dai et.al.	2606.17459v1	null
2026-06-16	Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos	Bo Gou et.al.	2606.17437v1	null
2026-06-16	SoK: AI-Augmented Binary Reversing	Yujeong Kwon et.al.	2606.17398v1	null
2026-06-16	MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation	Casey Meisenzahl et.al.	2606.17379v1	null
2026-06-15	MemTrace: Probing What Final Accuracy Misses in Long-Term Memory	Xianxuan Long et.al.	2606.17328v1	null
2026-06-15	Nothing from Something: Can a Language Model Discover 0?	Phoebe Zeng et.al.	2606.17289v1	null
2026-06-15	Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering	Rohit Kundu et.al.	2606.17257v1	null
2026-06-15	Rift: A Conflict Signature for Deception in Language Models	Petr Nyoma et.al.	2606.17229v1	null
2026-06-15	When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval	Mingxu Tao et.al.	2606.17220v1	null
2026-06-15	Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming	Callum Barbour et.al.	2606.18293v1	null
2026-06-15	Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management	Mohamed Essam et.al.	2606.17203v1	null
2026-06-15	Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation	Prabhjot Singh et.al.	2606.17188v2	null
2026-06-15	RepSelect: Robust LLM Unlearning via Representation Selectivity	Filip Sondej et.al.	2606.17168v1	null
2026-06-15	A Causal Model of Theory of Mind in Conflict for Artificial Intelligence	Nikolos Gurney et.al.	2606.16944v1	null
2026-06-15	RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting	Arunkumar V et.al.	2606.16925v1	null
2026-06-15	LESS Is More: Mutual-Stability Sampling for Diffusion Language Models	Amr Mohamed et.al.	2606.16908v1	null
2026-06-15	Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier	Keizo Kato et.al.	2606.16811v1	null
2026-06-15	OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models	Tianyi Lin et.al.	2606.16774v1	null
2026-06-15	Misinformation Propagation in Benign Multi-Agent Systems	Jonas Becker et.al.	2606.16710v1	null
2026-06-15	User as Code: Executable Memory for Personalized Agents	Bojie Li et.al.	2606.16707v1	null
2026-06-15	Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis	Jinghan Wang et.al.	2606.16684v1	null
2026-06-15	The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies	Christopner Koch et.al.	2606.16649v1	null
2026-06-15	DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation	Xifeng Xue et.al.	2606.16633v1	null
2026-06-15	Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI	Mohammed Amine Mouhoub et.al.	2606.16629v1	null
2026-06-15	VeriGraph: Towards Verifiable Data-Analytic Agents	Jiajie Jin et.al.	2606.16603v1	null
2026-06-15	SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents	Qiao Xiao et.al.	2606.16591v2	null
2026-06-15	Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict	Sozan Sulaiman Maghdid et.al.	2606.17119v1	null
2026-06-15	Kairos: A Native World Model Stack for Physical AI	Kairos Team et.al.	2606.16533v2	null
2026-06-15	SkillWiki: A Living Knowledge Infrastructure for Agent Skills	Dingcheng Huang et.al.	2606.16523v1	null
2026-06-15	Model Graph Inductive Learning for Knowledge Graph Completion	Mohommad Esmaei Khani et.al.	2606.16509v1	null
2026-06-15	REFLEX: Reflective Evolution from LLM Experience	Pan Wang et.al.	2606.16496v1	null
2026-06-15	Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering	Jieyuan Liu et.al.	2606.16494v1	null
2026-06-15	Unified Multimodal Model for Brain MRI Imputation and Understanding	Zhiyun Song et.al.	2606.16484v1	null

Abstracts

Structured Inference with Large Language Gibbs

2606.19264v1 by Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
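
As a rough illustration of the resampling loop described above, the following Python sketch runs Gibbs sampling over two binary variables. The function `toy_conditional` is a hypothetical stand-in for an LLM's conditional distribution (here a fixed lookup table rather than a model query). The variable names, probabilities, and sweep count are assumptions made for illustration and are not taken from the paper.

```python
import random

# Hypothetical stand-in for an LLM conditional p(variable = 1 | other variables).
# A real system would query the model; this fixed table is purely illustrative.
def toy_conditional(name, state):
    if name == "rain":
        return 0.8 if state["wet_grass"] == 1 else 0.1
    if name == "wet_grass":
        return 0.9 if state["rain"] == 1 else 0.2
    raise KeyError(name)

def gibbs_sample(conditional, init_state, n_sweeps, rng):
    """Resample each variable in turn, conditioned on the current values of the others."""
    state = dict(init_state)
    samples = []
    for _ in range(n_sweeps):
        for name in state:
            p_one = conditional(name, state)
            state[name] = 1 if rng.random() < p_one else 0
        samples.append(dict(state))  # record the state after each full sweep
    return samples

rng = random.Random(0)
samples = gibbs_sample(toy_conditional, {"rain": 0, "wet_grass": 0}, 2000, rng)
rain_rate = sum(s["rain"] for s in samples) / len(samples)
```

Because every variable is repeatedly resampled given the others, no single generation order is privileged, which is the property the abstract contrasts with one-pass autoregressive sampling.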

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

2606.19197v1 by Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

Abduction is a central approach to explaining missing entailments from a knowledge base by providing a hypothesis that would, if added to the knowledge base, make the missing entailment hold. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature restrictions and minimality in size and of introduced conflicts. Naturally, hypotheses that satisfy more than one of these properties or combine a property with an optimality criterion would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for EL_bot under brave and AR semantics. Our main observation is that requiring additional properties for hypotheses often does not lead to an increase in complexity.

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

2606.19168v1 by Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

Essential Subspace Merging for Multi-Task Learning

2606.19164v1 by Longhua Li, Lei Qi, Xin Geng, Qi Tian

Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

2606.19157v1 by Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

2606.19139v1 by Ramza Basharat, Muhammad Usman Ali

Automatic Handwritten Text Recognition (HTR) is an inherently challenging task, and its complexity increases further when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research on Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag in research is primarily due to the unique challenges posed by the script and the scarcity of available benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text-line dataset specifically curated from materials written by Katibs in historical times. It encompasses a diverse range of flat-nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.
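
For readers unfamiliar with the two metrics named above, the sketch below shows the conventional way CER and WER are computed: Levenshtein edit distance divided by the reference length, at character and word level respectively. It is a generic illustration, not the paper's evaluation code. Treating each Unicode code point as one character is an assumption that may not match how the authors segment Urdu text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    # Assumption: one Unicode code point counts as one character.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One wrong character in an 11-character, 3-word reference.
example_cer = cer("the cat sat", "the cat sit")  # 1/11
example_wer = wer("the cat sat", "the cat sit")  # 1/3
```

Both functions assume a non-empty reference. A single character error costs far more in WER than in CER, which is why the two are usually reported together.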

Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

2606.19133v1 by Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn, Kristian S. Thygesen, Mikkel N. Schmidt

Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed from lower levels of theory or rely on rotation-invariant scalar features, limiting their geometric expressiveness. We explore the use of equivariant graph neural networks for optical spectra prediction, adapting GotenNet to this task and evaluating it on multiple datasets including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and on predicting the static real permittivity, both of particular relevance for thin-film optics.

Towards an Agent-First Web: Redesigning the Web for AI Agents

2606.19116v1 by Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

2606.19051v1 by Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy that partitions the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

2606.19042v1 by Xhevahire Tërnava

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability bound at compile time or runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built binary free of dead code for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.
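
The abstract does not specify how the variant dispatcher is built, so the following Python sketch is only one plausible reading of the idea: a lookup from a requested feature selection to a separately generated binary. The feature names, flag mapping, and binary paths are all hypothetical, and nothing is executed; the function only resolves which binary would be invoked.

```python
# Hypothetical variant table: each feature selection maps to its own generated binary.
VARIANTS = {
    frozenset({"lines"}): "./bin/wc_lines",
    frozenset({"words"}): "./bin/wc_words",
    frozenset({"lines", "words"}): "./bin/wc_lines_words",
}

# Hypothetical mapping from command-line flags to feature names.
FLAG_TO_FEATURE = {"-l": "lines", "-w": "words"}

def dispatch(argv):
    """Return (binary, remaining_args) for the variant matching the requested flags."""
    features = frozenset(FLAG_TO_FEATURE[a] for a in argv if a in FLAG_TO_FEATURE)
    rest = [a for a in argv if a not in FLAG_TO_FEATURE]
    if features not in VARIANTS:
        # In a regeneration-based setup, this is where a new variant might be requested.
        raise LookupError(f"no generated variant for features {sorted(features)}")
    return VARIANTS[features], rest

binary, args = dispatch(["-w", "-l", "notes.txt"])
```

In this reading the flags never reach the selected binary: each variant contains only the code for its own feature set, so the choice is made entirely outside the artifact.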

Sumi: Open Uniform Diffusion Language Model from Scratch

2606.19005v1 by Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

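Abstract: Diffusion models have become a promising alternative to autoregressive models. Among them, uniform diffusion language models (UDLMs) in principle allow any token to be updated at any step, enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both a large parameter scale and a large token budget. Autoregressive modeling and masked diffusion modeling already have capable models at scale for the community to study and build on; uniform diffusion has none. A UDLM pretrained from scratch at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi is competitive with autoregressive models trained at similar token budgets on knowledge, reasoning, and coding benchmarks, but underperforms on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and spurs work on its still poorly understood aspects.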

GraphPO: Graph-based Policy Optimization for Reasoning Models

2606.18954v1 by Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token budgets or response budgets.

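Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for improving the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using the final answers. This paradigm has two limitations. First, independent responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and they repeat similar exploration. Moreover, tree-based methods ignore this dispersion and only perform local comparisons within separate branches, which can raise the variance of advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget from redundant expansions to diverse exploration. In addition, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and improves reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token or response budgets.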
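The GraphPO abstract describes rollouts as a directed acyclic graph whose nodes are semantic states and whose edges are reasoning steps. The following is a minimal sketch of that data structure only, assuming each step already comes with a canonical state key (the strings `s1`, `s2`, `goal` and the function `build_graph` are hypothetical; how the paper summarizes or matches states is not specified in the abstract).

```python
from typing import Dict, List, Set, Tuple

# (reasoning step, key of the state reached after the step); keys are hypothetical
Rollout = List[Tuple[str, str]]


def build_graph(rollouts: List[Rollout], root: str = "START"
                ) -> Dict[str, Set[Tuple[str, str]]]:
    """Return adjacency: state -> set of (step, next_state) edges."""
    graph: Dict[str, Set[Tuple[str, str]]] = {root: set()}
    for rollout in rollouts:
        state = root
        for step, next_state in rollout:
            # States with the same key collapse into one node.
            graph.setdefault(state, set()).add((step, next_state))
            graph.setdefault(next_state, set())
            state = next_state
    return graph


# Two rollouts that take different first steps but reach the same final state.
rollouts: List[Rollout] = [
    [("expand", "s1"), ("solve", "goal")],
    [("factor", "s2"), ("solve", "goal")],
]
graph = build_graph(rollouts)
tree_nodes = 1 + sum(len(r) for r in rollouts)  # a prefix tree would keep 5 nodes
```

In this toy case the graph holds four nodes where a tree would hold five, because both rollouts end in the shared `goal` node.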

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

2606.18946v1 by Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

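Abstract: Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, texts co-authored by humans and large language models, faces two gaps: existing methods classify each sentence in isolation and ignore inter-sentence dependencies, and existing benchmarks omit the latest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents drawn from PubMed and XSum and generated by DeepSeek-V3.2 and Kimi K2 under strict quality controls, including a perplexity-consistency filter missing from prior benchmarks. We recast S-AGTD as structured prediction over a document's sentence sequence and instantiate it as SenFlow, which integrates graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow achieves state-of-the-art performance on MOSAIC, with an average Macro-F1 margin of +4.15 percentage points on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors can still exploit. Code and data: https://github.com/luojingkun22/SenFlow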
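The SenFlow abstract contrasts per-sentence classification with linear-chain CRF decoding over the sentence sequence. As a rough illustration of why the two can disagree, here is a generic Viterbi decoder; the label set, the scores, and the function name `viterbi` are made up for this sketch, and the graph-based propagation step from the abstract is not modeled.

```python
from typing import List, Sequence

LABELS = ("human", "ai")  # hypothetical label set for illustration


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence under a linear-chain model.

    emissions[t][y]   : score of label y for sentence t
    transitions[p][y] : score of moving from label p to label y
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Sentence 2 leans slightly towards "ai" when scored in isolation.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
sticky = [[1.0, -1.0], [-1.0, 1.0]]   # rewards staying in the same label
neutral = [[0.0, 0.0], [0.0, 0.0]]    # equivalent to independent decisions

with_context = viterbi(emissions, sticky)
in_isolation = viterbi(emissions, neutral)
```

With neutral transitions the middle sentence is labeled `ai`; with transitions that favor label continuity, the same emission scores decode to an all-`human` sequence.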

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

2606.18941v1 by Pierre Dantas, Lucas Cordeiro, Waldir Junior

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

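Abstract: PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, leading to vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from the left power rail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by the connectionPointIn sequence of the right power rail ensures that SET coils are processed before RESET coils, in line with IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows that all of them produce a full GOTO IR with nondeterministic inputs and rung logic, rather than the previous empty IR. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or with unsupported timer semantics are reported as discovered limitations. The artifact is on Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).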
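The Graph-ESBMC-PLC abstract describes a depth-first traversal from leftPowerRail to each coil that collects the contacts on each path as a conjunction. The sketch below shows that idea on a hand-written node table; the table layout, the variable names (`Start`, `Hold`, `NotStop`, `Motor`), and the OR-of-ANDs rendering of parallel branches are assumptions for illustration, not the tool's actual parser or IR, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified node table: localId -> (kind, variable, refLocalIds).
# A real PLCopen XML export carries far more detail than this.
Node = Tuple[str, str, List[int]]

NODES: Dict[int, Node] = {
    1: ("leftPowerRail", "", []),
    2: ("contact", "Start", [1]),
    3: ("contact", "Hold", [1]),
    4: ("contact", "NotStop", [2, 3]),
    5: ("coil", "Motor", [4]),
}


def rung_paths(nodes: Dict[int, Node]) -> Dict[str, List[List[str]]]:
    """Map each coil to the contact conjunctions found on paths from the rail."""
    # Invert refLocalId links so the search can run from the rail forwards.
    successors: Dict[int, List[int]] = {nid: [] for nid in nodes}
    for nid, (_, _, refs) in nodes.items():
        for ref in refs:
            successors[ref].append(nid)

    result: Dict[str, List[List[str]]] = {}

    def dfs(nid: int, contacts: List[str]) -> None:
        kind, var, _ = nodes[nid]
        if kind == "coil":
            result.setdefault(var, []).append(contacts)
            return
        if kind == "contact":
            contacts = contacts + [var]  # copy, so sibling branches stay separate
        for nxt in successors[nid]:
            dfs(nxt, contacts)

    for nid, (kind, _, _) in nodes.items():
        if kind == "leftPowerRail":
            dfs(nid, [])
    return result


def to_expression(paths: List[List[str]]) -> str:
    """Render paths as an OR of ANDs (one reading of parallel branches)."""
    return " OR ".join("(" + " AND ".join(p) + ")" for p in paths)


paths = rung_paths(NODES)
motor_logic = to_expression(paths["Motor"])
```

For this toy rung, `motor_logic` is `(Start AND NotStop) OR (Hold AND NotStop)`.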

Efficient Financial Language Understanding via Distillation with Synthetic Data

2606.18875v1 by Wen-Fong Huang, Edwin Simpson

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

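Abstract: Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and the cost of expert annotation. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher model to compact student models. The framework is designed for low-resource conditions, in which a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework offers a practical route to resource-efficient domain adaptation in financial NLP with minimal human labelling effort.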
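The distillation abstract says seeds for synthetic generation are chosen from clusters of the real examples rather than at random. A minimal sketch of one way to do this follows, assuming plain k-means over example embeddings and picking the example nearest each centroid; the abstract names neither the clustering algorithm nor the selection rule, so both are stand-ins, and the 2-D points are invented.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iters: int = 20) -> List[int]:
    """Plain k-means with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_seeds(points: List[Point], k: int) -> List[int]:
    """Return, per non-empty cluster, the index of the point closest to its centroid."""
    assign = kmeans(points, k)
    seeds = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        centroid = [sum(col) / len(idx) for col in zip(*(points[i] for i in idx))]
        seeds.append(min(idx, key=lambda i: sq_dist(points[i], centroid)))
    return seeds


# Invented 2-D "embeddings" forming two well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.0),
          (0.1, 0.1), (10.2, 10.0), (10.1, 9.9)]
seeds = select_seeds(points, 2)
```

Each returned index would then serve as a few-shot seed for one region of the data, so every cluster contributes a seed regardless of its size.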

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

2606.18874v1 by Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

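Abstract: AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments, and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records, and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested, and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, in which runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting, and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation, and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but also by whether their synthesis and validation processes remain attributable, inspectable, and scientifically accountable.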

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

2606.18861v1 by Xinze Zhang

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

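Abstract: Reconstructing simulation-ready digital twins of articulated objects remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) relative to the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) relative to the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.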
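The relative reductions quoted in the KinemaForge abstract can be checked directly from the reported joint-axis errors. The short calculation below uses only numbers from the abstract; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Joint-axis errors in degrees, as reported in the abstract.
vs_paris = relative_reduction(4.52, 2.83)  # (4.52 - 2.83) / 4.52
vs_ditto = relative_reduction(5.30, 2.83)  # (5.30 - 2.83) / 5.30

print(f"vs PARIS: -{vs_paris:.1f}%")
print(f"vs Ditto: -{vs_ditto:.1f}%")
```

Both values round to the figures stated in the abstract (37.4% and 46.6%).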

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

2606.18850v1 by Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

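Abstract: Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while generative approaches based on large language models (LLMs), despite their linguistic fluency, show limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is then refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments show that ScholarSum significantly outperforms previous baselines in both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.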

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

2606.18837v1 by Hehai Lin, Qi Yang, Chengwei Qin

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

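Abstract: Automatic generation of Multi-Agent Systems (MAS) based on Large Language Models (LLMs) has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs show that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis shows that the evolved Meta-Skills are highly robust and transfer well across unseen tasks and different LLMs.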

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

2606.18836v1 by Taewoon Kim, Emma van Zoelen, Mark Neerincx

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

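Abstract: Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize the collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of the interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.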

Reinforcement Learning Foundation Models Should Already Be A Thing

2606.18812v1 by Abdelrahman Zighem, Jill-Jênn Vie

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

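Abstract: Foundation models for language and vision are built on internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly suited to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning; offline, competitively with VI-LCB.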
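The abstract's second point is that a finite MDP admits a fixed-size, table-shaped sufficient statistic that does not grow with the number of episodes. One common choice for tabular MDPs is the table of transition counts plus per-pair reward sums; the sketch below uses that choice as an assumption, since the abstract does not say which statistic the authors use. The class `TabularStats` and its methods are hypothetical names.

```python
from typing import Iterable, List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)


class TabularStats:
    """Counts and reward sums for a finite MDP with S states and A actions."""

    def __init__(self, n_states: int, n_actions: int) -> None:
        # counts[s][a][s'] = number of observed transitions s --a--> s'
        self.counts = [[[0] * n_states for _ in range(n_actions)]
                       for _ in range(n_states)]
        self.reward_sum = [[0.0] * n_actions for _ in range(n_states)]

    def update(self, episode: Iterable[Transition]) -> None:
        """Fold one episode into the tables."""
        for s, a, r, s_next in episode:
            self.counts[s][a][s_next] += 1
            self.reward_sum[s][a] += r

    def size(self) -> int:
        """Number of stored scalars; it depends on S and A only."""
        s = len(self.counts)
        a = len(self.counts[0])
        return s * a * s + s * a

    def transition_estimate(self, s: int, a: int) -> List[float]:
        """Empirical next-state distribution for (s, a), zeros if unseen."""
        total = sum(self.counts[s][a])
        if total == 0:
            return [0.0] * len(self.counts[s][a])
        return [c / total for c in self.counts[s][a]]


stats = TabularStats(n_states=2, n_actions=2)
size_before = stats.size()
stats.update([(0, 1, 1.0, 1), (1, 0, 0.0, 0)])
stats.update([(0, 1, 0.0, 1)])
size_after = stats.size()
```

However many episodes are folded in, `size()` stays at S·A·S + S·A entries, which is what makes a table of this kind usable as a fixed-shape model input.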

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

2606.18810v1 by Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.

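Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training large language models (LLMs) for reasoning tasks, but representative methods such as GRPO assign uniform credit to all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and we prove that distilling from a self-teacher built from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. In addition, SC-GRPO achieves higher performance than OPD.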
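The SC-GRPO abstract describes using a per-token KL divergence between the original and self-conditioned distributions as a multiplicative weight on GRPO gradients. The sketch below illustrates only that weighting idea on toy distributions; the KL direction, the rescaling to mean 1, and the function names are assumptions of this example, not details taken from the paper.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q is positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_weights(original: Sequence[Sequence[float]],
                  conditioned: Sequence[Sequence[float]]) -> List[float]:
    """Per-token KL values, rescaled to average 1 (a hypothetical normalisation)."""
    kls = [kl_divergence(p, q) for p, q in zip(original, conditioned)]
    mean = sum(kls) / len(kls)
    if mean == 0.0:
        return [1.0] * len(kls)  # no divergence anywhere: fall back to uniform credit
    return [k / mean for k in kls]


def weighted_advantages(advantage: float, weights: Sequence[float]) -> List[float]:
    """Scale one sequence-level advantage into token-level credit."""
    return [advantage * w for w in weights]


# Toy two-token sequence over a two-word vocabulary.
original = [[0.5, 0.5], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.9, 0.1]]  # conditioning changes only the second token

weights = token_weights(original, conditioned)
credit = weighted_advantages(1.0, weights)
```

Here the first token, whose distribution is unchanged by conditioning, receives zero weight, and all of the credit shifts to the second token.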

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

2606.18803v1 by Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

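Abstract: Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and can naturally be expressed as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints that are rarely addressed together: on a platform with millions of daily orders, the logs exceed any LLM's context window by orders of magnitude; most users are long-tail users with too few interactions for per-user profiling; and superficially fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates, and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test, including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.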

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

2606.18797v1 by Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

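Abstract: Reliable evaluation of radiology reports requires strict clinical accuracy, because omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance in terms of detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models detect errors effectively but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest that one-pass trained metrics are the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the D-R balance is critical. We will release the dataset and metric.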

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

2606.18780v1 by Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

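Abstract: Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are held back by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which combines a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To remove the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments on benchmark datasets for MNER, MRE, and MEE show that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.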

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

2606.18624v1 by Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

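Abstract: Natural language understanding often depends on meanings that are implied rather than explicitly stated, which requires pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with pragmatic inference and often choose literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37% and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failing to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.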

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

2606.18613v1 by Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

摘要：醫療 LLMs 在近期最可能的角色是協助而非取代醫生，但目前的評估往往測試孤立的能力：臨床知識、EHR 系統互動或病人溝通。醫生的協助需要在同一互動中協調這些能力，在這裡醫生發出不明確的請求，病人模糊地描述症狀，而 EHR 系統則需要精確的工具使用。我們引入了 PhysAssistBench，一個用於互動醫生-病人-EHR 協助的基準。PhysAssistBench 由真實的 MIMIC-IV 案例構建，使用可擴展的管道來構建具主動性的病人：互動的、基於記錄的代理，將靜態的 EHR 記錄轉化為多輪臨床場景，同時保持臨床事實性。PhysAssistBench 提供了一個經過策劃的雙語評估集，包含 1,296 個手動審核和醫生驗證的回合。與領先的 LLMs 進行的實驗顯示，當前模型在這種環境中仍然不可靠，這暴露了臨床 LLMs 的一個關鍵瓶頸：可靠的協助需要在知識、溝通和系統之間進行協調，而不是在任何一個方面的孤立增長。

2606.18566v1 by Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

摘要：人群計數是計算機視覺中的一項基本任務。然而，儘管在現實世界中具有實際重要性，低光環境下的人群計數仍然在很大程度上未被探索。現有的方法主要集中在光線良好的場景上，或依賴單一模態的紅綠藍（RGB）表示，這在極度黑暗和複雜的非均勻照明下往往變得不可靠。為了解決這個問題，我們構建了三個新的低光人群計數基準，這些基準由兩個合成數據集SHA_Dark和SHB_Dark，以及一個現實世界基準LC-Crowd（低光人群數據集）組成。受到基於Retinex的物理建模的啟發，我們引入深度和Canny邊緣線索作為補充幾何和結構先驗，以增強低光條件下的內在反射率表示。我們提出了一個多模態超圖融合模塊，將RGB外觀、深度幾何和邊緣結構線索作為統一超圖中的節點，並通過動態超邊構建和信息傳遞明確捕捉它們的高階互補關係。此外，為了在密集預測中自適應地分配計算，我們提出了一個可變形矩形稀疏注意力（DRSA）模塊，通過錨點感知估計和自適應矩形窗口建模將計算集中在信息豐富的區域。基於這些設計，我們開發了一個統一的低光計數網絡（LCNet），以實現穩健的低光人群計數。在三個基準上的廣泛實驗表明，所提出的方法在現有的最先進（SOTA）方法中實現了最佳的整體性能。代碼在補充材料中。數據集將在接受後公開。

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

2606.18557v1 by Patrick Cooper, Alvaro Velasquez

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

摘要：一個基於規則的邏輯解決器在我們的基準測試中以 50 微秒內解決每個實例，並且準確率達到 100%；最佳的前沿語言模型最多達到 65%，在渲染穩健評估下則降至 23.5%（在四次表面渲染的最壞情況下）。我們介紹 DeFAb（可駁回的歸納基準），這是一個數據集和生成管道，將四十年的公共資助知識庫轉換為可駁回歸納的形式化實例：通過覆蓋預設來構建解釋異常的假設，同時保留無關的期望。因為每個假設必須通過多項式時間檢查以驗證有效推導、保守性和最小性，DeFAb 使邏輯嚴謹成為衡量創造力和理論推理的工具，評分理論修訂的有序構建，而不是流暢但毀滅理論的散文。該管道將分類層次（OpenCyc、YAGO、Wikidata）與行為屬性圖（ConceptNet、UMLS）配對，以從 18 個來源生成 372,648+ 個實例，涵蓋 33.75M 的具體化規則，並設有三個層級和多項式時間可驗證的黃金標準。四個前沿模型並不可靠地內化可駁回推理：渲染穩健的 Level 2 準確率為 7.8-23.5%；思維鏈變異（約 36 pp）超過任何模型間的差距；而匹配的污染控制則隔離出 +19.4 pp 的 Level 3 差距。我們進一步發布 DeFAb-Hard（235 個實例的 Level 3 難度變體；最佳模型 53.3% 對比 100% 符號）和 CONJURE（560 個 Lean 4/Mathlib 實例的核心驗證轉化創造力變體，其黃金答案是證明核心之前未包含的定義，無評判的驗證者；一項試點發現零個新概念）。同一驗證者也作為偏好優化（DPO，RLVR/GRPO）的精確獎勵。根據 MIT 授權發布，網址為 https://huggingface.co/datasets/PatrickAllenCooper/DeFAb。
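
As a reading aid for the DeFAb entry above, the rendering-robust figure (worst case over four surface renderings) could be computed along the following lines. This is a minimal sketch under one assumed reading: an instance counts as solved only if the model is correct on every rendering of it. The abstract could also mean the lowest aggregate accuracy across renderings. The function names and the data layout are hypothetical and are not taken from the DeFAb release.

```python
from typing import Dict, List


def rendering_robust_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of instances answered correctly under ALL renderings.

    `results` maps a (hypothetical) instance id to one correctness flag
    per surface rendering of that instance.
    """
    if not results:
        raise ValueError("no instances to score")
    solved = sum(1 for flags in results.values() if all(flags))
    return solved / len(results)


def best_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of instances answered correctly under AT LEAST ONE rendering."""
    if not results:
        raise ValueError("no instances to score")
    solved = sum(1 for flags in results.values() if any(flags))
    return solved / len(results)


# Toy data: four instances, four renderings each (made up for illustration).
toy_results = {
    "inst-1": [True, True, True, True],
    "inst-2": [True, False, True, True],
    "inst-3": [False, False, False, False],
    "inst-4": [True, True, True, True],
}

robust = rendering_robust_accuracy(toy_results)
best = best_case_accuracy(toy_results)
```

On the toy data the worst-case score is 0.5 while the best-case score is 0.75, which illustrates how a model can look much stronger when only its most favourable rendering is counted.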

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

2606.18473v1 by Bo Su, Ankit Shah, Thai Le

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

摘要：機器遺忘對於大型語言模型（LLMs）的目標是移除特定知識，同時保留模型的其他能力。然而，遺忘的知識與保留的知識之間的界限往往不明確，因為相關甚至遙遠的信息可能在模型中交織在一起。在本文中，我們從數據中心的角度研究LLM的遺忘，並測量遺忘效應如何從遺忘集傳播到同域和異域知識。我們發現了一個一致的衰減模式：附帶損害在遺忘集附近最強，隨著語義距離的增加而減弱，但在領域邊界並不會消失。我們進一步詢問這種損害是否可以在執行遺忘之前進行審核。我們將遺忘集審核形式化為一個預遺忘預測任務，並分析哪些數據特徵最能預測下游損害。我們的結果顯示，遺忘集與評估集之間的交互特徵提供了最強的信號，這表明附帶損害在模型更新發生之前部分反映在數據幾何中。這些發現將遺忘集審核定位為識別風險遺忘運行和設計更可靠遺忘程序的早期預警工具。

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

2606.18444v1 by Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called the Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). In particular, TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The approach constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder then employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to manage severe class imbalance and emphasize discriminative learning, a composite loss function is introduced that combines an Information Noise Contrastive Estimation (InfoNCE)-based contrastive loss with Focal Loss. This integration helps improve fraud identification while mitigating false negatives.

摘要：近年來，由於數據高度不平衡、詐騙模式不斷演變以及交易實體之間的複雜關係結構，信用卡詐騙檢測面臨重大挑戰。為了解決這些問題，本研究提出了一個名為時間感知多關係引導圖神經網絡（TMR GGNN）的新框架。特別是，所提出的TMR GGNN通過在時間窗口內建模客戶、商家、設備和IP之間的異質互動，擴展了編碼器-解碼器圖神經網絡GNN架構。隨後，所提出的TMR GGNN方法構建了一個動態的多關係圖，並在編碼器內部引入了一種時間感知關係注意機制，以根據時間接近性和語義上下文自適應地加權交易相關性。因此，解碼器採用了對比學習模塊，以區分真實和合成的交易模式，同時提高模型對稀有詐騙案例的泛化能力。此外，為了有效管理嚴重的類別不平衡並強調區分學習，提出了一種結合基於信息噪聲對比估計（InfoNCE）的對比損失與焦點損失的復合損失函數。這一整合有助於改善詐騙識別，同時減少假陰性。
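
The composite loss mentioned in the TMR-GGNN entry above (an InfoNCE-based contrastive term combined with Focal Loss) can be illustrated with the standard textbook forms of both losses. This is a sketch only: the abstract does not say how the two terms are weighted, so the additive combination, the weight `lam`, and the default values of `gamma`, `alpha`, and `tau` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for a predicted fraud probability p and label y in {0, 1}."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def info_nce(pos_sim: float, neg_sims: Sequence[float], tau: float = 0.1) -> float:
    """InfoNCE loss for one anchor: one positive similarity and several negatives."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)                            # log-sum-exp with max subtraction
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]                 # equals -log softmax(positive)


def composite_loss(p: float, y: int, pos_sim: float,
                   neg_sims: Sequence[float], lam: float = 0.5) -> float:
    """Assumed combination: focal term plus lam times the contrastive term."""
    return focal_loss(p, y) + lam * info_nce(pos_sim, neg_sims)


# A fraud transaction (y=1) scored at 0.3, with one positive and two negative pairs.
loss = composite_loss(p=0.3, y=1, pos_sim=0.8, neg_sims=[0.1, -0.2])
```

The focal term targets class imbalance by shrinking the contribution of easy examples, while the contrastive term pulls an anchor toward its positive and away from its negatives.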

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

2606.18379v1 by Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo, Hong Li, Xian Chen, Li Yu, Ke Pan, Sri Reddy, Mahesh Srinivasan, Nipun Mathur, Haomin Yu, Hong Yan

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems -- graph construction, representation learning, and real-time serving -- yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN -- this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure -- this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher recall than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.

摘要：圖形基於檢索在十億節點規模上需要共同解決三個緊密耦合的問題——圖形構建、表示學習和實時服務——然而現有的工作各自獨立處理這些問題。我們提出了 RankGraph-2，這是一個在 Meta 部署的框架，為基於相似性的檢索（U2U2I 和 U2I2I）共同設計所有三個生命周期階段，其中每個階段的需求影響其他階段。服務需要共同學習的集群索引，以避免昂貴的在線 KNN——這將索引共同訓練推入訓練目標。訓練受益於這樣的觀察：基於相似性的檢索容忍預計算的鄰域，消除了在線圖形基礎設施——這要求構建產生自包含的數據。構建還必須支持小時級別的刷新以覆蓋項目。根據這些級聯需求，RankGraph-2 通過帶有受歡迎度偏差修正的子採樣將數百萬億的邊減少到數百億，通過個性化 PageRank 預計算多跳鄰域，並共同學習一個殘差量化集群索引，將服務計算成本降低 83%。這種生命周期共同設計使得簡單的架構能夠在二分圖上實現比 GAT + Deep Graph Infomax 模型高 3.8 倍的召回率，並比 PyTorch-BigGraph 在項目檢索上高 2.1 倍。RankGraph-2 提供高達 +0.96% 的 CTR 和 +2.75% 的 CVR，並已支持 20 多個主要平台的檢索啟動。
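
The RankGraph-2 entry above mentions pre-computing multi-hop neighborhoods via personalized PageRank. The idea can be sketched at toy scale as follows. This is not the production pipeline, which the abstract does not describe at this level: the power-iteration formulation, the restart probability `alpha`, the handling of dangling nodes, and the tiny user-item graph are assumptions chosen for illustration.

```python
from typing import Dict, List


def personalized_pagerank(graph: Dict[str, List[str]], source: str,
                          alpha: float = 0.15, iters: int = 50) -> Dict[str, float]:
    """Power iteration for PageRank with all restart mass on `source`.

    `graph` maps each node to its out-neighbors; every neighbor must also
    appear as a key. `alpha` is the restart (teleport) probability.
    """
    scores = {n: 0.0 for n in graph}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        for node, mass in scores.items():
            nxt[source] += alpha * mass                  # restart step
            nbrs = graph[node]
            if nbrs:
                share = (1.0 - alpha) * mass / len(nbrs)
                for nb in nbrs:                          # walk step
                    nxt[nb] += share
            else:
                nxt[source] += (1.0 - alpha) * mass      # dangling node: return to source
        scores = nxt
    return scores


def top_k_neighbors(scores: Dict[str, float], source: str, k: int) -> List[str]:
    """The k highest-scoring nodes other than the source: a pre-computed neighborhood."""
    ranked = sorted((n for n in scores if n != source),
                    key=lambda n: scores[n], reverse=True)
    return ranked[:k]


# Toy bipartite user-item graph (hypothetical data).
toy_graph = {
    "u1": ["i1", "i2"],
    "u2": ["i1", "i3"],
    "i1": ["u1", "u2"],
    "i2": ["u1"],
    "i3": ["u2"],
}
ppr = personalized_pagerank(toy_graph, "u1")
neighborhood = top_k_neighbors(ppr, "u1", k=3)
```

Storing `neighborhood` per node ahead of time is what lets training read self-contained records instead of querying a live graph service.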

2606.18235v1 by Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on the upper confidence bound (UCB), selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.

摘要：零-shot 物體目標導航 (ZS-OGN) 要求具身體的代理在沒有任何先前訓練的情況下探索並定位目標物體。為此，最近的方法利用基礎模型。但它們通常依賴於靜態先驗，缺乏適應性，這導致重複錯誤和昂貴的試錯過程。在本文中，我們提出了一個自我演變的 ZS-OGN 框架，能夠實現持續的測試時改進。具體而言，我們通過從過去的軌跡中提取可行的知識來構建一個代理規則記憶。然後，我們提出了一種基於上置信界的檢索策略，通過平衡語義相關性和歷史成功來選擇有效的規則。此外，我們引入了一個記憶引導的預反模塊，預測行動前的潛在結果，減少低效的探索。大量實驗表明，我們的方法在成功率上超越了現有的零-shot 基準，實現了 10.1\% 的成功率提升，並且步驟更少。
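
The retrieval strategy in the entry above balances semantic relevance against historical success using an upper confidence bound. One way such a score might look is sketched below. The abstract does not give the formula, so the additive combination, the Laplace-smoothed success rate, the smoothed exploration bonus, and the `Rule` fields are all assumptions. This is a UCB-style variant, not necessarily the authors' scoring rule.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float      # semantic similarity to the current query, assumed in [0, 1]
    successes: int = 0    # episodes in which applying the rule helped
    uses: int = 0         # episodes in which the rule was retrieved


def ucb_style_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance + smoothed success rate + exploration bonus for rarely used rules."""
    success_rate = (rule.successes + 1) / (rule.uses + 2)           # Laplace smoothing
    bonus = c * math.sqrt(math.log(total_uses + 1) / (rule.uses + 1))
    return rule.relevance + success_rate + bonus


def select_rules(rules: List[Rule], k: int, c: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB-style score."""
    total_uses = sum(r.uses for r in rules)
    return sorted(rules, key=lambda r: ucb_style_score(r, total_uses, c),
                  reverse=True)[:k]


# Hypothetical rule memory.
memory = [
    Rule("check counters before cabinets", relevance=0.9, successes=1, uses=10),
    Rule("mugs are usually in the kitchen", relevance=0.8, successes=9, uses=10),
    Rule("avoid re-entering visited rooms", relevance=0.2),
]
ranked = select_rules(memory, k=3)
```

With equal usage counts, the rule with the better track record outranks a slightly more relevant rule that has mostly failed, while a never-tried rule still receives an exploration bonus.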

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

2606.18222v1 by Joy Bose

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

摘要：我們介紹 Darshana Graph，這是一個包含超過 125,000 條文本記錄的語料庫，涵蓋了古典印度教、佛教和耆那教的哲學傳統，資料來源包括《博伽梵歌》、《梵天經》、《主要奧義書》、《巴利經典》以及核心耆那教文本，這些資料均來自公共領域和公開授權的翻譯。其獨特的貢獻在於一個結構上獨特的子集，約有 8,500 條印度教和耆那教的記錄，其中相同的根本經文或經句在代表五個維丹塔學派和其他 darshanas 的十八位歷史評論家之間對齊，使得獨立的詮釋傳統能夠直接比較如何解讀相同的來源材料。據我們所知，沒有任何公開可用的資源能在這個規模上提供可比的跨評論家對齊。我們基於這個語料庫呈現了兩項分析。首先，一個透明的文體計量比較，無需機器學習，通過經文引用密度、明確反駁率和句子複雜性來衡量論證風格。它發現引用密度與反駁率之間存在中等的負相關性，在三位評論家中，相關教義系譜的反駁率顯著增加，並且在巴利經典內部也存在可測量的類別層級差異。其次，我們描述了一個受限的大型語言模型管道，該管道使用預定義的關係詞彙和確定性的事後驗證，提取概念之間的類型哲學關係。結果圖顯示了跨學派的分歧模式，同時揭示了重要的提取限制，包括獨立的嵌入基礎分析與圖派生結果不一致的情況。我們發布了完整的語料庫、提取的關係圖和所有源代碼。

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

2606.18216v1 by Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.

摘要：知識蒸餾將教師的能力轉移到小型學生，但在小型學生的範疇內卻顯得脆弱：迫使學生模仿來自更大教師的邏輯輸出，會使其集中於教師最尖銳的模式，從而損害在訓練語料庫之外的基準家庭上的泛化能力。強化學習（RL）通過在學生自己的回合上進行訓練來避免邏輯模仿。然而，在每個回合都失敗的問題上——產生零優勢並被靜默丟棄——將更強的教師反應注入策略梯度會打破在政策上的假設並引起漂移。我們引入了近端政策優化區域（ZPPO），靈感來自維果茨基的近端發展區，該方法將教師保持在提示內而非策略梯度中。在困難問題上，ZPPO 構建了兩個重新格式化的提示：二元候選人包含問題（BCQ）將一個正確的教師反應與一個不正確的學生反應配對，作為學生必須區分的匿名候選者，而負候選人包含問題（NCQ）將學生的錯誤回合聚合成一個單一提示，以顯示其共同的失敗模式。一個提示重播緩衝區會循環每個困難問題，直到它要麼畢業——學生在該問題上的平均回合準確率達到一半——要麼在有限容量下以先進先出方式驅逐，從而在學生當前的近端發展區內放大 BCQ 和 NCQ。在 Qwen3.5 系列的四個學生規模（0.8B-9B）上，使用一個 27B 的教師，作為視覺-語言模型進行後訓練，並在 31 個基準套件（16 VLM，10 LLM，5 視頻）上進行評估，ZPPO 的表現超過了離線/在線蒸餾和 GRPO，在最小規模下獲得了最大的增益。
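
The prompt replay buffer in the ZPPO entry above keeps a hard question in circulation until it graduates (mean rollout accuracy reaches one half) or is evicted first-in-first-out when capacity runs out. A minimal sketch of that bookkeeping follows. The class name, method names, and the choice to compare accuracy with `>=` are hypothetical. Construction of the BCQ and NCQ prompts themselves is out of scope here.

```python
from collections import OrderedDict
from typing import List


class PromptReplayBuffer:
    """FIFO buffer of hard questions with a graduation threshold."""

    def __init__(self, capacity: int, graduate_at: float = 0.5) -> None:
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items: "OrderedDict[str, str]" = OrderedDict()  # question id -> prompt

    def add(self, qid: str, prompt: str) -> None:
        """Insert a hard question, evicting the oldest one if the buffer is full."""
        if qid in self._items:
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)       # FIFO eviction
        self._items[qid] = prompt

    def report(self, qid: str, rollout_correct: List[bool]) -> bool:
        """Record rollout outcomes; remove the question and return True if it graduates."""
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self) -> List[str]:
        """Question ids still being recirculated, oldest first."""
        return list(self._items)


buffer = PromptReplayBuffer(capacity=2)
for qid in ("q1", "q2", "q3"):
    buffer.add(qid, f"prompt for {qid}")          # q1 is evicted when q3 arrives
after_fill = buffer.pending()
graduated = buffer.report("q2", [True, False])    # accuracy 0.5 reaches the threshold
```

Questions that neither graduate nor get evicted are simply returned by `pending()` on the next training step, which is what gives the student repeated exposure to material just beyond its current ability.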

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

2606.18205v1 by Diaa Fayed, Laurent Romary

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.

摘要：本文提出了一種穩健的方法論，用於系統化數位化和編碼《阿爾-毛里德阿拉伯語-英語詞典》，將其從傳統印刷資源轉變為標準化的計算詞彙庫。針對阿拉伯語詞彙基礎設施中的一個重大空白，本研究採用了雙標準框架，將ISO詞彙標記框架（LMF）與文本編碼倡議TEI Lex-0指導方針對齊。通過對詞典的宏觀和微觀結構應用編輯視角，研究解決了20世紀雙語詞典中典型的結構模糊性和標點不一致性。該方法論建立在對詞典詞彙知識密度的實證分析之上。研究基於一個代表性樣本（字母Ayn，佔總體積的4.6%），為編碼過程提供了科學依據，顯示出91%的結構解析準確率。對信息提取規則的定量評估顯示出高效能，同義詞的精確度為85%，召回率為98%，而其他形態語義特徵的精確度為88%。除了技術描述外，本文還對現有的阿拉伯語詞彙資源進行了批判性比較，並討論了在建模特定阿拉伯現象（如隱含的“開放集”語義關係和分散的形態線索）時TEI Lex-0的局限性。此外，本研究探討了通過建立可擴展的基於前綴的參考系統來整合語言連結開放數據（LLOD）的潛力，該系統促進了資源在語義網中的納入。最終結果是一個可互操作的、機器可處理的資源，為阿拉伯語自然語言處理和數位人文學科社群內複雜傳統雙語詞典的回溯數位化提供了可重複的工作流程。

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

2606.18192v2 by Nick Bettencourt, Xiaowei Ding, Kay Giesecke

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

摘要：隨著高品質公共網路語料庫逐漸枯竭，乾淨的長上下文文件已成為大型語言模型（LLMs）訓練數據的稀缺且昂貴的來源。現有的長上下文語料庫往往是專有的，獲取成本高昂，或是合成生成的，或者集中於狹窄的領域，如程式設計。我們介紹了斯坦福EDGAR檔案數據集（SEFD），這是一個將SEC檔案重建為佈局忠實的MultiMarkdown的開放數據集，用於金融語言建模和評估。SEFD使經過審計的財務報表、風險披露、所有權報告、會計註釋和市場影響事件檔案可用作長上下文的預訓練數據，以及金融推理、預測、合規性和文件理解的基礎。生成的語料庫在標記效率上表現良好，適合模型使用，並且與Common Crawl衍生的語料庫重疊率低於0.1%。我們發布了SEFD-v1，這是一個152B標記的初始公共快照，並提供了對一個估計為550B標記的更大18.5M檔案存檔的語料庫級分析。我們進一步介紹了兩個基於SEFD的基準：EDGAR-Forecast，評估模型知識截止後的檔案基礎數字預測，以及EDGAR-OCR，評估複雜財務表格的轉錄。

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

2606.18154v1 by Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

摘要：建立個性化心臟電生理（EP）數位雙胞胎需要為每位患者識別適當的模型結構，而不僅僅是擬合參數。傳統方法依賴專家手動指定混合物理-神經架構，這需要深厚的領域專業知識，且無法在患者之間轉移。最近的研究已經應用大型語言模型（LLMs）來生成或作為混合模型。然而，儘管這些基於LLM的方法具有良好的泛化能力，但它們缺乏穩定心臟模擬所需的結構先驗。因此，我們提出LEADS，一個將心臟EP領域知識表述為結構化行動空間的框架，並利用LLM代理來發現混合模型。該代理遵循迭代推理和行動循環來選擇、組合和精煉混合模型，同時使用梯度下降來處理參數擬合。所提出的LEADS旨在使每個候選模型朝向物理基礎、可解釋且數值穩定的方向設計，同時允許開放式的架構發現。我們在具有三個真實反應模型的合成數據和真實心臟EP數據上驗證LEADS，證明其在性能上優於人類設計的混合模型和其他基於LLM的混合建模。

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

2606.18147v1 by Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

摘要：語言模型在醫學問答方面表現出色，在某些情況下超越了一般醫生的準確性。然而，回答有關可穿戴健康數據的問題仍然具有挑戰性且研究不足，因為這些無處不在的傳感器產生連續的、高維度的和長期的數據，這與 LLM 預訓練中的以文本為中心的分佈對齊並非易事。傳感器模態和用戶意圖的多樣性無法通過固定的推理工作流程或單一的預訓練基礎模型有效處理。為了解決這些挑戰，我們提出了 WEQA，一個查詢自適應代理框架，將 LLM 推理與專門的可穿戴分析和建模工具統一。我們使用 LLM 控制器來合成執行計劃，並動態地將每個查詢路由到適當的傳感器分析和預訓練模型的組合，並利用外部知識進行基於事實的回應審核。我們還策劃了一個基準，涵蓋四個開放的可穿戴數據集，包括三個不同健康領域的分析和預測任務。實驗表明，我們的框架比 LLM 和代理基準準確性高出 24%，而一項由 12 位醫學專家和 8 位用戶參與的盲測顯示在實用性和臨床合理性方面有顯著提升。

Knowledge Reutilization in Meta-Reinforcement Learning

2606.18132v1 by Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% to 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.

摘要：Meta-強化學習通過從相關任務中提取共享結構來實現快速適應，但現有的端到端方法通常將任務推斷與具體實體的控制結合在一起。這種結合可能會模糊非參數任務語義，降低樣本效率，並限制跨代理重用。我們提出了一個元知識重用框架，該框架在動力學簡化的代理上學習任務級知識並將其轉移到異質代理。該框架使用貝葉斯非參數先驗來組織潛在任務模式，並使用高級策略生成任務級幅度指導。為了將可重用的任務知識與不同的實體橋接，我們引入了一個語義-幅度接口和一個輕量級的時間適配器，將凍結的元知識轉換為時間對齊的子目標，以供具體實體的低級控制器使用。在多個運動代理上的實驗顯示，我們的框架相比於最近的最先進基準，將最終步驟跟蹤誤差降低了94.75%至99.79%，並且在約23.8%的交互數據下實現了可比的部署性能。

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

摘要：最近涉及用於心理健康支持的LLM事件揭示了一個關鍵的評估缺口：表面上的安全分數無法捕捉模型在現實情境中隨時間推移的情感敏感互動中的行為。現有的基準測量知識、安全性或靜態反應質量，但未能評估LLM互動是否幫助用戶持續反思、應對和自主做出決策。我們將這一缺失的維度正式化為認知萎縮（COGNITIVE ATROPHY），這是一種在AI介導的心理健康支持中與安全性和幫助性不同的過程層面行為測量。為了測量它，我們引入了認知萎縮基準（COGNITIVE ATROPHY BENCH），這是一個基於1,576個完全由人類生成的諮詢對話、15,680次回合和來自五個LLM的42,230個回應的臨床基準。三位臨床和神經心理學專家開發了一個涵蓋用戶背景、回應行為和全球風險標誌的20屬性架構；六位經過培訓的臨床審核員應用該架構並提供基於證據的評估，產生了5,324條審核判斷。我們進一步引入了用戶輸入風險指數（User-Input Risk Index, UIRI）、認知萎縮風險指數（Cognitive Atrophy Risk Index, ARI）和軌跡摘要。在五個LLM中，模型在單回合和多回合設置中顯示出一致的中到高水平的萎縮對齊行為。儘管模型通常對明顯的安全提示作出反應，但當用戶尋求解決方案或決策時，它們的適應性較低。主導的重複模式包括指導性建議、問題解決、推薦回應、主題轉換和可能加強依賴而非反思的驗證形式。我們的工作使認知萎縮可測量，並為審計敏感LLM對話中的模型行為提供了基礎。

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

2606.18114v1 by Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N

State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation (KD) from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100) -- approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.

摘要：狀態空間模型（SSMs）如 Mamba-2 提供線性時間推斷，但其記憶體佔用限制了邊緣部署。先前的三元 SSM 研究（Slender-Mamba）從 150B 代幣開始訓練；我們顯示預訓練的檢查點已足夠，將邊際代幣預算減少 1,000 倍。使用分組量化感知訓練（QAT）並從凍結的 FP16 教師進行知識蒸餾，我們將 Mamba-2 1.3B 壓縮至 3.61 倍（從 2,687 MB 減少至 744 MB），並在僅 102M 代幣（4 GPU 小時，單個 H100）中達到 48.1% 的零-shot 準確率（7 任務平均）——接近 Bi-Mamba 的 48.4%（在 +/-0.9pp CI 內）。這種從預訓練設定的 QAT 顯示出零比率崩潰，這是一種由可學習量化尺度引起的新穩定性問題，而在從零開始訓練中並未出現。我們進一步顯示，對於Transformer有效的事後修正策略因重複性導致的誤差累積而對 SSMs 失效。這些結果表明，三元 SSMs 不需要昂貴的從零開始訓練：從預訓練檢查點進行的 QAT 結合 KD 是一種數據高效的替代方案。
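
The grouped ternary quantization and the zero ratio mentioned in the abstract can be illustrated with a minimal sketch. It assumes a fixed absmean scale per group (a common choice for W1.58 weights), whereas the paper uses learnable scales; the function names and the group size are hypothetical and not taken from the paper.

```python
def ternary_quantize_grouped(weights, group_size=4, eps=1e-8):
    """Map a flat list of weights to codes in {-1, 0, +1}, one scale per group.

    Assumption: the scale of each group is the mean absolute weight (absmean).
    The paper's scales are learnable, which this sketch does not model.
    """
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = sum(abs(w) for w in group) / group_size + eps
        scales.append(scale)
        for w in group:
            # Round to the nearest integer, then clamp to the ternary range.
            codes.append(max(-1, min(1, round(w / scale))))
    zero_ratio = codes.count(0) / len(codes)  # fraction of weights mapped to 0
    return codes, scales, zero_ratio


def dequantize(codes, scales, group_size=4):
    """Reconstruct approximate weights from ternary codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


codes, scales, zero_ratio = ternary_quantize_grouped(
    [0.9, -1.1, 0.05, 0.0, 2.0, -2.0, 0.1, -0.1]
)
```

Tracking `zero_ratio` during training is one way to watch for the kind of collapse the abstract describes; the reported compression factor itself is simply 2,687 / 744 ≈ 3.61.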

S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices

2606.18096v1 by Marco Deano, Filippo Ziche, Nicola Bombieri

Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and facilitating their deployment in practical, resource-constrained scenarios.

摘要：結構化狀態空間模型（SSMs），包括 S4 和 S4D 架構，最近已成為捕捉序列數據中長距依賴關係的強大替代方案，超越基於注意力的模型。儘管它們在實證性能上表現優異，但由於計算和記憶體需求，將這些模型部署在時間和資源受限的環境中仍然具有挑戰性。在本文中，我們提出了一種新穎的增量運算子級修剪方法，適用於基於 S4 和 S4D 的模型，該方法顯著降低了推理成本，同時保持預測性能。據我們所知，這是第一個系統性研究 SSMs 的結構化運算子修剪的工作。我們的方法通過將結構化遮罩與微調交錯進行，逐步修剪模型運算子，同時共同監控準確性和推理延遲。我們在一個統一的訓練和評估框架內實現了這種方法，使得系統性探索效率與準確性之間的權衡成為可能。在多個基準數據集上的實驗顯示，修剪多達 70% 的模型運算子在大多數情況下保持了原始模型的性能，同時顯著降低了推理延遲。這些結果表明，結構化運算子修剪是一種有效且之前未被探索的策略，用於提高 SSMs 的效率，並促進其在實際資源受限場景中的部署。
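
The interleaving of structured masking with fine-tuning described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: the importance scores, the pruning schedule, the accuracy tolerance, and the `evaluate` and `fine_tune` callbacks are all hypothetical stand-ins.

```python
def incremental_prune(importance, evaluate, fine_tune,
                      schedule=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
                      tolerance=0.01):
    """Progressively mask the least important operators.

    importance: dict mapping operator name -> score (hypothetical ranking signal).
    evaluate(mask) -> (accuracy, latency) for the model with `mask` pruned.
    fine_tune(mask) -> None; recovers accuracy after masking.
    Returns the largest accepted mask and the per-step history.
    """
    order = sorted(importance, key=importance.get)  # least important first
    base_accuracy, _ = evaluate(frozenset())
    accepted, history = frozenset(), []
    for ratio in schedule:
        mask = frozenset(order[:int(round(ratio * len(order)))])
        fine_tune(mask)
        accuracy, latency = evaluate(mask)
        history.append((ratio, accuracy, latency))
        if accuracy < base_accuracy - tolerance:
            break  # stop at the first step that degrades accuracy too much
        accepted = mask
    return accepted, history


# Toy usage: ten operators; accuracy collapses once "op3" is pruned.
scores = {f"op{i}": float(i) for i in range(10)}
toy_eval = lambda mask: (0.5 if "op3" in mask else 0.9, 10 - len(mask))
pruned, log = incremental_prune(scores, toy_eval, lambda mask: None)
```

Recording latency alongside accuracy at every step is what lets the efficiency-accuracy trade-off be inspected afterwards rather than fixed in advance.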

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

2606.18092v1 by Wanhao Niu, Qiyan Ke, Yuan Sun, Hao Sun, Jie Xu, Muyuan Ma, Ruiqi Hu, Fuchun Sun

Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.

摘要：跨端執行器抓取生成尋求一個統一模型，能夠在物體和從平行夾具到靈巧端執行器的不同實體之間進行泛化。現有的抓取生成器通常是為固定實體設計，或使用靜態描述符編碼實體身份，這在拓撲、驅動耦合和接觸幾何顯著不同時削弱了轉移能力。我們提出了EAGG，一個與實體對齊的抓取生成器，該生成器用一個拓撲感知的端執行器圖和一個特定實體的低維端執行器控制空間來表示每個實體。一個凍結的端執行器認知骨幹將當前的關節狀態轉換為幾何感知的標記，這些標記作為可重用的形態先驗，而迭代幾何注入在採樣過程中不斷刷新這些標記，以便條件保持與不斷演變的端執行器幾何同步。在MultiGripperGrasp基準測試中，EAGG在六個訓練端執行器上達到56.17%的平均成功率，與專門訓練相差僅1.10個百分點，同時保持對微調和零樣本端執行器的轉移。迭代幾何注入進一步將合併的中位接觸距離從0.239厘米減少到0.189厘米。這些結果顯示，跨端執行器抓取生成通過在共享生成器內對齊實體結構而不是抑制實體差異得到了加強。代碼可在https://github.com/wanhaoniu/EAGG獲得。

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

2606.18075v1 by Haoyang Zhong, Yifei Sun, Antong Zhang, Chunping Wang, Lei Chen, Yang Yang

Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.

摘要：檢索增強生成（RAG）已成為增強大型語言模型（LLMs）與外部知識的範式，但現有的基於圖的方法面臨一個基本限制：以實體為中心和以區塊為中心的方法在原始文本的基礎上運作，卻沒有真正的知識融合。雖然以實體為中心的方法連接邏輯相關的內容，而以區塊為中心的方法保留上下文，但兩者都是通過相似性搜索分別檢索信息，錯過了其綜合所產生的理解。在本文中，我們提出了HyGRAG，一個層次化圖形RAG框架，通過解決三個核心挑戰超越源文檔：構建真正整合上下文和關聯信息的摘要，利用這些綜合表示在檢索過程中訪問新興知識，並有效更新動態語料庫的層次結構。具體而言，我們設計了基於混合圖的層次索引結構，包含區塊和實體節點，然後對它們進行迭代聚類並生成基於LLM的摘要。接著，我們設計了上下文和關係感知的檢索，能夠在所有抽象層次上進行搜索，同時通過社群成員資格進行擴展。此外，我們通過基於附加的算法實現動態知識更新，僅需進行局部重新摘要。實驗結果顯示，HyGRAG將多跳推理任務的平均準確率提高了9.7%，同時保持合理的效率。

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

摘要：醫學影像分類面臨一個根本性的困境：雖然深度學習模型在大規模下表現卓越，但現實世界的臨床情境常常因為標註成本、隱私限制和疾病稀有性而遭遇嚴重的數據匱乏。這一挑戰在病理性疤痕分類中尤為明顯，因為區分凹疤和肥厚性疤痕需要微妙的專家知識，而標註的影像極為有限。我們提出了一種新穎的範式，將大型語言模型（LLMs）重新定位為知識驅動的特徵工程師，而非端到端的分類器。我們稱這一框架為ScaFE（疤痕特徵工程）。我們的關鍵見解是，LLMs編碼了豐富的醫學知識，這些知識可以外部化為可執行的特徵提取代碼，使高維影像轉換為低維且臨床可解釋的表示。具體而言，我們使用既定的疤痕評估標準來提示LLM生成確定性的Python代碼，提取與臨床評分系統（如溫哥華疤痕量表）對齊的特徵。我們的方法提供了三個主要優勢：（1）數據效率，通過將知識獲取與統計學習解耦，實現有限訓練樣本下的穩健性能；（2）隱私保護，因為原始影像在本地處理，未暴露於外部LLMs；以及（3）可解釋性，通過基於臨床推理的明確特徵。對疤痕分類的廣泛實驗表明，我們的方法在有限數據條件下始終優於端到端的深度學習基準或將LLMs用作黑箱分類器，確立了將LLMs整合進數據高效且臨床透明的醫學AI系統中的有前景方向。

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

2606.18062v1 by Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

摘要：大型語言模型（LLMs）被廣泛用來滿足用戶的信息需求；用戶向LLMs詢問天氣、提出教育問題，並諮詢法律協助。數位安全和隱私（S&P）是一個特別少有研究的領域，用戶可能會尋求LLMs的幫助，以了解如何保護他們的在線帳戶或防止電腦遭受網絡攻擊。據我們所知，之前沒有研究收集或分析用戶向LLMs提出的S&P問題；先前對LLM回應質量的研究依賴於專家撰寫的S&P誤解或常見問題，而不是用戶查詢。基於WildChat，這是一個收集到的320萬用戶-LLM對話的數據集，我們的研究識別了14,727個S&P提示，並將其分類為九個類別，涵蓋各種S&P主題。從這些S&P提示中，我們抽取了450個並進行了主題分析，以特徵化用戶向LLMs提出的S&P問題。與主題分析分開，我們整理了270個尋求建議的S&P提示，用戶在這些提示中請求建議、指導或具體的S&P信息。我們測量了LLM在向其提出提示時的回應質量和一致性，進行了10次提問。我們發現商業LLMs在性能上優於開放權重模型（GPT 5.5在98%的提示中提供了「足夠好」的回應；Llama 4則為47%）。然而，在平均獲得高質量回應的提示中，商業模型有時會在不同的運行中產生矛盾的回應，這可能會使用戶感到困惑或誤導。

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

2606.18003v1 by Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle

Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.

摘要：集體自適應系統（CAS）越來越依賴機器學習，讓每個節點從本地感知數據中學習，並使其行為與周圍環境對齊。然而，擴展這種智能會帶來根本挑戰：感知數據通常涉及隱私問題，阻止集中收集；節點是移動的，穿越附近節點感知相似現象而遠端節點觀察到截然不同條件的區域，形成自然的空間集群；而且，由於移動性，這些分佈隨時間演變，引入了時間漂移，使得本地模型逐漸過時。這些動態在不同領域中出現——車輛感知、無人機監測、智能手機群眾感知——然而，隱私、空間異質性和時間漂移的相互作用嚴重削弱了傳統學習策略。因此，我們提出了C2FL，一種完全分散的聯邦學習（FL）方法，節點通過空間集群自組織成學習小組，反映環境的地理結構。為了抵消時間漂移，每個節點將經驗重播與考慮停留時間的自適應平均步驟相結合，隨著在同一區域停留時間的增長，逐步納入地區共識，同時在演變的分佈下保留先前獲得的知識。我們在系統性重現空間和時間變化的合成實驗中評估我們的方法，顯示標準的聯邦策略在這些條件下顯著退化，而我們的方法恢復了強健的集體適應能力。
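
The dwell-time-aware averaging step can be illustrated with a small sketch in which a node blends its local parameters with the regional consensus, trusting the consensus more the longer it stays in one area. The exponential weighting and the time constant `tau` are assumptions chosen for illustration; the abstract does not specify the exact schedule, and experience replay is omitted here.

```python
import math


def dwell_weight(dwell_time, tau=10.0):
    """Trust placed in the regional consensus, rising from 0 toward 1.

    Assumption: an exponential saturation with time constant `tau`.
    """
    return 1.0 - math.exp(-dwell_time / tau)


def adaptive_average(local, regional, dwell_time, tau=10.0):
    """Blend local and regional parameter vectors by dwell time."""
    alpha = dwell_weight(dwell_time, tau)
    return [(1.0 - alpha) * l + alpha * r for l, r in zip(local, regional)]


local_model = [1.0, 2.0]
regional_model = [3.0, 4.0]
just_arrived = adaptive_average(local_model, regional_model, dwell_time=0.0)
long_stay = adaptive_average(local_model, regional_model, dwell_time=1000.0)
```

A node that has just entered a region keeps its own parameters, while one that has stayed far longer than `tau` ends up close to the regional consensus.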

Dimensionality Controls When Modularity Helps in Continual Learning

2606.17889v1 by Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

摘要：組合學習系統必須在可塑性，即獲取新知識的能力，與穩定性，即保留先前學習的組件之間取得平衡，特別是在任務共享結構並存在干擾風險的情況下。我們研究模組化架構、任務相似性和表徵維度如何共同塑造序列 A-B-A 範式中的組合持續學習，將任務劃分的遞歸網絡與單一網絡基準進行比較，同時通過權重縮放操作引入高維和低維範疇。在高維的「懶惰」範疇中，兩種架構的性能和內部幾何形狀相似，這表明當表徵受到弱約束時，顯式模組結構的影響不大。在較低維的「豐富」範疇中，模組化變得至關重要：模組網絡發展出分級的任務特定子空間，對於相似任務重疊，對於中等不相似的任務部分對齊，對於不相似的任務則分開，從而產生比單一網絡更具組合性和可解釋性的組織。這些發現確定了由初始化規模引起的表徵範疇，該範疇與表徵維度共同變化，是決定何時組合模組結構在持續學習中功能上有益的關鍵因素，並支持將安全性和穩健性視為表徵子空間的自適應分配問題，而非固定的分離與共享問題。

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

2606.17887v1 by Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos

Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role- and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems.

摘要：生成式人工智慧（GenAI）在工作場所的部署正在迅速加速。然而，誰採用、誰受益、誰被遺留在外以及原因等問題仍然缺乏研究。在本文中，我們調查了這些動態，背景是一家跨國科技公司從傳統人力資源（HR）搜尋系統過渡到GenAI支持的系統，分析了搜尋日誌數據、調查數據（n=25）和十次半結構式訪談。我們的研究結果顯示，採用取決於GenAI系統的設計假設與員工的工作位置（角色、語言、任期）之間的契合度。此外，我們發現員工對GenAI答案的信任是通過檢查來源、系統之間的比較以及在有疑慮時向同事或HR尋求意見來建立的。我們的貢獻有兩個方面。首先，我們提供了在實時組織過渡期間工作場所GenAI採用的實證證據，顯示採用受到情境契合、搜尋素養和信任校準等因素的影響。此外，這也受到系統內容質量、員工培訓和指導等知識條件的進一步影響。其次，我們將這些發現轉化為高風險環境（如HR）中包容性部署和採用的設計考量。我們認為，組織應該設計系統時考慮到它們對不同社會群體所產生的角色和情境敏感的好處。他們還需要將組織知識基礎設施視為人工智慧基礎設施，以提高GenAI系統的問責性和可用性。

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

2606.17856v1 by Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary-query alignment with sentence-level matching to robustly activate relevant entities under paraphrase and abstraction. We then introduce a frequency-aware weighted flow module that routes relevance through entity-passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.

摘要：圖基檢索增強生成（GraphRAG）對於知識密集型和多跳查詢任務是有效的；然而，許多現有的方法主要以實體為基礎構建圖形，並依賴於隱式語義相關性傳播。這通常會在（i）當用戶查詢在實體層面上抽象且語義稀疏時，導致檢索不足，以及（ii）在脆弱的多跳推理中受到影響，噪聲激活可能會干擾實體到實體的轉換並腐蝕推斷的關係鏈，產生不可靠的結論。為此，我們提出了FlowRAG，一個語義感知的檢索框架，改善了語義召回和明確推理。具體而言，FlowRAG在段落、摘要、句子和實體上構建了一個四層異構圖，其中摘要節點作為粗略的語義中心。在檢索時，雙粒度激活模塊將摘要-查詢對齊與句子級匹配相結合，穩健地激活相關實體以應對同義詞和抽象。我們接著引入了一個頻率感知的加權流模塊，通過實體-段落鏈接路由相關性，這些鏈接根據段落內的詞頻加權，修剪噪聲連接並提取高置信度的推理路徑，作為生成的明確邏輯骨架。大量實驗表明，FlowRAG在複雜推理基準上獲得了最先進的性能。
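
The idea of routing relevance through entity-passage links weighted by within-passage term frequency can be sketched as follows. This is a single-hop toy under stated assumptions: whitespace tokenization, single-token entities, a fixed pruning threshold `min_tf`, and the function names are all hypothetical, and the paper's multi-level graph and path extraction are not modeled.

```python
def build_links(passages, entities):
    """Return {entity: {passage_id: within-passage term frequency}}.

    Assumption: term frequency = occurrences of the entity token divided by
    the number of whitespace tokens in the passage.
    """
    links = {}
    for entity in entities:
        for pid, text in passages.items():
            tokens = text.lower().split()
            count = tokens.count(entity.lower())
            if count:
                links.setdefault(entity, {})[pid] = count / len(tokens)
    return links


def flow(activation, links, min_tf=0.3):
    """Push entity activations to passages, pruning weak links."""
    scores = {}
    for entity, strength in activation.items():
        for pid, tf in links.get(entity, {}).items():
            if tf < min_tf:
                continue  # drop low-frequency (likely noisy) connections
            scores[pid] = scores.get(pid, 0.0) + strength * tf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


passages = {
    "p1": "mamba mamba model",
    "p2": "mamba graph graph graph",
    "p3": "graph",
}
links = build_links(passages, ["mamba", "graph"])
ranking = flow({"mamba": 1.0, "graph": 0.5}, links)
```

In this toy, the weak link from "mamba" to `p2` (frequency 0.25) is pruned, so `p2` is scored only through "graph".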

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

2606.17821v1 by Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.

摘要：大型語言模型（LLMs）在將自然語言翻譯成 SQL 方面展現了卓越的能力，但現有的方法在需要多步驟、數據感知推理的複雜查詢上仍然表現不佳。我們介紹了 DecoSearch，一個無需訓練的框架，通過將每個查詢路由到適當的推理努力水平來解決這個問題。一個輕量級的 Schema Selector 首先將完整的數據庫架構修剪到相關的表和列。然後，LLM Judger 決定問題是否需要分解：簡單的問題遵循直接生成路徑，而複雜的問題則升級到原子子問題的有向無環圖（DAG），每個子問題通過針對性的 SQL 生成步驟解決。一個 RAG 組件用語義相似的訓練範例為分解器提供支持，而 Topology Refiner 在執行失敗信號表明分解存在缺陷而非可修復的 SQL 錯誤時，重構推理計劃。DecoSearch 在 BIRD 上達到 70.53% 的執行準確率，在 Spider 上達到 88.31% 的執行準確率，搭配 DeepSeek 主幹，超越了所有無需訓練的基準，同時消耗的標記數量比競爭方法少一個量級。它還作為一個模型無關的包裝器，持續改善微調的 SQL 生成主幹，而不需要對管道進行任何修改。
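
The routing and DAG execution described above can be sketched with Python's standard `graphlib` module. The plan format and the `is_complex`, `decompose`, and `solve` callbacks are hypothetical stand-ins for the LLM Judger, the decomposer, and the SQL generation step; the Schema Selector, RAG grounding, and Topology Refiner are not modeled.

```python
from graphlib import TopologicalSorter


def run_plan(plan, solve):
    """Solve sub-questions in dependency order.

    plan: {node: {"question": str, "deps": [node, ...]}} (hypothetical format).
    solve(question, dep_results) -> result for one atomic sub-question.
    """
    graph = {node: spec["deps"] for node, spec in plan.items()}
    results = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(graph).static_order():
        spec = plan[node]
        dep_results = {dep: results[dep] for dep in spec["deps"]}
        results[node] = solve(spec["question"], dep_results)
    return results


def answer(question, is_complex, decompose, solve):
    """Route simple questions to direct generation, complex ones to a DAG."""
    if not is_complex(question):
        return solve(question, {})
    plan, final_node = decompose(question)
    return run_plan(plan, solve)[final_node]


# Toy usage with a stub solver that records the order of calls.
calls = []

def stub_solve(question, dep_results):
    calls.append(question)
    return question + "|" + ",".join(sorted(dep_results))

toy_plan = {
    "q3": {"question": "c", "deps": ["q1", "q2"]},
    "q1": {"question": "a", "deps": []},
    "q2": {"question": "b", "deps": ["q1"]},
}
toy_results = run_plan(toy_plan, stub_solve)
```

`TopologicalSorter` raises `graphlib.CycleError` when the plan is not acyclic, which a caller could treat as one signal that the decomposition itself needs restructuring.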

A Framework for Evaluating Agentic Skills at Scale

2606.17819v1 by Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

摘要：代理技能——結構化、可重用的知識工件，增強了大型語言模型代理的能力——在業界迅速被採用，然而它們在商業和開源模型中的跨領域影響及使用仍未受到充分研究，且缺乏可重用的方法論來評估單一技能。在這項工作中，我們提出了一個評估框架，讓技能作者構建現實的任務，以嚴格評估對他們最重要的技能方面，並通過解決這些任務來估計技能的實用性。此外，我們將我們的評估方法擴展應用於500個現實世界的技能，生成1,000個從技能內容衍生的任務，以及遵循指令和目標完成的評分標準。使用這些指標，我們評估了19種代理模型配置，包括專有和開源模型，在這些任務上的表現。我們的結果顯示，模型在遵循技能中編碼的指令方面差異很大，導致其性能增益有顯著差異。此外，我們顯示，與無技能設置相比，訪問技能顯著改變了模型行為，提供了一種將有見地的工作流程編碼到大型語言模型代理中的基本機制。我們釋放了我們的評估數據集，以支持未來對代理技能的研究。

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

2606.18310v1 by Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, for example by crafting a malicious corpus. However, the synthetic text crafted by such data-centric methods can be detectable, causing the attack to fail. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing (CAREATTACK), a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

摘要：注入惡意知識到檢索增強生成（RAG）系統中可以操縱檢索到的證據並誤導下游生成，對人工智慧應用構成嚴重的安全威脅。現有的 RAG 注入攻擊主要依賴於操縱外部知識庫，例如製作惡意語料。然而，這種數據中心方法製作的合成文本可能是可檢測的，導致攻擊失敗。除了語料操縱之外，開源檢索器越來越多地使 RAG 系統暴露於以模型為中心的攻擊。在本文中，我們提出了衝突感知檢索器編輯，即 CAREATTACK，一個針對 RAG 中惡意知識注入的以模型為中心的檢索器攻擊框架。具體而言，CAREATTACK 包含兩個階段的衝突感知檢索器編輯和攻擊保留的錨點修復。衝突感知檢索器編輯將高效的閉式參數編輯應用於密集檢索模型，促進惡意知識超越良性競爭段落，並通過基於圖的衝突檢測和參數編輯投影解決潛在的參數衝突。然後，攻擊保留的錨點修復對編輯過的檢索器進行輕量級校準，以進一步消除對非目標提示的影響，同時保留對目標提示的攻擊有效性。我們在 Qwen3-Embedding-0.6B 和 BGE-M3 上實現了 CAREATTACK，並在三個基準數據集上進行了評估。實驗結果表明，我們的方法顯著促進了惡意段落進入 RAG 系統的檢索知識中，並能夠對一批目標提示和段落執行攻擊，前提是能夠訪問檢索模型參數。由於大多數 RAG 系統是基於開源檢索模型構建的，這項工作揭示了 RAG 系統中的一個實際攻擊面。代碼可在 https://anonymous.4open.science/r/CareAttack-3F1C 上公開訪問。

LLMs Infer Cultural Context but Fail to Apply It When Responding

2606.17688v1 by Yisong Miao, Jian Zhu, Vered Shwartz

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

摘要：最近的研究顯示，大型語言模型（LLMs）過度代表主導文化，特別是西方文化，同時邊緣化其他文化。我們調查這是否影響模型生成文化適應性回應的能力，通過評估它們根據用戶感知的文化背景使用當地測量單位的情況。我們引入了文化與實用回應推斷（CAPRI），這是一個具有不同文化提示水平的對話數據集。對於最先進的LLMs的實驗顯示，模型可以推斷文化背景並回憶相關的慣例，但通常未能利用這些信息來使其回答適應相關的文化慣例，除非明確提示它們按順序執行任務。我們進一步評估對時間和數量表達的解釋的適應性，這是受文化影響的兩個主觀語言基礎維度。我們發現，隨著文化提示的累積，模型越來越能夠適應其回答，但它們的先驗並不是文化中立的，有時與模型的原產國相符。總體而言，CAPRI為未來旨在縮小文化知識與文化適應性語言生成之間差距的研究提供了一個資源。

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

2606.18309v1 by Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

摘要：大型語言模型（LLM）去學習的目的是在保留已保留能力的同時，去除不必要的知識或行為。當前的去學習方法都涉及去學習與保留之間的權衡。我們發現保留激活偏差也可以用來量化去學習方法對保留造成的損害，而不考慮去學習過程的具體實施。這使我們能夠使用後驗方法恢復任何去學習方法的保留性能。因此，我們提出了一個互補的後驗設置，以清理最終更新向量，而無需重新運行原始的去學習管道。在這個設置中，我們設計了SAGE，光譜激活幾何清理，這是一種與來源無關的最終去學習更新的修正。SAGE從小型保留代理收集真實模塊輸入，提取其主導激活幾何，並以封閉形式解決一個以來源為基礎的優化目標，該目標抑制與高能保留方向對齊的更新組件，同時保留來源方法的遺忘載體。在多種去學習方法、模型規模和基準測試中，SAGE持續減輕保留與遺忘之間的權衡，將最終向量的後驗清理確定為機器去學習的一個實用且未被充分探索的方向。
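
The idea of suppressing update components aligned with high-energy retained directions can be illustrated with a deliberately simplified sketch. It assumes a single dominant direction found by power iteration and a hard projection, which is not the closed-form source-anchored objective the abstract refers to; all function names are hypothetical.

```python
def dominant_direction(inputs, iters=100):
    """Estimate the top principal direction of retain-set module inputs.

    Power iteration on X^T X, computed without forming the matrix.
    Assumption: the all-ones starting vector is not orthogonal to the
    dominant direction.
    """
    dim = len(inputs[0])
    v = [1.0] * dim
    for _ in range(iters):
        proj = [sum(x[i] * v[i] for i in range(dim)) for x in inputs]
        v = [sum(p * x[i] for p, x in zip(proj, inputs)) for i in range(dim)]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v


def sanitize_update(delta_w, direction):
    """Remove from each row of delta_w its component along `direction`.

    After this step the update has no effect on inputs that lie along
    the retained direction.
    """
    cleaned = []
    for row in delta_w:
        coeff = sum(r * d for r, d in zip(row, direction))
        cleaned.append([r - coeff * d for r, d in zip(row, direction)])
    return cleaned


retain_inputs = [[2.0, 0.0], [-3.0, 0.0], [0.0, 1.0]]  # energy mostly on axis 0
update = [[0.5, 0.2], [-1.0, 0.3]]                      # final unlearning update
u = dominant_direction(retain_inputs)
clean_update = sanitize_update(update, u)
```

Here the retain inputs carry most of their energy along the first axis, so the sanitized update keeps only its second-axis components. A fuller treatment would use several directions and a soft penalty instead of a hard projection.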

Handling Feature Heterogeneity with Learnable Graph Patches

2606.17667v1 by Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

摘要：近年來，基礎模型和圖形預訓練技術的快速發展引發了對構建通用預訓練圖形模型或圖形基礎模型（Graph Foundation Model, GFM）的日益關注。然而，一個重大挑戰是現有模型無法在沒有文本信息的情況下解決圖形數據中的特徵異質性，這妨礙了圖形模型在不同數據集之間的可轉移性。為了填補這一空白，我們提出了可學習圖形補丁的概念，將其視為任何圖形數據的最小語義單元。我們通過展開節點特徵並分別構建相應的補丁結構，將圖形分解為可學習的圖形補丁。然後，我們設計了一個框架，從不同領域的圖形數據中挖掘可轉移的信息。具體而言，在提取圖形補丁後，我們提出了一個補丁編碼器來從每個單元中提取知識，以及一個補丁聚合器來學習這些單元如何組合成整體。由於其領域無關的特性，該模型可以應用於不同領域的下游數據。此外，我們分析了我們的方法與現有圖形模型之間的聯繫，以及它所生成的節點嵌入的可轉移性。實證結果表明，我們的方法不僅實現了使用多領域圖形進行預訓練的能力，還在各種下游數據集和任務中顯示出增強的性能。此外，我們觀察到，隨著預訓練數據量的增加，下游性能持續改善。

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

2606.17646v1 by Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive -- coherent to user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques from saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. In an evaluation on facial expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

摘要：顯著性圖視覺化通過指向區域來解釋基於圖像的人工智慧預測，但這些通常不直觀且語義不清，留下了解釋性差距。我們認為，人工智慧的解釋應該是直觀的——與用戶知識一致，但又簡單且具選擇性，以加速解釋。受到藝術繪畫的啟發，我們提出了SketchXplain，用於生成基於草圖的視覺解釋，以實現直觀的基於圖像的可解釋人工智慧（XAI）。SketchXplain結合了顯著性圖、概念瓶頸模型和草圖優化技術，整合顯著性以選擇一致的觀察工件、知識一致性的概念、表示它們的提示以及簡單性的抽象。在面部表情識別的評估中，建模和用戶研究顯示，SketchXplain支持比顯著性圖或簡單繪圖更快的解釋，並且視覺化更一致。對皮膚病變診斷的進一步評估發現，SketchXplain更一致地視覺化疾病症狀，更好地支持非專業診斷。因此，這項工作說明了草圖在直觀、簡單、一致和快速的基於圖像的XAI視覺化中的價值。

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

2606.17637v1 by Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.

摘要：建築管理系統（BMS）對於優化現代建築的能源效率和運營性能至關重要。然而，不同製造商的 BMS 點缺乏標準化，造成了整合和數據利用的重大障礙。雖然 Brick 架構提供了建築系統的標準化本體，但將 BMS 點映射到適當的 Brick 類別面臨三個關鍵挑戰：（i）大量的 Brick 類別（最新版本中有 936 個），（ii）大型語言模型（LLMs）中有限的領域專業知識，以及（iii）驗證所需的大量手動工作。為了解決這些挑戰，我們提出了 Brick-DICL，一種兩階段動態上下文學習框架，用於自動化 Brick 架構分類。Brick-DICL 包含兩個主要組件：metadata-RAG，該組件檢索相關示例以增強 LLM 的領域知識，以及 class-RAG，該組件縮小潛在的 Brick 類別以應對龐大的分類空間。此外，我們實施了一種多 LLM 過濾機制，該機制比較多個模型的預測，並標記低信心的分類以供人工審查。因此：（i）一般性：Brick-DICL 適用於任何建築管理系統，無論製造商或元數據格式如何；（ii）新穎且強大：作為第一個動態上下文學習方法用於 Brick 架構分類，Brick-DICL 在建築數據集上實現了顯著的分類準確性提升，超越了現有方法；（iii）高效：我們的多 LLM 過濾策略減少了手動驗證的工作量，使快速的數位建築上線成為可能。廣泛的實驗證明了 Brick-DICL 在多樣化建築數據集上的有效性，加速了邁向標準化、可互操作的建築管理系統的進程。
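
The multi-LLM filtering step in the Brick-DICL abstract can be illustrated with a small sketch. The abstract does not state the agreement rule, so the unanimous-vote default, the function name `filter_by_agreement`, and the point names are all assumptions made for illustration; the class labels are used only as example strings.

```python
from collections import Counter

def filter_by_agreement(predictions, min_agreement=1.0):
    """Split points into auto-accepted labels and points flagged for review.

    `predictions` maps a BMS point name to the list of labels proposed by
    each model. The agreement threshold is a hypothetical rule, not the
    paper's stated criterion.
    """
    accepted, flagged = {}, []
    for point, labels in predictions.items():
        # Most frequent label and the number of models that chose it.
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[point] = top_label
        else:
            flagged.append(point)
    return accepted, flagged

# Hypothetical point names with labels from three models.
predictions = {
    "AHU1.ZN-T": ["Zone_Air_Temperature_Sensor"] * 3,
    "AHU1.SA-T": [
        "Supply_Air_Temperature_Sensor",
        "Supply_Air_Temperature_Sensor",
        "Return_Air_Temperature_Sensor",
    ],
}
accepted, flagged = filter_by_agreement(predictions)
```

With the default threshold, only the point on which all three models agree is accepted; the other is routed to a human reviewer. Lowering `min_agreement` trades review effort against the risk of accepting a contested label.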

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

2606.17634v1 by Dong Huang, Jianbo Sun, Pengkun Yang

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.

摘要：評估大型語言模型 (LLMs) 對於理解其能力、比較競爭系統以及支持可靠模型在實踐中的部署至關重要。對於開放式任務，成對評估已成為一種流行的範式，其中比較對同一提示的兩個回應，並將結果判斷匯總成整體排名。這一範式的一個核心挑戰是非傳遞性：所引發的比較結果可能無法支持任何一致的全局排名。例如，可能會觀察到循環偏好，如 $A \succ B \succ C \succ A$，或涉及平局的不一致性，如 $A \equiv B\equiv C\neq A$。這些矛盾使得最終的排行榜不穩定且難以解釋。在本文中，我們提出了一個提示擾動框架，以改善成對 LLM 評估的一致性。我們的方法生成每個提示的擾動變體，使用結果比較圖來識別和過濾結構上不一致的比較模式，然後將標準排名方法應用於過濾後的比較。所提框架的一個關鍵特徵是，在排名匯總之前，圖級結構一致性被明確納入評估流程中。這提供了一種簡單且有原則的方法來減少循環不一致性，並提高 LLM 排名的可靠性。
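
The cyclic preference $A \succ B \succ C \succ A$ mentioned in the abstract can be detected directly on a comparison graph. The following is a minimal sketch, assuming comparisons are stored as directed (winner, loser) edges; the rule of dropping every edge inside a cyclic triple is a hypothetical filter chosen for illustration, since the abstract does not specify the paper's actual criterion.

```python
from itertools import permutations

def find_cyclic_triples(wins):
    """Return the 3-item sets whose pairwise outcomes form a cycle.

    `wins` is a set of (winner, loser) pairs, i.e. directed edges.
    """
    items = sorted({x for edge in wins for x in edge})
    cycles = set()
    for a, b, c in permutations(items, 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            cycles.add(frozenset((a, b, c)))
    return cycles

def filter_inconsistent(wins):
    """Drop every edge whose endpoints both lie in some cyclic triple."""
    cycles = find_cyclic_triples(wins)
    bad_edges = {
        (w, l) for (w, l) in wins
        if any(w in cyc and l in cyc for cyc in cycles)
    }
    return wins - bad_edges

# A beats B, B beats C, C beats A (a cycle); A and B both beat D.
comparisons = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D")}
cyclic = find_cyclic_triples(comparisons)
kept = filter_inconsistent(comparisons)
```

Here the triple {A, B, C} is reported as cyclic and its three edges are removed, leaving only the comparisons against D for a downstream ranking method. The brute-force triple scan is cubic in the number of items, which is acceptable for small leaderboards but not for large ones.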

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

2606.17628v1 by Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

摘要：記憶已成為自我演化代理的標準基底，但保留經驗並不等同於學習如何通過這些經驗進行演化。現有的記憶代理可以儲存軌跡、檢索反思或累積技能，但往往缺乏選擇有用經驗、採取行動、撰寫可重用知識和維護不斷增長的資料庫的整體能力。我們介紹了 OPD-Evolver，一個緩慢-快速共同演化框架，通過政策內自我蒸餾培養這樣的代理演化者。在快速迴圈中，OPD-Evolver 與四層記憶層次結構互動，以快速讀取、使用、寫入和維護經驗，以便於快速測試時間的演化。在緩慢迴圈中，結果校準的記憶歸因和特權後見將這四種能力蒸餾成可部署的政策。在多領域基準測試中，OPD-Evolver 超越了記憶系統，如 ReasoningBank，達到高達 11.5% 的提升，以及基於訓練的方法，如 Skill0，約 5.8% 的提升。進一步分析顯示，OPD-Evolver 內化了高價值的經驗和記憶管理，使 OPD-Evolver-9B 能夠挑戰巨型對手，如 Qwen3.5-397B-A17B 和 Step-3.5-Flash，指向超越記憶增強代理的真正合格代理演化者。

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

2606.17591v1 by Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

摘要：訓練自由的口頭強化學習使得大型語言模型（LLM）代理能夠從世界反饋中學習——例如動態任務結果、市場回報或需求預測等客觀信號——通過從經驗中提取口頭規則並將其注入作為背景，更新代理的行為而無需改變參數。然而，在非穩定環境中，這些代理面臨著保留與遺忘的困境：保留過時的見解會導致負面轉移，而丟棄它們則會在條件重現時導致災難性遺忘。我們確定了四個應對這一困境的要求——以結果為驅動的評估、持續的結構證據、非單調的知識生命周期和組合治理——並顯示現有的方法在經驗提取上投入過多，而在見解治理上投入不足。我們提出了一個三層架構——規則、證據和技能——通過一個以反饋為驅動的策展循環連接，來填補治理的空白。規則捕捉來自世界結果的提煉經驗；證據日誌追蹤每個規則在不同情節中的可靠性；技能則管理應用哪些規則、如何解決衝突以及何時應該放棄。以金融預測為案例研究，在這裡世界反饋自然豐富、雜訊多且非穩定，我們顯示相同的累積經驗要麼使性能低於零樣本基準，要麼根據策展循環的存在顯著提高準確性和風險調整回報。

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

2606.17577v1 by Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.

摘要：AI 驅動的工程工作流程在碰撞安全設計方面面臨特定挑戰：與氣動力學不同，碰撞事件涉及高度非線性的接觸動力學、材料非線性以及難以用數據驅動的替代模型捕捉的離散狀態轉換。據我們所知，我們提出了首個基於基礎模型的碰撞安全設計協調工作流程，該流程使得在行人保護方面進行替代輔助探索，將每次 CAE 模擬的評估時間從幾小時縮短到幾秒鐘。該工作流程整合了四個組件：(1) 一個基於 CAE 碰撞模擬訓練的替代模型，用於從設計參數預測行人腿部受傷指標，達到平均 $R^2=0.87$ 並提供無分佈的符合預測區間；(2) 多目標進化搜索 (NSGA-II) 用於在用戶指定的約束下發現多樣的可行參數集；(3) 一個基於變形的幾何生成器，將參數映射到保持拓撲的 3D 形狀；以及 (4) 一個自然語言界面，其中 LLM 協調工作流程，視覺-語言模型支持生成設計的語義比較。在一個汽車前保險杠的案例研究中，該工作流程從單次探索中產生了 35 種不同的安全合規替代方案，這一過程在傳統 CAE 迭代中需要數週時間。這些結果表明，基礎模型可以作為機器學習替代模型與基於物理的模擬之間的整合層，幫助將 AI 能力引入安全關鍵的工程領域。
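
The abstract mentions distribution-free conformal prediction intervals around the surrogate's outputs. One common way to obtain such intervals is split conformal prediction, sketched below under the assumption of absolute-residual scores on a held-out calibration set; the abstract does not say which conformal variant the authors use, and all names and numbers here are illustrative.

```python
import math

def conformal_radius(cal_true, cal_pred, alpha=0.1):
    """Split-conformal interval half-width from calibration residuals."""
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample rank of the score to use: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return scores[k - 1]

def predict_interval(point_pred, radius):
    """Symmetric interval around a surrogate point prediction."""
    return (point_pred - radius, point_pred + radius)

# Made-up calibration targets and surrogate predictions.
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_pred = [1.1, 1.8, 3.3, 3.6, 5.5, 5.4, 7.7, 7.2, 9.9]
radius = conformal_radius(cal_true, cal_pred, alpha=0.25)
low, high = predict_interval(4.0, radius)
```

With nine calibration points and `alpha=0.25`, the rank is 8, so the radius is the eighth-smallest absolute residual (about 0.8). The coverage guarantee of this construction relies on calibration and test points being exchangeable, which may not hold for designs far from the simulated training set.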

An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts

2606.17555v1 by Joseph Walusimbi, Joshua Benjamin Ssentongo

Banks simultaneously face signature-based fraud (card-not-present attacks, account takeover, ATM cloning) and behavioural financial crime (structuring, layering, mule networks, business email compromise) -- two threat families with fundamentally different detection requirements. Static rule engines that reliably catch brute-force and high-velocity events are structurally blind to business-email-compromise (BEC) payment redirection, session hijacking, and money-laundering layering, which are engineered to appear indistinguishable from legitimate activity at the individual transaction or session level. This paper presents an AI security agent for retail and corporate banking that addresses this gap through a three-component fusion architecture operating on two parallel event streams: a transaction stream (card fraud, ACH/wire fraud, AML categories) and a session stream (account takeover, session hijacking, SIM-swap, insider abuse). Each stream combines an LSTM sequence model capturing per-account behavioural history, a statistical velocity/threshold monitor, and a graph/network module capturing account-counterparty relationship patterns (fan-in, fan-out, pass-through ratio) for money-laundering detection. Experiments on a synthetic event log of 237,669 transactions and 113,508 sessions across 13 threat categories and 3,470 simulated accounts demonstrate overall F1 of 0.787 (transaction stream) and 0.867 (session stream) for the proposed model, versus 0.562/0.733 for a rule-based baseline and 0.655/0.713 for an LSTM-only baseline. The agent includes a customer-facing transaction-verification chatbot (96.6% identity verification accuracy, 86.8% mass-reset attack detection) and an analyst case-summary assistant (99.3% action-recommendation F1), with Critical-tier automated response latency under 0.43 ms at the 95th percentile.

摘要：銀行同時面臨基於簽名的詐騙（非持卡人攻擊、帳戶接管、ATM克隆）和行為金融犯罪（結構化、分層、騙子網絡、商業電子郵件妥協）——這兩類威脅具有根本不同的檢測需求。靜態規則引擎可靠地捕捉暴力破解和高速度事件，但在結構上對商業電子郵件妥協（BEC）支付重定向、會話劫持和洗錢分層等行為視而不見，這些行為被設計得在個別交易或會話層面上與合法活動無法區分。本文提出了一種針對零售和企業銀行的人工智慧安全代理，通過在兩個平行事件流上運行的三組件融合架構來填補這一空白：交易流（卡片詐騙、ACH/電匯詐騙、反洗錢類別）和會話流（帳戶接管、會話劫持、SIM交換、內部濫用）。每個流結合了一個捕捉每個帳戶行為歷史的LSTM序列模型、一個統計速度/閾值監控器，以及一個捕捉帳戶-對方關係模式（進入、退出、通過比例）的圖形/網絡模組，用於洗錢檢測。在237,669筆交易和113,508個會話的合成事件日誌上進行的實驗顯示，該模型的整體F1為0.787（交易流）和0.867（會話流），而基於規則的基線為0.562/0.733，僅LSTM基線為0.655/0.713。該代理包括一個面向客戶的交易驗證聊天機器人（96.6%的身份驗證準確率，86.8%的大規模重置攻擊檢測）和一個分析師案例摘要助手（99.3%的行動建議F1），其關鍵層級自動響應延遲在第95百分位數下低於0.43毫秒。
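
The graph/network module in the abstract relies on fan-in, fan-out, and a pass-through ratio per account. The sketch below computes these from a list of transfers. The abstract does not define the pass-through ratio, so the definition used here (share of received money that is forwarded on, capped at 1) is an assumption, as are the function and field names.

```python
from collections import defaultdict

def account_graph_features(transfers):
    """Per-account fan-in, fan-out, and pass-through ratio.

    `transfers` is a list of (sender, receiver, amount) tuples.
    """
    senders_of = defaultdict(set)    # account -> distinct payers
    receivers_of = defaultdict(set)  # account -> distinct payees
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for src, dst, amount in transfers:
        receivers_of[src].add(dst)
        senders_of[dst].add(src)
        outflow[src] += amount
        inflow[dst] += amount

    features = {}
    for acct in set(inflow) | set(outflow):
        total_in, total_out = inflow[acct], outflow[acct]
        # Assumed definition: fraction of incoming funds sent onward.
        ratio = min(total_out / total_in, 1.0) if total_in > 0 else 0.0
        features[acct] = {
            "fan_in": len(senders_of[acct]),
            "fan_out": len(receivers_of[acct]),
            "pass_through": ratio,
        }
    return features

# Three accounts pay into "m", which forwards almost everything to "x".
transfers = [
    ("a", "m", 100.0),
    ("b", "m", 100.0),
    ("c", "m", 100.0),
    ("m", "x", 285.0),
]
features = account_graph_features(transfers)
```

In this toy log, account `m` has fan-in 3, fan-out 1, and a pass-through ratio of 0.95, the kind of profile a mule-account heuristic might examine. A real system would likely compute these over time windows rather than the whole log.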

FoundCause: Causal Discovery with Latent Confounders from Observational Data

2606.17516v1 by Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.

摘要：因果發現從觀察數據中仍然具有挑戰性，因為需要在沒有干預的情況下恢復有向結構和潛在的混淆。我們提出了FoundCause，這是一種完全基於合成數據訓練的攤銷因果發現模型，能夠在單次前向傳遞中將數據集直接映射到因果圖。通過從大量模擬結構因果模型中學習，FoundCause 捕捉可轉移的統計模式，這些模式超越了單個數據集的範疇。該架構結合了幾個關鍵的歸納偏見以進行因果發現。它使用了一種對置換不變的Transformer編碼器，並在樣本和變數之間交替注意，以共同建模跨變數依賴和每個變數的分佈。通過統計條件注意注入的成對統計特徵源自於經典的不對稱性測量，指導模型朝向已知的因果信號。一個因子化解碼器將邊的存在與方向分開，而一個三角形細化模塊則使得對於更高階因果圖形（如鏈和碰撞器）進行推理成為可能。此外，基於可學習潛在標記的專用混淆模塊明確建模隱藏的共同原因，並且模型通過其掩蔽輸入表示明確處理缺失數據。據我們所知，FoundCause 是第一個明確建模潛在混淆的攤銷因果發現方法。FoundCause 在 15 個真實世界數據集上超越了 11 種經典的非攤銷方法（例如，PC、GES、NOTEARS 風格優化）和 4 種攤銷因果發現方法，實現了 $F_1$ 提升 +9.6%、AUROC 提升 +1.2%，以及相對於最強的非攤銷方法結構漢明距離減少 18.9%，同時在單次前向傳遞中進行推理。
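
The abstract reports results in structural Hamming distance (SHD). As a reference point, the sketch below computes SHD between two directed graphs. Conventions differ across papers; this version assumes graphs without self-loops and counts a missing edge, an extra edge, and a reversed edge as one error each, which may not match the paper's exact variant.

```python
def structural_hamming_distance(true_edges, pred_edges):
    """SHD between directed graphs given as sets of (parent, child) edges."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)

    def skeleton(edges):
        # Undirected adjacencies, ignoring orientation.
        return {frozenset(e) for e in edges}

    true_skel, pred_skel = skeleton(true_edges), skeleton(pred_edges)
    # Adjacencies present in exactly one graph: missing or extra edges.
    distance = len(true_skel ^ pred_skel)
    # Shared adjacencies whose orientation differs count once each.
    for pair in true_skel & pred_skel:
        a, b = tuple(pair)
        true_dirs = {e for e in ((a, b), (b, a)) if e in true_edges}
        pred_dirs = {e for e in ((a, b), (b, a)) if e in pred_edges}
        if true_dirs != pred_dirs:
            distance += 1
    return distance

true_graph = {("X", "Y"), ("Y", "Z"), ("X", "W")}
pred_graph = {("X", "Y"), ("Z", "Y"), ("W", "Q")}
shd = structural_hamming_distance(true_graph, pred_graph)
```

In this example the prediction misses X→W, adds W→Q, and reverses Y→Z, giving an SHD of 3. Lower is better, so the relative reduction quoted in the abstract corresponds to fewer such edge errors.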

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

2606.17506v1 by Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

摘要：社會偏見在大型語言模型（LLMs）中的評估主要集中在模型是否生成或暗示偏見內容。然而，隨著 LLMs 越來越多地被用作偏見的評判者，它們可能在評估偏見內容的方式上以更微妙的方式表現出社會偏見，而目前的方法並未系統性地捕捉到這一點。我們稱之為二階偏見：LLM 對社會偏見的判斷中的社會偏見，我們通過一項新穎的、以哲學為基礎的推理任務來進行評估。基於權利認識論，我們將偏見概念化為錯位的基礎知識，這種知識塑造了代理人的理性探究，並推導出一項邏輯推理任務，讓 LLM 判斷誰對偏見文本是可接受或不可接受的。我們開發了兩個簡單的指標來衡量 LLM 評判者在推斷可接受性的人口統計時的偏見程度，這些推斷在缺乏充分支持的情況下是如何變化的，以及這些推斷在受到偏見文本針對的群體之間的變化。評估開放和封閉模型時，我們發現我們的任務通過顯示模型判斷中的偏見而逃避了安全防護措施。它在目標群體之間系統性地變化，反映了隱含的社會地圖，並顯示模型仍然會受到人口標籤的觸發。我們的工作指出了在判斷任務中對 LLM 偏見評估的必要性，以及在自然語言處理中對偏見評估的更理論基礎的方法的廣泛需求。我們在 https://github.com/uofthcdslab/second-order-bias 上發布了我們的代碼和模型響應。

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

2606.17474v1 by Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

摘要：大型語言模型（LLMs）越來越被考慮用於臨床諮詢任務，然而大多數醫療評估仍然是靜態的、單回合的或狹隘的結果導向，限制了它們反映現實世界護理的連續性、不確定性和互動性的能力。在此，我們提出了AIPatient Arena，一個基於電子健康紀錄（EHRs）的評估框架，用於評估LLMs在八個臨床能力維度上的臨床效用。該框架將EHR數據整合到患者特定的知識圖譜中，使多回合的醫生-患者互動成為可能。我們在一個由437名患者組成的主要隊列以及兩個分佈外的驗證隊列（119名和67名患者）上應用AIPatient Arena。我們觀察到LLMs在醫療面試提問技能（QS；平均分數，4.43-4.99/5）、倫理和專業行為（ET；4.38-4.93/5）以及臨床解釋的清晰性和透明性（EX；3.80-4.72/5）方面表現良好。在信息整合（II；3.19-4.21/5）和藥物安全性與合理性（MS；3.13-3.78/5）方面表現中等，但在處理模糊患者反應（HR；2.57-3.32/5）、信息覆蓋（IC；2.08-3.02/5）以及診斷準確性和推理（Dx；2.63-3.55/5）方面持續存在弱點。基於過程的評估揭示了重複的互動失敗，包括重複提問、遺漏過去病史以及對不確定性的處理不足。更豐富的對話上下文改善了診斷推理，但在治療計劃方面的增益有限。這些發現表明，僅僅依賴最終答案的準確性不足以評估臨床準備情況，並突顯了評估模型在諮詢過程中如何收集、解釋和傳達信息的重要性。AIPatient Arena提供了一個基於EHR的框架，用於針對工作流程的醫療LLMs預部署評估。

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

2606.17459v1 by Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

摘要：評估大型語言模型（LLMs）的決策能力正成為一項日益重要的研究優先事項，但現有的基準主要集中在孤立的認知任務上，例如推理、知識檢索和在典型情境中的經濟理性。這些評估忽略了真實執行決策的定義挑戰：在信息不對稱、組織約束和時間依賴的情況下，整合來自專業利益相關者的相互矛盾建議。我們介紹了 \textsc{CEO-Bench}，這是一個多代理基準，評估 LLM 在 CEO 級別的戰略資源重新分配上的表現——在一個多輪、約束豐富的組織環境中重新指引資本的過程。在 \textsc{CEO-Bench} 中，LLM 代理從四位角色條件的高管顧問（CFO、CTO、COO、CMO）那裡接收相互矛盾的建議，每位顧問都有私有信號和不同的優先事項，並必須將這些建議綜合成一個具體的分配計劃，該計劃將在四個維度上進行評估：角色整合、條件勇敢、歷史敏感判斷和計劃有效性。對五個前沿模型在 13 個場景中的實驗顯示，所有模型都達到了高結構有效性，但在戰略校準上卻有明顯的分歧——這是最難的能力層級。我們識別了系統性的失敗模式，包括單一顧問捕獲、在模糊情況下的保守默認和歷史健忘，並發現了一個結構整合-勇敢的權衡：更深入地參與相互矛盾觀點的模型往往會產生不那麼果斷的行動。這些發現描繪了 LLM 作為組織決策者的當前能力邊界，並為未來 AI 輔助的高管系統的設計提供了信息。

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

2606.17437v1 by Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

摘要：自動化的標準超聲心動圖視圖分類對於高效的臨床工作流程至關重要，但面臨三個主要挑戰。首先，公開可用的數據集稀缺，且在規模和視圖覆蓋方面有限。其次，一些現代視頻級架構在超聲心動圖視圖分類中的性能仍未被充分探索。第三，一些視圖類別表現出高度相似的空間外觀，使得單幀特徵不足以進行區分，而異質幀質量則使得穩健的時間信息融合變得複雜。為了解決這些挑戰，我們發布了九個視圖的超聲心動圖視頻（EV9V）數據集，包含5,138個視頻、910,579幀和9個標準視圖，據我們所知，這是目前最大的公開可用超聲心動圖視頻數據集。使用EV9V，我們系統性地基準測試了代表性的視頻分類架構，包括卷積神經網絡（CNNs）、遞歸神經網絡（RNNs）和Transformer。此外，我們提出了一個時空融合模型（STFM），這是一個高效的雙流CNN-LSTM（長短期記憶）框架，能夠共同捕捉空間解剖結構和時間心臟動力學。所提出的框架利用不確定性感知學習，在訓練期間優先抽樣代表性視頻片段，並在推斷期間進行基於證據的融合，從而提高對超聲心動圖視頻中幀質量變化的穩健性。大量實驗表明，我們的方法在各種視頻分類模型中達到了競爭性能，驗證了不確定性感知時空學習在超聲心動圖視圖分類中的有效性。代碼可在 https://github.com/bgx666/stfm 獲得。

SoK: AI-Augmented Binary Reversing

2606.17398v1 by Yujeong Kwon, Yiyue Zhang, Shakhzod Yuldoshkhujaev, Kexin Pei, Dokyung Song, Hyungjoon Koo

Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field's evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems.

摘要：二進位反向工程對於軟體理解、漏洞發現、惡意程式調查和韌體審計是基本的。然而，由於編譯過程中語義資訊的不可逆損失，它仍然固有地具有挑戰性。最近在機器學習、大型語言模型（LLMs）和自主人工智慧系統方面的進展，加速了AI增強的二進位反向工程的採用。然而，隨之而來的研究成果在反向工程領域、工件表示、學習方法和評估實踐上變得越來越支離破碎。本文呈現了對AI增強的二進位反向工程的首個全面知識系統化。我們分析了自2015年以來發表的144篇研究論文，並根據推理任務將其組織為22個二進位反向工程領域。我們進一步介紹了一個統一的分類法，涵蓋傳統和AI增強的反向工程流程。我們的分類法連接了傳統分析技術、二進位衍生工件、表示策略、學習範式和下游推理任務，同時澄清了LLMs和自主人工智慧系統的新興角色。通過建立共同的詞彙和結構化框架，我們提供了對過去十年該領域演變的整體觀點。我們的研究揭示了看似不同的方法之間的共同結構，突顯了持續存在的技術挑戰和評估空白，並識別了未來研究的有前景機會。總體而言，這些見解澄清了該領域的當前狀態，並為下一代可靠且可擴展的AI增強二進位反向工程系統提供了基礎。

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

2606.17379v1 by Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Accurate intraoperative liver registration is challenging because soft-tissue deformation is substantial while intraoperative measurements are sparse. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as context samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

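Summary: Accurate intraoperative liver registration is challenging because soft-tissue deformation is substantial while intraoperative measurements are sparse. Biomechanical models use prior knowledge to regularize this ill-posed problem but show persistent prediction bias due to simplifying assumptions, whereas data-driven learning approaches struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To propagate sparse observations over long ranges, we treat the sparse intraoperative measurements as context samples in which input-output pairs of the residual deformation function are fully observed, which casts the problem as learning to learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset show improved registration accuracy and generalization over rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.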

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

2606.17328v1 by Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.

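Summary: LLM agents increasingly maintain long-term memory of user facts across sessions. Such memory, however, is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use rather than retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.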
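
The difference between pooled accuracy and a per-knowledge-point view can be illustrated with a minimal sketch. The row layout, fact names, and helper functions below are hypothetical and are not taken from the MemTrace benchmark; they only show why scoring question rows independently can hide a failure that is visible once rows are grouped by fact.

```python
from collections import defaultdict

# Hypothetical question rows: (knowledge_point, question_type, answered_correctly).
rows = [
    ("home_city", "current", True),
    ("home_city", "earlier", True),
    ("home_city", "trajectory", False),
    ("employer", "current", True),
    ("employer", "earlier", True),
    ("employer", "trajectory", False),
]

def pooled_accuracy(rows):
    """Score every question row independently, ignoring which fact it probes."""
    return sum(ok for _, _, ok in rows) / len(rows)

def per_fact_profile(rows):
    """Group outcomes by knowledge point, then by question type."""
    profile = defaultdict(dict)
    for fact, question_type, ok in rows:
        profile[fact][question_type] = ok
    return dict(profile)

def fully_tracked(profile):
    """Facts for which every probed question type was answered correctly."""
    return [fact for fact, by_type in profile.items() if all(by_type.values())]
```

In this toy data the pooled accuracy is 4/6, yet no fact is fully tracked, because every fact fails its trajectory question.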

Nothing from Something: Can a Language Model Discover 0?

2606.17289v1 by Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new - and potentially logically more powerful - mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately $50\%$, showing that language abilities can scaffold mathematical discovery in neural models.

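Summary: AI systems based on artificial neural networks are being developed with the aim of pushing the boundary of human mathematical knowledge. A key question for these systems is how far they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalization in human cognition. In this work, we use simple arithmetic as a case study to examine how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) GPT-2-sized language models cannot perform this generalization at test time regardless of language pretraining, but (2) models improve substantially after training on tens or hundreds of examples of zero. In addition, we find that language pretraining reduces the number of required examples by approximately $50\%$, showing that language abilities can scaffold mathematical discovery in neural models.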

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

2606.17257v1 by Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.

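Summary: Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability or apply external filters that adversarial prompts bypass trivially. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and that a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to the hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we show that although safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (about 50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.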
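
The inference-time step described above, adding a single direction to a hidden state, can be sketched in a few lines. This is an illustration under stated assumptions, not the REINS implementation: the abstract finds its direction with Supervised PCA, whereas the sketch substitutes a simpler difference-of-class-means direction, and the function names, the toy 2-D activations, and the strength `alpha` are all hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector]

def safety_direction(safe_acts, unsafe_acts):
    # Stand-in for Supervised PCA: the direction from the unsafe mean to the safe mean.
    mu_safe, mu_unsafe = mean(safe_acts), mean(unsafe_acts)
    return unit([s - u for s, u in zip(mu_safe, mu_unsafe)])

def steer(hidden, direction, alpha):
    """Shift one hidden state along the direction; no weights are updated."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Signed coordinate of a hidden state along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))
```

With these toy activations, steering increases the projection onto the safe side while leaving the orthogonal component unchanged, which is the sense in which the edit is targeted.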

Rift: A Conflict Signature for Deception in Language Models

2606.17229v1 by Petr Nyoma

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

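Summary: A model that lies while knowing the truth is the central case that ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature that distinguishes it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) with a naive liar (fine-tuned to emit the same wrong answers, with no honest training). Both produce identical wrong outputs, so any difference is about knowledge conflict rather than incorrectness. We find that deceptive forward passes carry a conflict signature, a residual rank 2.1-2.3x higher than naive-liar passes on the same wrong answer, which is strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (the model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), survives a simultaneous change of architecture and format (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 in both directions). Honest limitations and six negative experiments are documented in full.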

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

2606.17220v1 by Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

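Summary: Legal case retrieval remains challenging because of the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have made notable progress, empirical studies show that BM25 remains a strong baseline in this domain. This motivates a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework gives an LLM-based agent an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate the method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results show that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses of the mechanisms underlying self-evolution. Our findings show that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set through self-evolution.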

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

2606.18293v1 by Callum Barbour

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed `vibe coding.' It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

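Summary: Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may permanently change how we interact with computers. We have observed growth in the use of natural language prompts to build applications and coding infrastructure without underlying knowledge of the field, a practice that has been dubbed vibe coding. It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that has been conceived. As far as input method is concerned, vibe coding promises to be the endpoint of high-level programming: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks and to analyse the benchmarks used to measure its software engineering ability. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python, to provide scoped insight on the matter.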

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

2606.17203v1 by Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib

Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.

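Summary: Multi-agent AI systems are increasingly used to automate software engineering tasks, including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework in which a shared knowledge graph serves both as centralized semantic memory and as a coordination surface through which agents assess and build on each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline that combines embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol that governs pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study, measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.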
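
The consistency protocol named above (confidence threshold gating plus confidence divergence detection) can be sketched as a small decision function. The `Link` fields, the threshold values, and the three outcome labels are illustrative assumptions; the abstract does not specify them, and the paper's actual protocol may order or combine the checks differently.

```python
from dataclasses import dataclass

@dataclass
class Link:
    """A candidate traceability link with two confidence readings (hypothetical fields)."""
    source: str
    target: str
    derivation_conf: float   # confidence when the link was first derived
    validation_conf: float   # confidence when a downstream agent re-checked it

def review(link, accept_min=0.7, max_divergence=0.3):
    """Return 'conflict', 'hold', or 'accept' for one link."""
    # Divergence detection: the two stages disagree too strongly to trust either.
    if abs(link.derivation_conf - link.validation_conf) > max_divergence:
        return "conflict"
    # Threshold gating: low-confidence links do not propagate downstream.
    if min(link.derivation_conf, link.validation_conf) < accept_min:
        return "hold"
    return "accept"
```

Checking divergence before the threshold means a link with one very high and one very low score is routed to conflict resolution rather than silently held.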

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

2606.17188v2 by Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.

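Summary: Current multilingual evaluations of Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve a visual task in one script while failing the identical task in another, with accuracy differences reaching 16%. Crucially, visual input raises absolute performance uniformly but does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings show that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.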

RepSelect: Robust LLM Unlearning via Representation Selectivity

2606.17168v1 by Filip Sondej, Yushi Yang, Adam Mahdi

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

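Summary: Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods, however, are easily reversed by fine-tuning or few-shot prompting, which suggests that their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, which makes unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories (biohazardous knowledge and abusive tendencies) and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared with five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is therefore an important step towards deep and robust LLM forgetting.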

A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

2606.16944v1 by Nikolos Gurney

Theory of mind (ToM), the capacity to ascribe mental states to others and use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address \emph{how} to mentalize, but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model formalized as a directed acyclic graph (DAG), treating ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node producing engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes across social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.

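Summary: Theory of mind (ToM), the capacity to ascribe mental states to others and to use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address how to mentalize but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model, formalized as a directed acyclic graph (DAG), that treats ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node that produces engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes to social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.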

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

2606.16925v1 by Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar

Time-series foundation models show strong transfer performance when given a non-empty history window. However, true cold-start scenarios, where a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion), a framework that replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID maps textual metadata into a shared semantic space using a frozen multilingual embedding model and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines this forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.

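Summary: Time-series foundation models show strong transfer performance when given a non-empty history window. True cold-start scenarios, in which a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion), a framework that replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID uses a frozen multilingual embedding model to map textual metadata into a shared semantic space and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines that forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.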
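
The first stage described above, a base forecast built only from semantically related neighbors, can be sketched as a similarity-weighted average. This is a simplified illustration, not RAID itself: the embeddings are toy vectors rather than outputs of a multilingual model, cosine similarity and top-k selection are assumed choices, and the gated diffusion refinement is omitted entirely.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def base_forecast(query_emb, catalog, k=2):
    """Forecast for an item with no history, from its k most similar catalog items.

    catalog: list of (embedding, series) pairs for items that do have observations.
    Assumes equal-length series and at least one neighbor with positive similarity.
    """
    neighbors = sorted(catalog, key=lambda item: cosine(query_emb, item[0]), reverse=True)[:k]
    weights = [max(cosine(query_emb, emb), 0.0) for emb, _ in neighbors]
    total = sum(weights)
    horizon = len(neighbors[0][1])
    return [
        sum(w * series[t] for w, (_, series) in zip(weights, neighbors)) / total
        for t in range(horizon)
    ]
```

Because the forecast depends only on metadata embeddings, the new item needs no observations of its own, which is the cold-start property the abstract emphasizes.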

LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

2606.16908v1 by Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present \textsc{LESS}, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. \textsc{LESS} implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-$K$ inter-step Jensen--Shannon divergence. We evaluate \textsc{LESS} on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. \textsc{LESS} improves average accuracy over strong training-free adaptive samplers while using $72.1\%$ fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.

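Summary: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, which enables parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps chosen before decoding, spending computation on positions that are already stable and sometimes committing unstable ones too early. We present \textsc{LESS}, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. \textsc{LESS} implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-$K$ inter-step Jensen-Shannon divergence. We evaluate \textsc{LESS} on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. \textsc{LESS} improves average accuracy over strong training-free adaptive samplers while using $72.1\%$ fewer reverse steps than fixed-budget decoding. Because each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.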
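
The joint stability rule above combines three conditions: confidence, persistence, and distributional stability. A minimal sketch of such a rule for a single masked position follows. The thresholds, the window length, and the choice to compare only the last two steps on the current top-K support are assumptions made for illustration; the abstract does not give these details.

```python
import math

def renormalize(p, indices):
    """Restrict a distribution to the given indices and rescale it to sum to 1."""
    mass = sum(p[i] for i in indices)
    return [p[i] / mass for i in indices]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def eligible(history, conf_min=0.9, persist=3, k=2, js_max=0.01):
    """history: per-step probability vectors for one position, oldest first.

    Assumes persist >= 2 and strictly positive probabilities.
    """
    if len(history) < persist:
        return False
    recent = history[-persist:]
    prev, cur = recent[-2], recent[-1]
    # 1. Confidence: the current top-1 probability is high enough.
    confident = max(cur) >= conf_min
    # 2. Persistence: the same token is top-1 in every recent step.
    top1 = [max(range(len(p)), key=p.__getitem__) for p in recent]
    persistent = len(set(top1)) == 1
    # 3. Stability: top-K distributions of consecutive steps barely differ.
    top_k = sorted(range(len(cur)), key=cur.__getitem__, reverse=True)[:k]
    stable = js_divergence(renormalize(prev, top_k), renormalize(cur, top_k)) <= js_max
    return confident and persistent and stable
```

Requiring all three conditions at once is what distinguishes this kind of rule from a plain confidence threshold: a position whose top-1 token only recently changed stays masked even if its current confidence is high.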

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

2606.16811v1 by Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi

In the development of large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress, but they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.

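Summary: In the development of large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress, but they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples to judge whether intermediate reasoning traces generated by an LLM are valid. An entropy-based confidence threshold then filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that the method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and the entropy filtering are essential for scalable, noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, the method offers a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.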
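
The filtering step above, keeping only traces that the verifier judges valid with high confidence, can be sketched with the binary entropy of the classifier's output. The input format, the 0.5 decision boundary, and the entropy cutoff are hypothetical; the abstract says only that an entropy-based confidence threshold is used.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is exactly 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_pseudo_labels(scored_traces, max_entropy=0.5):
    """Keep traces predicted valid with low-entropy (confident) scores.

    scored_traces: list of (trace, p_valid) pairs, where p_valid is the
    classifier's estimated probability that the reasoning trace is correct.
    """
    kept = []
    for trace, p_valid in scored_traces:
        if p_valid >= 0.5 and binary_entropy(p_valid) <= max_entropy:
            kept.append(trace)
    return kept
```

Both checks are needed: a score of 0.02 is low-entropy but confidently invalid, while a score of 0.6 is nominally valid but too uncertain to use as a training label.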

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

2606.16774v1 by Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang

Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs a structured, diverse, and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify, and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use, and generalization on challenging benchmarks.

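Summary: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems such as OpenClaw. In this work, we aim to develop a framework that automatically constructs reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that builds a structured, diverse, and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify, and compose effective skills through two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen draws on the collective knowledge of multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess uses multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring, which aggregates independent evaluations into a robust estimate of skill effectiveness, and (2) collective transferability scoring, which explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees together with skill-augmented training data, enabling models to learn and use skills effectively. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and the homogeneous or suboptimal solutions it leads to. As a result, our trained model, OpenClaw-Skill, shows outstanding agentic capabilities in long-horizon planning, tool use, and generalization on challenging benchmarks.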

Misinformation Propagation in Benign Multi-Agent Systems

2606.16710v1 by Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

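Summary: Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when a single agent reasons from incorrect or misleading context, for example from tool calls, because errors may propagate through agent interactions. This work studies that risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared with single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol: consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.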

User as Code: Executable Memory for Personalized Agents

2606.16707v1 by Bojie Li

A personalized AI agent needs a user memory: a persistent model of who the user is, built across many conversations and consulted on each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat store of facts, and consulted by retrieval -- fetching the entries most similar to the current request. Such "bag-of-facts" memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it struggles to resolve contradictions, aggregate over many records, or enforce rules. We argue that user memory should instead be executable. We introduce User as Code (UaC), a paradigm in which an agent's model of a user is a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules that govern it, so representing and reasoning about the user happen in one medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code. This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage emerges where representation matters most. On aggregate questions over a user's history -- "how many international trips did I take last year?" -- retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can surface unsolicited, safety-critical alerts -- such as a newly prescribed drug that conflicts with an allergy recorded months earlier -- a capability query-driven memory cannot provide.

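Summary: A personalized AI agent needs a persistent user memory, built across many conversations and consulted on each new one. Today that memory is usually unstructured text, a knowledge graph, or a flat fact store queried by similarity retrieval; such "bag-of-facts" memory recalls individual facts well but struggles to resolve contradictions, aggregate over many records, or enforce rules. User as Code (UaC) instead treats the agent's model of the user as a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules that govern it, fed by an append-only log that is periodically checkpointed into typed code. UaC matches a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO), stays near-perfect (99%) on aggregate questions where retrieval-based memory collapses (6-43%), and can raise unsolicited safety-critical alerts, such as a newly prescribed drug that conflicts with an allergy recorded months earlier.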
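
The idea of typed state plus executable rules can be illustrated with a minimal sketch. Everything named here (`UserState`, `Trip`, `record`, `DRUG_ALLERGENS`, `international_trips_in`) is hypothetical and not taken from the paper, and the sketch applies each logged fact to the typed state immediately rather than checkpointing periodically as the abstract describes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Trip:
    destination: str
    country: str
    start: date


@dataclass
class UserState:
    home_country: str
    allergies: Set[str] = field(default_factory=set)
    prescriptions: List[str] = field(default_factory=list)
    trips: List[Trip] = field(default_factory=list)
    # Append-only log: facts are added here and never removed.
    log: List[Tuple[str, object]] = field(default_factory=list)


# Hypothetical lookup table: drug name -> allergen class it belongs to.
DRUG_ALLERGENS = {"amoxicillin": "penicillin"}


def check_allergy_conflicts(user: UserState) -> List[str]:
    """Rule: flag any prescription whose allergen class is a recorded allergy."""
    alerts = []
    for drug in user.prescriptions:
        allergen = DRUG_ALLERGENS.get(drug)
        if allergen in user.allergies:
            alerts.append(f"{drug} conflicts with recorded {allergen} allergy")
    return alerts


def record(user: UserState, kind: str, value: object) -> List[str]:
    """Log a fact, update typed state, then re-run the rules on the new state."""
    user.log.append((kind, value))
    if kind == "allergy":
        user.allergies.add(value)
    elif kind == "prescription":
        user.prescriptions.append(value)
    elif kind == "trip":
        user.trips.append(value)
    else:
        raise ValueError(f"unknown fact kind: {kind}")
    return check_allergy_conflicts(user)


def international_trips_in(user: UserState, year: int) -> int:
    """Aggregate question answered by computation over typed state."""
    return sum(
        1 for t in user.trips
        if t.start.year == year and t.country != user.home_country
    )
```

Because `record` re-runs the rule on every state change, an alert can surface without anyone asking for it, and the trip count is a single expression over typed records rather than a search over text.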

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

2606.16684v1 by Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.

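Summary: Vibration-based bearing fault diagnosis faces three interrelated measurement challenges: the trade-off between global statistical feature efficiency and local transient signal fidelity, weak traceability of measurement features to the underlying fault physics, and ineffective multi-source information fusion across diagnostic scales. The paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three in one diagnostic pipeline. An 81-dimensional measurement descriptor derived from bearing kinematic theory and characteristic defect frequencies enables real-time fault screening at about 20 ms per sample; a fault-adaptive signal segmentation mechanism directs analysis toward fault-relevant waveform regions without manual feature engineering; and structured fault-mechanism knowledge is encoded implicitly in model parameters during training, so no external knowledge is needed at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework reaches 98.49% diagnostic accuracy at a 12.6-fold lower computational cost than signal-level baselines, and interpretability analysis confirms that feature activations align with established bearing fault mechanics.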
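
The abstract does not list the 81 descriptor components, but the characteristic defect frequencies it refers to follow from textbook bearing kinematics. The sketch below computes the four standard ones, assuming a rotating inner race and a stationary outer race; the function name and parameter names are illustrative only.

```python
import math


def bearing_defect_frequencies(shaft_hz, n_balls, ball_d, pitch_d,
                               contact_angle_deg=0.0):
    """Textbook characteristic defect frequencies in Hz.

    Assumes the inner race rotates at shaft_hz and the outer race is fixed.
    ball_d and pitch_d must share the same length unit.
    """
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        # Fundamental train (cage) frequency
        "FTF": 0.5 * shaft_hz * (1 - ratio),
        # Ball pass frequency, outer race
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),
        # Ball pass frequency, inner race
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),
        # Ball spin frequency
        "BSF": (pitch_d / (2 * ball_d)) * shaft_hz * (1 - ratio ** 2),
    }
```

A quick sanity check on any such calculation is that BPFO and BPFI sum to the number of rolling elements times the shaft frequency.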

The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

2606.16649v1 by Christopner Koch, Joshua A. Wellbrock

Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near-term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium-complexity business processes. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human-centered capability with responsibility and accountability retained by people.

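Summary: Agentic AI marks a new phase of enterprise automation: unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, use tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies this can reduce administrative burden, speed up routine processes, and improve the use of organizational knowledge. The paper argues that the near-term value lies in controlled partial autonomy for simple and medium-complexity business processes rather than in full autonomy or workforce reduction, and proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. It concludes that Agentic AI can become a productivity lever when implemented as a human-centered capability with responsibility and accountability retained by people.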

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

2606.16633v1 by Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

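Summary: Recent vision token pruning methods preserve model performance under moderate token budgets but become unstable under ultra-low budgets. The analysis shows that as the pruning budget shrinks, accuracy degradation comes with larger feature distribution shifts, and the size of the shift correlates strongly with the performance drop. The authors introduce a lightweight distribution consistency metric that estimates the shift between retained and full tokens, and propose a two-stage pruning framework: Anchor-Context Graph Recovery (ACGR) transfers contextual information before tokens are removed, and Text-Aware Token Cluster Selection (TATCS) dynamically re-selects representative tokens when severe shift is detected. Experiments show better and more stable performance under ultra-low budgets, retaining 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.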

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

2606.16629v1 by Mohammed Amine Mouhoub

Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

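Summary: Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is an especially demanding setting: answers must be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI, organized around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. It argues that fluency in Arabic is not sufficient: reliable systems need curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure faithfulness, source validity, and reasoning quality as well as answer accuracy. The survey closes with a research agenda for hallucination-resistant Islamic AI systems.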

VeriGraph: Towards Verifiable Data-Analytic Agents

2606.16603v1 by Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.

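Summary: LLM-based agents are strong at data-intensive analytical tasks, but their outputs are rarely verifiable because reasoning is recorded as a linear text trajectory in which deterministic computations over raw data and semantic deductions over natural-language claims are entangled. VeriGraph is a traceable neuro-symbolic reasoning framework in which the agent builds an explicit heterogeneous evidence directed acyclic graph (DAG) during execution, using three evidence-expansion primitives (computational, grounding, and derivational expansion) to connect raw data, interpreter variables, computed results, and natural-language claims. Structural traceability then reduces to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. A graph-based policy optimization strategy with a composite reward jointly supervises answer correctness, computational integrity, and derivational coherence. On four benchmarks VeriGraph-8B achieves the highest overall score among all baselines, with an 87.61% Grounding Rate under the claim-level evidence support evaluation. Code is available at https://github.com/ignorejjj/VeriGraph.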
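
The reduction of structural traceability to graph reachability can be sketched in a few lines. This is an illustration only, not the VeriGraph implementation: the edge direction (from evidence to what it supports), the node names, and both function names are assumptions. Reachability here means at least one path from some raw source; it does not check that every premise of a claim is grounded.

```python
from collections import deque


def reachable_from(sources, edges):
    """Breadth-first search over directed (src, dst) edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def untraceable_claims(raw_sources, claims, edges):
    """Return the claims that no raw data source can reach."""
    reached = reachable_from(raw_sources, edges)
    return [claim for claim in claims if claim not in reached]


# Hypothetical evidence graph: file -> variable -> result -> claims.
edges = [
    ("sales.csv", "df"),
    ("df", "mean_revenue"),
    ("mean_revenue", "claim_1"),
    ("claim_1", "claim_2"),
    ("hunch", "claim_3"),  # claim_3 rests on nothing derived from data
]
flagged = untraceable_claims(["sales.csv"], ["claim_1", "claim_2", "claim_3"], edges)
```

In this toy graph only `claim_3` is flagged, since its sole supporting node is not derived from any raw data source.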

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

2606.16591v2 by Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.

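Summary: Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, which makes tools a central interface for acting in realistic digital environments. As tool ecosystems grow to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. SING is an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and retrieves tools dynamically as the task state evolves. On a unified corpus of 7,471 tools and three real-world tool-use benchmarks, SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%.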

Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict

2606.17119v1 by Sozan Sulaiman Maghdid, Tarik Ahmed Rashid, Shavan Askar

Cyber-physical systems have introduced new threats and new challenges for detection and immediate response. This study examines how Graph Neural Networks (GNNs) can support cybersecurity and drone management in a cyber-physical system involving cyber intrusions and unmanned aerial vehicles (UAVs). Building on the structural understanding that graph neural networks provide, the work presents an integrated procedure that allows intrusion detection systems to learn underlying network structures, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were created to provoke drone responses, showing that graph-based learning can assist with situational awareness, swarm coordination, and adaptive maneuvering. In the performance evaluation, the method achieves a detection rate of 94.2, an average area under the receiver operating characteristic (ROC) curve of 0.955, and an average response time of 1.4 seconds. Comparative experiments show that the proposed GraphSAGE network is more effective than Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) under identical conditions. These findings indicate that graph neural networks can support intrusion prevention and response in dynamic cyber-physical systems.

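Summary: Cyber-physical systems bring new threats and challenges, particularly for detection and immediate response. The study examines how Graph Neural Networks (GNNs) can support cybersecurity and drone management in a setting that involves cyber intrusions and unmanned aerial vehicles (UAVs). It presents an integrated procedure in which intrusion detection systems learn underlying network structures, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were built to provoke drone responses, showing that graph-based learning can assist situational awareness, swarm coordination, and adaptive maneuvering. The reported detection rate is 94.2, with an average area under the ROC curve of 0.955 and an average response time of 1.4 seconds, and the proposed GraphSAGE network outperforms GCNs and GATs under identical conditions.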

Kairos: A Native World Model Stack for Physical AI

2606.16533v2 by Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

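Summary: World models are shifting from passive visual generators to foundational, operational infrastructure for Physical AI: they must acquire world knowledge natively from heterogeneous experience, maintain persistent states over long horizons, and run efficiently under real deployment constraints. Kairos is a native world model stack built around these requirements. (1) It learns the world through a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum that organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) It maintains the world by unifying understanding, generation, and prediction in a Native Unified Architecture with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention keeps a persistent global memory; formal bounds show that this temporal factorization limits error accumulation. (3) It runs the world through a Deployment-Aware System Co-Design that supports low-latency rollout generation on server and consumer-grade hardware. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show top-level performance with a strong efficiency-capability trade-off.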

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

2606.16523v1 by Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.

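Summary: Knowledge is managed through Wikipedia and software through GitHub, but agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. The demonstration covers the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki points to a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.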

Model Graph Inductive Learning for Knowledge Graph Completion

2606.16509v1 by Mohommad Esmaei Khani, Mahdieh Hasheminejad, Ali Taherkhani, Hossein Hajiabolhassan

Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (MGIL), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.

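Summary: Link prediction in knowledge graphs depends fundamentally on the quality of the learned entity and relation embeddings. Most existing methods derive these embeddings by aggregating only each entity's local neighborhood and neglect the global structure of the graph, so they miss higher-level structural patterns that matter for accurate and generalizable link prediction. Model Graph Inductive Learning (MGIL) constructs a model graph by clustering entities according to the similarity of their incoming and outgoing relational structures or their entity types. A GNN applied to this model graph produces embeddings that capture a global view of the knowledge graph, and these embeddings then serve as high-quality initial features for the original knowledge graph in place of random initialization, giving more stable and expressive representations. Experiments on standard and recently proposed inductive benchmarks show that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction.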

REFLEX: Reflective Evolution from LLM Experience

2606.16496v1 by Pan Wang

Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

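Summary: Large multimodal language models (LLMs) have become powerful tools for guiding evolutionary search toward interpretable programmatic policies. Existing frameworks, however, rely on a single monolithic model call that both interprets visual behavioral evidence and synthesizes corrective code; this diagnosis-repair entanglement creates an opaque feedback loop, obscures the rationale behind mutations, and prevents algorithmic insights from being retained across independent runs. REFLEX is a train-free evolutionary framework that structurally decouples visual diagnosis from code generation. A vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses; a text-optimized Actor then synthesizes child policies from these diagnoses together with a persistent, self-evolving Skill Memory of reusable code snippets. The architecture provides transparent mutation traces and enables cross-run programmatic knowledge transfer. Evaluations on control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task show exceptional sample efficiency: REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander.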

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

2606.16494v1 by Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within a question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

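Summary: Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions beyond their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In text-only long-context LLMs, use of retrieved context follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024); whether this carries over to deployed multimodal KB-VQA was open. The authors design the first controlled probe of reader-side position dependence in multimodal KB-VQA, a gold-position protocol in which only the gold passage's prompt slot varies within a question, and run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points in every reader-by-benchmark cell, an effect they call "Lost at the End". A text-only control shows the multimodal setting amplifies an existing text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) leave the gap intact. The findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; the protocol is released as a controlled instrument for evaluating such interventions.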
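
A gold-position protocol of this kind can be sketched as a prompt generator that holds the distractors fixed and moves only the gold passage. This is a text-only illustration under assumptions: the function name, the `[i]` slot labels, and the prompt template are hypothetical, and the paper's multimodal setting (which also includes an image) is not reproduced.

```python
def gold_position_prompts(question, gold, distractors, k):
    """Build k prompts that differ only in the slot holding the gold passage.

    The first k - 1 distractors are kept in a fixed order, so any change in
    reader accuracy across prompts is attributable to the gold slot alone.
    """
    if len(distractors) < k - 1:
        raise ValueError("need at least k - 1 distractor passages")
    fixed = distractors[: k - 1]
    prompts = []
    for slot in range(k):
        passages = fixed[:slot] + [gold] + fixed[slot:]
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
        prompts.append((slot, f"{context}\n\nQuestion: {question}\nAnswer:"))
    return prompts


# Example with k = 3: the gold passage visits slots 0, 1, and 2 in turn.
prompts = gold_position_prompts("Who built it?", "GOLD", ["d1", "d2"], k=3)
```

Comparing reader accuracy on the slot-0 prompt against the slot-(k - 1) prompt gives the gold-at-first versus gold-at-last gap for that question.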

Unified Multimodal Model for Brain MRI Imputation and Understanding

2606.16484v1 by Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

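Summary: Multimodal large language models (MLLMs) hold great potential for medicine because they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed, and interpreted in natural language. The field is constrained by scarce high-quality training data and by frequent missing data in real clinical settings. UniBrain is a unified multimodal model for brain magnetic resonance image (MRI) analysis that uses a unified training strategy to perform imaging modality imputation and brain image understanding jointly. During training, an interleaved, description-enriched data flow trains the model autoregressively, enabling medical reasoning with generated multimodal data; a self-alignment strategy uses dense image embeddings to learn fine-grained anatomical features without detailed image captions; and a dynamic hidden state mechanism alleviates exposure bias during long-context multimodal inference. Experiments on a multi-disease brain MRI dataset show high performance for image imputation, understanding, and disease diagnosis under varying degrees of modality incompleteness.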

Knowledge Graphs