arxiv-daily

Automated deployment @ 2026-06-18 22:58:32 Asia/Taipei

Contributions are welcome! Add your topics and keywords to topic.yml. Historical data is also available in the storage.

AI

Knowledge Graphs

Publish Date	Title	Authors	Homepage	Code
2026-06-17	Structured Inference with Large Language Gibbs	Sanghyeok Choi et.al.	2606.19264v1	null
2026-06-17	The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot	Anselm Haak et.al.	2606.19197v1	null
2026-06-17	Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection	Jinhan Li et.al.	2606.19168v1	null
2026-06-17	Essential Subspace Merging for Multi-Task Learning	Longhua Li et.al.	2606.19164v1	null
2026-06-17	IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages	Sakshi Joshi et.al.	2606.19157v1	null
2026-06-17	Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation	Ramza Basharat et.al.	2606.19139v1	null
2026-06-17	Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening	Kasper Helverskov Petersen et.al.	2606.19133v1	null
2026-06-17	Towards an Agent-First Web: Redesigning the Web for AI Agents	Eranga Bandara et.al.	2606.19116v1	null
2026-06-17	Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science	Qiuyu Fang et.al.	2606.19051v1	null
2026-06-17	Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration	Xhevahire Tërnava et.al.	2606.19042v1	null
2026-06-17	Sumi: Open Uniform Diffusion Language Model from Scratch	Mengyu Ye et.al.	2606.19005v1	null
2026-06-17	GraphPO: Graph-based Policy Optimization for Reasoning Models	Yuliang Zhan et.al.	2606.18954v1	null
2026-06-17	SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents	Jingkun Luo et.al.	2606.18946v1	null
2026-06-17	Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking	Pierre Dantas et.al.	2606.18941v1	null
2026-06-17	Efficient Financial Language Understanding via Distillation with Synthetic Data	Wen-Fong et.al.	2606.18875v1	null
2026-06-17	Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness	Zijian Wang et.al.	2606.18874v1	null
2026-06-17	URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification	Xinze Zhang et.al.	2606.18861v1	null
2026-06-17	ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement	Bohou Zhang et.al.	2606.18850v1	null
2026-06-17	Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems	Hehai Lin et.al.	2606.18837v1	null
2026-06-17	Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration	Taewoon Kim et.al.	2606.18836v1	null
2026-06-17	Reinforcement Learning Foundation Models Should Already Be A Thing	Abdelrahman Zighem et.al.	2606.18812v1	null
2026-06-17	Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards	Yingyu Shan et.al.	2606.18810v1	null
2026-06-17	ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch	Tengfei Lyu et.al.	2606.18803v1	null
2026-06-17	Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports	Qingyu Lu et.al.	2606.18797v1	null
2026-06-17	SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction	Quanjiang Guo et.al.	2606.18780v1	null
2026-06-17	PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding	Jihyung Park et.al.	2606.18624v1	null
2026-06-17	Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance	Tianming Du et.al.	2606.18613v1	null
2026-06-17	Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting	Hao-Yuan Ma et.al.	2606.18566v1	null
2026-06-17	DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models	Patrick Cooper et.al.	2606.18557v1	null
2026-06-16	PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning	Bo Su et.al.	2606.18473v1	null
2026-06-16	TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network	Rohit Tewari et.al.	2606.18444v1	null
2026-06-16	RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation	Renzhi Wu et.al.	2606.18379v1	null
2026-06-16	EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation	Qi Chai et.al.	2606.18235v1	null
2026-06-16	Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses	Joy Bose et.al.	2606.18222v1	null
2026-06-16	Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients	Byung-Kwan Lee et.al.	2606.18216v1	null
2026-06-16	Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0	Diaa Fayed et.al.	2606.18205v1	null
2026-06-16	The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data	Nick Bettencourt et.al.	2606.18192v2	null
2026-06-16	Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure	Ziqi Zhou et.al.	2606.18154v1	null
2026-06-16	WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning	Yuwei Zhang et.al.	2606.18147v1	null
2026-06-16	Knowledge Reutilization in Meta-Reinforcement Learning	Yuan Meng et.al.	2606.18132v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models	Ramprasath Ganesaraja et.al.	2606.18114v1	null
2026-06-16	S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices	Marco Deano et.al.	2606.18096v1	null
2026-06-16	EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning	Wanhao Niu et.al.	2606.18092v1	null
2026-06-16	A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation	Haoyang Zhong et.al.	2606.18075v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond	Hobin Kim et.al.	2606.18062v1	null
2026-06-16	C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift	Davide Domini et.al.	2606.18003v1	null
2026-06-16	Dimensionality Controls When Modularity Helps in Continual Learning	Kathrin Korte et.al.	2606.17889v1	null
2026-06-16	AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources	Dalia Ali et.al.	2606.17887v1	null
2026-06-16	FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow	Bihao Zhan et.al.	2606.17856v1	null
2026-06-16	DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL	Esteban Schafir et.al.	2606.17821v1	null
2026-06-16	A Framework for Evaluating Agentic Skills at Scale	Maksim Shaposhnikov et.al.	2606.17819v1	null
2026-06-16	Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems	Xinru Liu et.al.	2606.18310v1	null
2026-06-16	LLMs Infer Cultural Context but Fail to Apply It When Responding	Yisong Miao et.al.	2606.17688v1	null
2026-06-16	SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector	Jingyuan Zhang et.al.	2606.18309v1	null
2026-06-16	Handling Feature Heterogeneity with Learnable Graph Patches	Yifei Sun et.al.	2606.17667v1	null
2026-06-16	SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches	Wencan Zhang et.al.	2606.17646v1	null
2026-06-16	Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification	Yiyue Qian et.al.	2606.17637v1	null
2026-06-16	Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs	Dong Huang et.al.	2606.17634v1	null
2026-06-16	OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation	Guibin Zhang et.al.	2606.17628v1	null
2026-06-16	Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning	Yanwei Cui et.al.	2606.17591v1	null
2026-06-16	Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow	Osamu Ito et.al.	2606.17577v1	null
2026-06-16	An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts	Joseph Walusimbi et.al.	2606.17555v1	null
2026-06-16	FoundCause: Causal Discovery with Latent Confounders from Observational Data	Patrick Blöbaum et.al.	2606.17516v1	null
2026-06-16	Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement	Ramaravind Kommiya Mothilal et.al.	2606.17506v1	null
2026-06-16	AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows	Jiahui Niu et.al.	2606.17474v1	null
2026-06-16	Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation	Yuyang Dai et.al.	2606.17459v1	null
2026-06-16	Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos	Bo Gou et.al.	2606.17437v1	null
2026-06-16	SoK: AI-Augmented Binary Reversing	Yujeong Kwon et.al.	2606.17398v1	null
2026-06-16	MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation	Casey Meisenzahl et.al.	2606.17379v1	null
2026-06-15	MemTrace: Probing What Final Accuracy Misses in Long-Term Memory	Xianxuan Long et.al.	2606.17328v1	null
2026-06-15	Nothing from Something: Can a Language Model Discover 0?	Phoebe Zeng et.al.	2606.17289v1	null
2026-06-15	Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering	Rohit Kundu et.al.	2606.17257v1	null
2026-06-15	Rift: A Conflict Signature for Deception in Language Models	Petr Nyoma et.al.	2606.17229v1	null
2026-06-15	When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval	Mingxu Tao et.al.	2606.17220v1	null
2026-06-15	Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming	Callum Barbour et.al.	2606.18293v1	null
2026-06-15	Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management	Mohamed Essam et.al.	2606.17203v1	null
2026-06-15	Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation	Prabhjot Singh et.al.	2606.17188v2	null
2026-06-15	RepSelect: Robust LLM Unlearning via Representation Selectivity	Filip Sondej et.al.	2606.17168v1	null
2026-06-15	A Causal Model of Theory of Mind in Conflict for Artificial Intelligence	Nikolos Gurney et.al.	2606.16944v1	null
2026-06-15	RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting	Arunkumar V et.al.	2606.16925v1	null
2026-06-15	LESS Is More: Mutual-Stability Sampling for Diffusion Language Models	Amr Mohamed et.al.	2606.16908v1	null
2026-06-15	Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier	Keizo Kato et.al.	2606.16811v1	null
2026-06-15	OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models	Tianyi Lin et.al.	2606.16774v1	null
2026-06-15	Misinformation Propagation in Benign Multi-Agent Systems	Jonas Becker et.al.	2606.16710v1	null
2026-06-15	User as Code: Executable Memory for Personalized Agents	Bojie Li et.al.	2606.16707v1	null
2026-06-15	Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis	Jinghan Wang et.al.	2606.16684v1	null
2026-06-15	The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies	Christopner Koch et.al.	2606.16649v1	null
2026-06-15	DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation	Xifeng Xue et.al.	2606.16633v1	null
2026-06-15	Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI	Mohammed Amine Mouhoub et.al.	2606.16629v1	null
2026-06-15	VeriGraph: Towards Verifiable Data-Analytic Agents	Jiajie Jin et.al.	2606.16603v1	null
2026-06-15	SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents	Qiao Xiao et.al.	2606.16591v2	null
2026-06-15	Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict	Sozan Sulaiman Maghdid et.al.	2606.17119v1	null
2026-06-15	Kairos: A Native World Model Stack for Physical AI	Kairos Team et.al.	2606.16533v2	null
2026-06-15	SkillWiki: A Living Knowledge Infrastructure for Agent Skills	Dingcheng Huang et.al.	2606.16523v1	null
2026-06-15	Model Graph Inductive Learning for Knowledge Graph Completion	Mohommad Esmaei Khani et.al.	2606.16509v1	null
2026-06-15	REFLEX: Reflective Evolution from LLM Experience	Pan Wang et.al.	2606.16496v1	null
2026-06-15	Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering	Jieyuan Liu et.al.	2606.16494v1	null
2026-06-15	Unified Multimodal Model for Brain MRI Imputation and Understanding	Zhiyun Song et.al.	2606.16484v1	null

Abstracts

Structured Inference with Large Language Gibbs

2606.19264v1 by Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

摘要：大型語言模型（LLMs）中編碼的知識可以作為對描述複雜世界的變量進行結構化推理的基礎，但以概率一致的方式訪問這些知識則構成了一個困難的推理問題。我們提出了大型語言吉布斯（Large Language Gibbs），這是一種結構化概率推理的方案，利用LLM的條件分佈作為轉移運算符。我們不是通過單次自回歸生成來抽樣結構化對象，而是使用LLM的下一個標記條件，迭代地重新抽樣基於其他變量的單個變量。這種方法避免了依賴順序的偏見，並產生了一個穩定的分佈，反映了所有局部條件的妥協。我們將這種方法應用於從合成分佈中抽樣、一致性推理任務和貝葉斯結構學習。結果表明，在通過噪聲LLM條件可訪問的世界先驗下，使用LLM條件在MCMC中是一種結構化概率推理的實用替代方案，而非單次生成。
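The resampling loop described in this abstract can be illustrated with a minimal random-scan Gibbs sampler. This is only a sketch under stated assumptions: `toy_conditional` is a hypothetical stand-in for an LLM's next-token conditional, the two-variable `JOINT` table is invented for illustration, and nothing here reproduces the authors' implementation.

```python
import random
from collections import Counter

# Invented toy joint over two binary variables (illustration only).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def toy_conditional(index, state):
    """Hypothetical stand-in for an LLM conditional.

    Returns P(state[index] = 1 | the other variable), derived from JOINT.
    A real system would query a language model here instead.
    """
    with_one = tuple(1 if i == index else v for i, v in enumerate(state))
    with_zero = tuple(0 if i == index else v for i, v in enumerate(state))
    p1, p0 = JOINT[with_one], JOINT[with_zero]
    return p1 / (p0 + p1)


def gibbs_sample(conditional, n_vars, n_steps, burn_in, seed=0):
    """Resample one variable at a time, conditioned on the current others."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_vars)]
    counts = Counter()
    for step in range(n_steps):
        index = rng.randrange(n_vars)  # random-scan: pick a variable uniformly
        state[index] = 1 if rng.random() < conditional(index, state) else 0
        if step >= burn_in:  # discard early samples before counting
            counts[tuple(state)] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


estimate = gibbs_sample(toy_conditional, n_vars=2, n_steps=50_000, burn_in=1_000)
```

Because the toy conditionals are exactly consistent with `JOINT`, the empirical frequencies in `estimate` should approach that table. With noisy, mutually inconsistent conditionals (the setting the abstract targets), the chain would instead settle on a compromise distribution.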

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

2606.19197v1 by Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

Abduction is a central approach to explaining missing entailments from a knowledge base: it provides a hypothesis that, if added to the knowledge base, would make the missing entailment hold. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature restrictions and minimality in size and in the conflicts introduced. Naturally, hypotheses that satisfy more than one of these properties, or that combine a property with an optimality criterion, would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for EL_bot under brave and AR semantics. Our main observation is that requiring additional properties of hypotheses often does not increase complexity.

摘要：誘導推理是一種中心方法，用於解釋知識庫中缺失的推論，通過提供一個假設，如果將其添加到知識庫中，將使缺失的推論變得成立。最近，修復語義下的誘導推理已被詳細研究，其中考慮了幾個理想的特性和最佳性標準，例如簽名限制和引入衝突的最小化。自然地，滿足多個這些特性或將一個特性與最佳性標準結合的假設對於應用來說會更加理想。到目前為止，文獻中尚未對此類假設進行研究。在本篇論文中，我們考慮了針對滿足多個特性或額外最佳性標準的假設的 ABox 誘導推理問題，針對 EL_bot 在勇敢和 AR 語義下的情況。我們的主要觀察是，通常要求假設具備額外的特性並不會導致複雜性的增加。

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

2606.19168v1 by Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

摘要：為了實現大型語言模型（LLMs）的更深層安全對齊，最近的研究努力探討如何將安全干預措施提前到預訓練階段，主要是通過過濾不安全數據或將其重寫為更安全的形式。我們認為，預訓練階段的對齊應該超越僅僅使數據安全：LLMs 可能會將看似無害的知識和能力組合成不安全的行為。為此，我們提出了安全反思預訓練，這是一種預訓練階段的對齊方法，定期將短暫的安全反思插入預訓練語料庫中，以將自我監控直接整合到語言建模中，建立一種基礎能力，隨後通過兼容的後訓練進行強化。我們對在 FineWeb-Edu 上預訓練的 1.7B 模型的實驗顯示，安全反思預訓練提高了安全分類準確性，並大幅降低了推理階段和微調攻擊的成功率。除了我們的實際世界實驗外，我們還介紹了一個完全受控的合成環境 MedSafetyWorld，該環境對安全有明確的定義，並具有一個推理結構，使模型能夠輕鬆地從安全數據中概括不安全的行為。在 MedSafetyWorld 中的消融實驗進一步證明了安全反思預訓練在防止模型基於安全數據概括不安全行為方面的明顯優勢，相較於數據過濾和重寫。綜合來看，我們的研究結果表明，預訓練對齊不僅應使訓練數據安全，還應塑造模型可能從安全數據中獲得的行為。

Essential Subspace Merging for Multi-Task Learning

2606.19164v1 by Longhua Li, Lei Qi, Xin Geng, Qi Tian

Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

摘要：模型合併旨在通過將從相同預訓練檢查點微調的多個模型的能力整合到一個模型中，以實現多任務學習。其核心挑戰是任務特定參數更新之間的相互干擾。在本文中，我們分析了由任務更新引起的輸出變化，並觀察到它們的能量集中在少數主要方向上。我們稱這些方向所跨越的子空間為基本子空間。相對而言，大多數剩餘方向攜帶的任務相關能量很少，但它們在多次任務更新中的累積可能會在合併過程中造成嚴重的干擾。受到這一觀察的啟發，我們提出了基本子空間分解（ESD），該方法根據其激活變化的主成分分解每個任務更新。基於ESD，我們引入了基本子空間合併（ESM），這是一種無需訓練的靜態合併方法，能夠將基本組件正交化並融合成一個緊湊的多任務模型。我們進一步將ESM擴展為ESM++，這是一種無需訓練的動態合併方法，能夠將任務特定的殘差分解為低秩專家，並通過基於原型的路由在前向推理過程中選擇最相關的專家。在多個任務集和模型規模上的大量實驗表明，ESM和ESM++有效地保留了任務知識，同時減少了任務之間的干擾。

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

2606.19157v1 by Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

摘要：AudioLLMs 使得語音識別能夠基於文本提示進行，例如領域描述或實體列表。然而，目前尚不清楚這些模型是否真正利用了這些上下文，還是依賴於在預訓練期間學到的參數知識。現有的基準無法回答這個問題，因為它們在固定的提示條件下評估轉錄，並且很少包含明確的上下文輸入。我們介紹了 IndicContextEval，一個涵蓋 555 位講者的 56 小時多語言自然語音基準，涉及 8 種印度語言和 23 個專業領域。我們設計了一個 7 級提示框架，逐步引入上下文信號，包括元數據、自然語言描述、英語和母語的實體列表，以及包含不正確實體的對抗性提示。對五個模型的評估顯示了上下文利用行為的顯著差異，突顯了對 AudioLLMs 中上下文基礎的明確評估的需求。

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

2606.19139v1 by Ramza Basharat, Muhammad Usman Ali

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity increases further when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research on Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag in research is primarily due to the unique challenges posed by the script and to the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text-line dataset specifically curated from materials written by Katibs in historical times. It encompasses a diverse range of flat-nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

摘要：自動手寫文字識別（HTR）本質上是一項具有挑戰性的任務，當處理草寫字體時，其複雜性進一步增加。儘管對各種草寫字體已經做出了重大努力，但對烏爾都手寫文字識別（UHTR）的研究相對有限。這一研究滯後主要是由於其字體所帶來的獨特挑戰，以及基準數據集的稀缺和不可用性。因此，為了推進UHTR的研究，本研究提出了一個名為烏爾都Katib手寫數據集（UKHD）的專門真實數據集。據我們所知，這是第一個專門從歷史時期Katib所寫材料中策劃的離線烏爾都手寫文字行數據集。它涵蓋了Nastalique書法風格中各種平尖筆書寫變體。此外，還評估了不同基於CRNN的混合模型的有效性，以確定烏爾都Katib手寫識別（UKHR）的最佳架構。在分析的模型中，CNN-BGRU-CTC模型顯示出更穩健的性能，具有較低的字符錯誤率（CER）和單詞錯誤率（WER）。本研究旨在支持和鼓勵研究社群開發一個穩健的識別系統，以保護烏爾都手寫文學。
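The two metrics named in this abstract, CER and WER, are conventionally computed as the Levenshtein edit distance between hypothesis and reference, divided by the reference length (in characters or words). The sketch below follows that standard definition; it assumes whitespace tokenization for words and is not taken from the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate, assuming whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so they are error rates rather than bounded accuracies.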

Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

2606.19133v1 by Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn, Kristian S. Thygesen, Mikkel N. Schmidt

Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed from lower levels of theory or rely on rotation-invariant scalar features, limiting their geometric expressiveness. We explore the use of equivariant graph neural networks for optical spectra prediction, adapting GotenNet to this task and evaluating it on multiple datasets including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and on predicting the static real permittivity, both of particular relevance for thin-film optics.

摘要：可擴展的光譜預測是高通量材料篩選在光電應用（如太陽能電池）中的一個關鍵組成部分。現有的替代模型是基於從較低理論層次計算的光譜進行訓練，或依賴於旋轉不變的標量特徵，這限制了它們的幾何表達能力。我們探索了使用等變圖神經網絡進行光譜預測，將 GotenNet 調整為此任務，並在多個數據集上進行評估，包括最近發表的 10,533 個結構的集合，這些結構的光譜是基於隨機相位近似（RPA）計算的。所提出的模型在當前的最先進技術中表現優越，尤其是在 0-8 eV 範圍內以及預測靜態實部許可率方面，這兩者對於薄膜光學特別相關。

Towards an Agent-First Web: Redesigning the Web for AI Agents

2606.19116v1 by Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This assumption permeates every layer: the web's access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and by agent identification metadata in HTTP requests (analogous to browser headers), alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

摘要：全球資訊網建立在一個持續三十年的假設上：網路內容的主要消費者是人類。這一假設滲透到每一層；其訪問模型假定有人的訪客，其經濟學依賴於人類的注意力，而其內容則針對人類的感知。人工智慧代理作為人類與網路內容之間的中介的快速出現使這一假設失效。然而，網路通過全面封鎖、基於 CAPTCHA 的排除以及將代理訪問視為提取而非合法互動的經濟模型來抵抗代理。

本文提出在三個層面上進行原則性的重新設計。在訪問層，代表人類行動的代理應該繼承相應的訪問權限，這些權限由 HTTP 請求中的速率限制和代理識別元數據管理，類似於瀏覽器標頭，並且採用雙層架構，從同一域提供人類可讀和代理優化的內容。在經濟層面，我們提出一個基於意圖的層級框架，這一框架以代理作為人類代理的原則為基礎：代理的經濟責任反映其所代表的人類的責任。一種基於代幣的訂閱模型以代幣而非頁面瀏覽量來計量內容，並且一個委託內容經濟將 AI 內容生產與人類意圖相連接。在內容層，我們識別出認識論的遞歸，即 AI 生成的內容被代理消費以產生進一步的內容，逐步使網路知識脫離人類的真實基礎。我們提出了代理文本標記語言（ATML）、一種四層人類監督層級模型，以及一個加密來源鏈，以應對這一威脅。

這些共同構成了十項設計原則，旨在打造一個以代理為先的互聯網，在這個互聯網中，代理是第一公民，其整合需要重新協商網路的基礎社會契約，涵蓋訪問、經濟和內容。
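The access-layer idea in this abstract (rate limiting plus agent identification metadata in HTTP requests) can be sketched with a token-bucket limiter and a header builder. Everything below is an assumption for illustration: the token bucket is a common rate-limiting technique, not necessarily the one the paper proposes, and the `X-Agent-On-Behalf-Of` header name is hypothetical. Only `User-Agent` is a standard HTTP header.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token-bucket rate limiter with an explicit clock for determinism."""
    capacity: float
    refill_per_second: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def agent_headers(agent_name: str, principal_id: str) -> dict:
    """Build request headers that identify the agent and the human it acts for.

    "X-Agent-On-Behalf-Of" is a hypothetical header invented for this sketch.
    """
    return {
        "User-Agent": agent_name,
        "X-Agent-On-Behalf-Of": principal_id,
    }


# Usage: a bucket that starts full and admits bursts of two requests.
bucket = TokenBucket(capacity=2, refill_per_second=1, tokens=2)
decisions = [bucket.allow(now=0.0), bucket.allow(now=0.0),
             bucket.allow(now=0.0), bucket.allow(now=1.0)]
```

Passing `now` explicitly keeps the limiter testable; a server would typically supply a monotonic clock reading and keep one bucket per identified agent.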

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

2606.19051v1 by Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy that partitions the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

摘要：研究方法是學術論文中知識貢獻的重要載體。研究方法的自動多標籤分類可以支持知識服務，如方法檢索、評論生成和研究智能分析。雖然現有研究主要依賴標題和摘要，但摘要通常只提供有限的方法論信息，而利用全文內容則面臨過長和信息冗餘的挑戰。因此，本文提出了一種通過根據其物理位置劃分全文內容的段落組合策略。使用來自三本代表性圖書館與信息科學期刊（JASIST、LISR 和 JDoc）的 1,954 篇全文文章的註釋語料庫，我們評估了各種段落及其組合在多個模型中的分類性能。實驗結果表明，方法論信息在全文內容中分佈不均，中後段和最後段表現出更強的區分能力。此外，將書目元數據與跨段組合策略整合有效提升了分類性能。
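Partitioning a full text by physical position and then combining selected segments might look like the sketch below. The number of segments, the token-level granularity, and both function names are assumptions made for illustration; the abstract does not state how the authors segment their corpus.

```python
def split_by_position(tokens, n_segments):
    """Split a token list into n contiguous, near-equal segments by position."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    size, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional token each.
        end = start + size + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


def combine(segments, indices):
    """Concatenate the chosen segments, in the order the indices are given."""
    return [tok for i in indices for tok in segments[i]]


# Usage: keep only the last two of four segments as classifier input.
segments = split_by_position(list(range(10)), n_segments=4)
late_input = combine(segments, [2, 3])
```

Selecting by index makes it easy to compare, for example, early-only against late-only inputs under the same length budget.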

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

2606.19042v1 by Xhevahire Tërnava

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability bound at compile time or runtime. All variability decisions are resolved at a single new binding time, generation time: the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, dead-code-free binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.

摘要：在氛圍編碼這一新興的 AI 驅動範式中，一個 LLM 從自然語言提示生成整個程序，但傳統軟體工程精心構建到代碼中的變異性會發生什麼呢？為了回答這個問題，我們對 10 個氛圍編碼的 C/C++ 項目進行了探索性分析，結果顯示在工件內幾乎沒有變異性，即在編譯和運行時。所有變異性決策都在單一的新綁定時間解決，即生成時間，也就是 LLM 產生源代碼的那一刻。我們並不將此視為需要修復的缺陷，而是提出了再生變異性（Variability by Regeneration，VbR），據我們所知，這是第一個產品線方法，其中 LLM 作為推導引擎，根據聲明性規範為每個變體生成一個專門構建、沒有死代碼的二進位檔，而變體調度器則透明地將用戶請求路由到匹配的二進位檔。我們對 VbR 進行了形式化，並將其與傳統的 SPL 推導進行對比，並在 wc 產品系列上展示了其完整的管道。對於 SPL 工程而言，AI 生成軟體中的變異性應該在規範中，而不是在代碼中。
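The variant dispatcher mentioned in this abstract can be pictured as a lookup from requested features to a pre-generated binary. The sketch below is hypothetical throughout: the variant table, the binary names, and the `dispatch` function are invented for illustration, and only the `-l`, `-w`, and `-c` flags come from the real `wc` utility.

```python
# Hypothetical variant table: each feature set maps to its own generated binary.
VARIANTS = {
    frozenset({"lines"}): "wc_lines",
    frozenset({"words"}): "wc_words",
    frozenset({"lines", "words"}): "wc_lines_words",
}

# Flags of the real wc utility, mapped to illustrative feature names.
FLAG_TO_FEATURE = {"-l": "lines", "-w": "words", "-c": "bytes"}


def dispatch(argv):
    """Return the command line for the variant matching the requested flags.

    Raises LookupError when no binary exists for the feature combination;
    in a regeneration workflow that would be the cue to generate a new variant.
    """
    features = frozenset(FLAG_TO_FEATURE[a] for a in argv if a in FLAG_TO_FEATURE)
    operands = [a for a in argv if a not in FLAG_TO_FEATURE]
    binary = VARIANTS.get(features)
    if binary is None:
        raise LookupError(f"no variant generated for features {sorted(features)}")
    return [binary, *operands]


command = dispatch(["-l", "-w", "notes.txt"])
```

Each binary carries no flag-handling code of its own, so the user-facing interface stays the same while variability lives entirely in the table, which would be derived from the specification.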

Sumi: Open Uniform Diffusion Language Model from Scratch

2606.19005v1 by Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

摘要：擴散模型已成為自回歸模型的一個有前途的替代方案。在這些模型中，均勻擴散語言模型（UDLMs）原則上允許在任何步驟更新任何標記，從而實現更靈活的生成。然而，目前尚未有任何UDLM從零開始在大參數規模和大標記預算下進行預訓練。自回歸建模和遮罩擴散建模已經有可用的模型在規模上供社群研究和構建；而均勻擴散則沒有。大規模的從零開始預訓練的UDLM將提供一個乾淨的參考點，以研究擴展行為、生成動態、可控性，以及與已建立的自回歸和遮罩擴散模型之間的權衡。為此，我們介紹Sumi（在日語中意為“墨水”），這是一個完全開放的7B均勻擴散語言模型，從零開始在1.5T標記上進行預訓練。Sumi在知識、推理和編碼基準上與在相似標記預算下訓練的自回歸模型表現競爭，但在常識基準上表現不佳，其中我們以教育為重的數據混合可能是主要原因。我們釋出我們的模型權重、檢查點和完整的訓練配方，包括對公開可用語料庫的數據混合的完整規範。我們希望這次釋出能使社群能夠在大規模上研究原生均勻擴散，並促進對其尚未充分理解的方面的研究。
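
As a rough illustration of why uniform diffusion lets any token change at any step, the sketch below shows a generic uniform-noise corruption step: each position is resampled from the whole vocabulary with probability `t`. This is a textbook-style sketch, not Sumi's training code; the function name and the use of a scalar noise level `t` are assumptions.

```python
import random
from typing import List, Sequence


def uniform_corrupt(tokens: Sequence[int], vocab_size: int, t: float,
                    rng: random.Random) -> List[int]:
    """Resample each position uniformly from the vocabulary with probability t.

    Unlike masked diffusion, no special [MASK] symbol is used: a corrupted
    position simply holds some other (or, by chance, the same) real token.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [rng.randrange(vocab_size) if rng.random() < t else tok
            for tok in tokens]
```

At `t = 0` the sequence is returned unchanged and at `t = 1` every position is pure noise; a denoiser trained on such inputs has to decide for every position whether the current token is wrong.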

GraphPO: Graph-based Policy Optimization for Reasoning Models

2606.18954v1 by Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore this dispersion of similar states across branches and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

摘要：強化學習與可驗證獎勵（RLVR）已成為提升大型推理模型能力的標準範式。RLVR 通常獨立抽樣回應並利用最終答案來優化策略。這一範式有兩個限制。首先，獨立的回應往往包含相似的中間推理步驟，導致冗餘探索和計算浪費。其次，稀疏的最終答案獎勵使得識別有用步驟變得困難。基於樹的方法部分解決了這個問題，通過共享前綴並比較來自同一前綴的分支來提供細粒度的信號。然而，樹的分支仍然是獨立擴展的。當不同的分支達到相似的推理狀態時，它們無法共享信息並重複相似的探索。此外，基於樹的方法忽略了這種分散，只在不同的分支內進行局部比較，這可能導致優勢估計的方差增高。為了解決這一挑戰，我們提出了 GraphPO（基於圖的策略優化），這是一個新穎的強化學習框架，將回合表示為有向無環圖，推理步驟作為邊，從推理路徑總結的語義狀態作為節點。GraphPO 將語義上等價的推理路徑合併為等價類，允許它們共享後綴，並將預算從冗餘擴展重新分配到多樣化探索上。此外，我們將效率優勢分配給進入邊，將正確性優勢分配給輸出邊，從而在從結果中推導過程監督的同時提高推理效率。理論表明，GraphPO 減少了優勢估計的方差並增強了推理效率。在三個 LLM 的推理和代理搜索基準上進行的實驗顯示，GraphPO 在相同的標記預算或回應預算下，始終優於基於鏈和樹的基準。
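
The merging idea can be sketched as follows, assuming every reasoning step already comes with a summary string of the state it reaches and that equal summaries mean semantically equivalent states. That equality test is a stand-in for whatever equivalence check GraphPO actually uses; all names below are hypothetical.

```python
from typing import List, Set, Tuple

# One rollout: a list of (reasoning step, summary of the state it reaches).
Rollout = List[Tuple[str, str]]
Edge = Tuple[str, str, str]  # (source state, step, destination state)


def build_graph(rollouts: List[Rollout],
                root: str = "START") -> Tuple[Set[str], Set[Edge]]:
    """Merge rollouts into one graph: states are nodes, steps are edges.

    Paths that reach the same state summary collapse onto one node, so any
    suffix after that node is stored (and could be expanded) only once.
    Acyclicity is not enforced here; it is assumed to hold for the input.
    """
    edges: Set[Edge] = set()
    for rollout in rollouts:
        src = root
        for step, state in rollout:
            edges.add((src, step, state))
            src = state
    nodes = {root} | {dst for _, _, dst in edges}
    return nodes, edges


r1 = [("subtract 3", "2x=4"), ("divide by 2", "x=2"), ("report", "answer=2")]
r2 = [("halve both sides", "x+1.5=3.5"), ("subtract 1.5", "x=2"),
      ("report", "answer=2")]
nodes, edges = build_graph([r1, r2])
```

A tree over the same two rollouts would need seven nodes and six edges; after merging at `x=2` the graph has five nodes and five edges, because the shared `report` suffix is stored once.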

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

2606.18946v1 by Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

摘要：句子級別的 AI 生成文本檢測 (S-AGTD) 針對混合文檔，即人類和大型語言模型共同創作的文本，面臨兩個缺口：現有方法將每個句子孤立分類，忽略了句子之間的依賴關係，而現有基準則省略了最新一代生成器。我們構建了 MOSAIC，一個包含 16,000 篇混合文檔的基準，這些文檔來自 PubMed 和 XSum，由 DeepSeek-V3.2 和 Kimi K2 在嚴格的質量控制下生成，包括一個在先前基準中缺失的困惑度一致性過濾器。我們將 S-AGTD 重新構建為對文檔句子序列的結構化預測，並將其具體化為 SenFlow，將基於圖的句子間傳播與線性鏈 CRF 解碼整合在單個文檔級別的句子圖上進行處理。SenFlow 在 MOSAIC 上達到了最先進的性能，在跨域轉移的三個難度逐漸增加的協議中，平均 Macro-F1 邊際提高了 +4.15 個百分點。我們進一步發現，即使在困惑度過濾器平衡了明顯的線索後，AI 插入仍然保留了一個依賴於生成器的句子長度差距，而句子級別的檢測器仍然可以利用這一點。代碼和數據：https://github.com/luojingkun22/SenFlow
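
Linear-chain CRF decoding, mentioned above, is usually done with the Viterbi algorithm. The sketch below is a generic Viterbi decoder over per-sentence label scores, not SenFlow's implementation; the score values in the usage example are invented, with label 0 read as "human" and label 1 as "AI".

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for one document.

    emissions[i][y]   : score of label y for sentence i
    transitions[a][b] : score of label a being followed by label b
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for b in range(n_labels):
            # Best previous label for arriving at label b.
            best_a = max(range(n_labels),
                         key=lambda a: score[a] + transitions[a][b])
            new_score.append(score[best_a] + transitions[best_a][b] + emit[b])
            ptr.append(best_a)
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]
labels = viterbi(emissions, transitions)
```

Classifying each sentence in isolation would label the middle sentence as AI (0.6 > 0.4); decoding the sequence jointly keeps it human because the transition scores penalise an isolated label flip.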

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

2606.18941v1 by Pierre Dantas, Lucas Cordeiro, Waldir Junior

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

摘要：PLCopen XML 定義了兩種 IEC 61131-3 梯形圖程序的編碼格式：一種是使用 <rung> 元素的文本編碼，另一種是將梯級邏輯表示為本地 ID/refLocalId 連接的有向圖的圖形編碼。ESBMC-PLC 支持文本格式，但將來自 CONTROLLINO、Beremiz 和 OpenPLC 編輯器的圖形導出解析為空的 GOTO 中間表示，導致虛無的驗證成功。本文提出了 Graph-ESBMC-PLC，通過基於 DFS 的圖形 LD 解析器填補了這一空白。該解析器從左電源軌遍歷連接圖到每個線圈，將梯級路徑提取為布爾接觸聯接，並應用三層 I/O 推斷方案。按右電源軌的 connectionPointIn 順序排列線圈，確保 SET 線圈在 RESET 線圈之前處理，符合 IEC 掃描週期語義。圖形到 IR 的轉換保持 ESBMC 後端不變。對來自 CONTROLLINO/OpenPLC 編輯器的 3 個圖形 LD 程序的驗證顯示，所有程序都生成完整的 GOTO IR，具有非確定性輸入和梯級邏輯，而不是之前的空 IR。所有 3 個在 k=2 下的驗證時間小於 70 毫秒。11 個文本 LD 基準完全保留，沒有回歸。報告了兩個不含 LD 內容或不支持計時器語義的 Beremiz 示例作為發現的限制。文檔在 Zenodo 上（DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856）。
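
The path-extraction step can be sketched as a depth-first search, assuming the localId/refLocalId links have already been converted into a forward adjacency map from the left power rail toward the coils. Node ids, variable names, and the helper name are hypothetical; the actual resolver and its I/O inference tiers are not reproduced here.

```python
from typing import Dict, FrozenSet, List


def rung_paths(graph: Dict[str, List[str]], contacts: Dict[str, str],
               start: str, coil: str) -> List[List[str]]:
    """List the contact variables along every path from start to coil.

    Each inner list is one conjunction (AND) of contacts; the coil is
    energised if any path holds, i.e. the OR of the conjunctions.
    """
    paths: List[List[str]] = []

    def dfs(node: str, seen: List[str], visited: FrozenSet[str]) -> None:
        if node == coil:
            paths.append(seen)
            return
        for nxt in graph.get(node, []):
            if nxt in visited:  # guard against malformed cyclic exports
                continue
            label = contacts.get(nxt)
            dfs(nxt, seen + [label] if label else seen, visited | {nxt})

    dfs(start, [], frozenset({start}))
    return paths


# A start/hold latch with a stop contact, using made-up node ids.
graph = {"leftPowerRail": ["c1", "c2"], "c1": ["c3"], "c2": ["c3"],
         "c3": ["coil"]}
contacts = {"c1": "start", "c2": "hold", "c3": "not_stop"}
paths = rung_paths(graph, contacts, "leftPowerRail", "coil")
```

For this rung the result reads as `(start AND not_stop) OR (hold AND not_stop)`. An export in which no path reaches the coil yields an empty list, which is the kind of silent emptiness that leads to vacuous verification success.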

Efficient Financial Language Understanding via Distillation with Synthetic Data

2606.18875v1 by Wen-Fong Huang, Edwin Simpson

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

摘要：大型指令跟隨模型強大但部署成本高，特別是在金融領域，因為標註數據受到保密性和專家標註成本的限制。我們提出了一個通過合成數據進行金融情感分析的高效框架，將知識從大型指令調整的教師模型轉移到緊湊的學生模型。該框架設計用於低資源條件，其中一小組真實範例由人工收集和標註。然後，該框架對範例進行聚類，並利用這些聚類選擇種子，以通過結構化的少量提示生成合成範例。實驗表明，基於聚類的種子選擇比隨機抽樣產生更具代表性的合成數據，使緊湊模型在最小監督下實現強大性能。值得注意的是，在更複雜和噪聲較多的文本領域，基於完整合成種子語料庫訓練的緊湊模型甚至超越了教師模型，同時在正式文本上仍保持競爭力。該框架為在金融自然語言處理中以最小的人力標註努力實現資源高效的領域適應提供了一條實用的途徑。
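
One plausible reading of clustering-based seed selection is sketched below, assuming examples are already embedded as vectors and assigned to clusters, and that the seed for each cluster is the member nearest its centroid. The abstract does not say how seeds are picked within a cluster, so the nearest-to-centroid rule and all names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def select_seeds(vectors: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> Dict[int, int]:
    """Map each cluster id to the index of its most central example."""
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)

    seeds: Dict[int, int] = {}
    for lab, idxs in members.items():
        dim = len(vectors[idxs[0]])
        centroid = [sum(vectors[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)]
        # Squared Euclidean distance to the centroid decides the seed.
        seeds[lab] = min(
            idxs,
            key=lambda i: sum((vectors[i][d] - centroid[d]) ** 2
                              for d in range(dim)),
        )
    return seeds


vectors = [[0, 0], [1, 0], [2, 0], [10, 10], [10, 12], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
seeds = select_seeds(vectors, labels)
```

Compared with random sampling, this guarantees one seed per cluster, so rare regions of the data still contribute a few-shot example.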

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

2606.18874v1 by Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

摘要：AI 系統可以越來越多地自動化科學工作流程，但將先前證據、生成的想法、實驗和最終主張聯繫起來的推理通常仍隱含在模型推斷中。在此，我們介紹 Xcientist，一個將研究綜合和實驗驗證外部化為可檢查的、受合同約束的過程的研究工具。Xcientist 將文獻證據、想法狀態、實施計劃、消融記錄和修復痕跡組織為持久的研究文物，以便生成的機制可以在不失去其證據基礎的情況下進行基礎化、執行、測試和修訂。我們將主張漂移確定為自動化研究的一種失效模式，其中可運行的文物不再支持最初聲稱的機制。在無需訓練的記憶系統、圖結構的交通預測和多尺度物理知識驅動的神經網絡中，Xcientist 保留了從問題表述到機制設計、驗證和有限修訂的可追溯軌跡。這些結果表明，AI 科學家應該不僅根據他們的最終文物進行評估，還應根據他們的綜合和驗證過程是否保持可歸因、可檢查和科學負責進行評估。

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

2606.18861v1 by Xinze Zhang

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

摘要：重建可供模擬使用的關節物體數位雙胞胎，仍然受到兩個持續存在的缺口的限制：(i) 部件級幾何重建與運動參數估計相互解耦，以及 (ii) 恢復的模型經常違反基本的動態不變性，如能量守恆，導致在物理模擬器中重播 URDF 時出現漂移。我們提出了 KinemaForge，一個基於約束的管道，從短的 RGB-D 序列中共同推斷部件級形狀、關節拓撲和關節參數，並將結果與基於可微剛體動力學構建的能量一致性驗證器進行驗證。該管道引入了三個組件：一個運動約束圖，將關節-部件的關聯編碼為軟邊；一個可微的螺旋軸求解器，通過 Featherstone 的關節體算法從渲染的觀察結果反向傳播到關節參數；以及一個能量殘差損失，對重建模型的非物理自由反應進行懲罰。在五個 PartNet-Mobility 類別和一個內部 RGB-D 基準測試中，KinemaForge 將平均關節軸誤差從 4.52 度降低到 2.83 度（-37.4%），相較於最強的幾何基線（PARIS），以及從 5.30 度降低到 2.83 度（-46.6%），相較於基於互動的 Ditto 基線，並在 50 秒的滾動中將長期模擬漂移降低 64%（與 PARIS 相比），產生的 URDF 在我們的初步評估中，其閉環操作成功率比 Ditto 提高了 14.6 個百分點。代碼和重建數據將在接受後發布。
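
The relative reductions quoted above follow directly from the reported joint-axis errors; the short check below recomputes them from the figures in the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


vs_paris = round(relative_reduction(4.52, 2.83), 1)  # 37.4
vs_ditto = round(relative_reduction(5.30, 2.83), 1)  # 46.6
```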

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

2606.18850v1 by Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

摘要：抽象摘要在促進對科學文獻的有效理解中扮演著至關重要的角色，但它本質上需要語言流暢性和事實忠實性。現有的方法往往無法調和這兩個要求。抽取式方法依賴於僵化的句子拼接，這會破壞宏觀層面的邏輯一致性，而基於大型語言模型（LLM）的生成方法，儘管在語言流暢性上表現出色，但在事實一致性方面卻有限。在本研究中，我們提出了ScholarSum，一種層次反思圖形基礎框架，模擬學生-教師的寫作過程，以實現流暢且忠實的科學摘要。ScholarSum首先通過將文檔劃分為語義上連貫的單元，將其組織成層次知識圖，這些多層社群結構捕捉了全球邏輯和宏觀主題。在這一全球結構的指導下，學生生成初步草稿，然後通過細緻的證據檢索進行精煉。為了確保事實一致性，類似教師的審閱者隨後反覆檢查草稿，識別不支持的內容，並促使針對性的重新檢索和重寫，直到摘要達到嚴格的質量標準。大量實驗表明，ScholarSum在完整性和忠實性方面顯著超越了以往的基準。我們的代碼可在 https://github.com/Xiaoyu-Tao/ScholarSum 獲得。

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

2606.18837v1 by Hehai Lin, Qi Yang, Chengwei Qin

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

摘要：大型語言模型（LLM）基礎的自動多代理系統（MAS）生成已成為應對複雜任務的重要前沿。然而，現有方法面臨模型能力與經驗保留之間的困境。推理時的 MAS 利用凍結的前沿 LLM，但在沒有從過去經驗中學習的情況下重複相同的搜索。相反，訓練時的 MAS 通過梯度更新內化經驗，但受到較小模型低能力上限的限制，並且難以擴展到大型前沿 LLM。為了填補這一空白，我們提出了 Skill-MAS，一條新穎的第三條路徑，通過將高層次的編排能力概念化為可演變的元技能，將經驗保留與參數更新解耦。Skill-MAS 通過閉環優化循環來精煉這一架構知識：（1）多軌跡回放在當前元技能下為每個任務採樣行為分佈；（2）選擇性反思自適應地選擇優先任務，並應用分層對比分析將系統經驗提煉為可泛化的策略級原則。跨越四個複雜基準和四個不同 LLM 的廣泛實驗表明，Skill-MAS 不僅實現了顯著的性能提升，還保持了有利的成本性能權衡。進一步分析顯示，演變的元技能具有高度的穩健性，並在未見任務和不同 LLM 之間表現出強大的可轉移性。

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

2606.18836v1 by Taewoon Kim, Emma van Zoelen, Mark Neerincx

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

摘要：有效的人機團隊合作要求機器人從互動開始就能適應夥伴、情境和任務動態。在 MATRX 城市搜索與救援（USAR）環境中，人們可以通過聊天和反思介面外化他們在團隊合作中發現的合作模式（CPs）。我們研究機器人是否能利用這種先前的團隊經驗在未來的互動中成為更好的隊友。為此，我們將歷史 CPs 表示為知識圖譜的情節記憶，並使用圖表示學習與節點分類目標來識別可重用的代表性和有效記憶。然後，在新的合作情節開始之前，我們用這個記憶初始化機器人。在 20 位參與者和 160 次回合級觀察中，使用單一自動選擇的先前 CP 初始化機器人，使救援成功率從 25.7% 提高到 41.3%，並將平均任務時間減少 283 秒。最明顯的增益出現在互動的開始，這表明可重用的情節記憶可以幫助機器人以更有效的任務知識進入合作，並支持更順利的早期團隊合作。

Reinforcement Learning Foundation Models Should Already Be A Thing

2606.18812v1 by Abdelrahman Zighem, Jill-Jênn Vie

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

摘要：基於互聯網規模數據的語言和視覺基礎模型，而結構化領域（表格預測、時間序列預測、圖學習、強化學習）則不是。替代品是合成數據，這將負擔從收集轉移到先前設計。對於許多結構化任務，這樣的先驗已經存在：TabPFN及其後續版本使用在合成貝葉斯先驗上預訓練的Transformer來解決表格分類問題。
我們提出兩點。 \textbf{首先}，強化學習是明顯的缺口：對合成MDP的採樣與對合成表格數據集的採樣同樣可行，但沒有任何上下文強化學習工作將先前設計視為主要目標。 \textbf{其次}，MDP允許固定大小的充分統計量，與觀察到的情節無關且呈表格形狀，這使得它們直接適合用於表格基礎模型的基於注意力的架構，並用策略頭替代監督目標。這些共同定義了強化學習基礎模型的議程。
作為概念驗證，我們完全在合成MDP上訓練一個模型，並顯示在沒有特定任務調整的情況下，它能夠在上下文中解決保留的表格基準，無論是在線還是離線：在線時，所需的情節數量遠少於UCB-VI和表格Q學習；離線時，與VI-LCB競爭。
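
The "fixed-size sufficient statistic" point can be illustrated with transition counts and reward sums for a tabular MDP, the standard choice in this setting. Whether the paper uses exactly this statistic is an assumption; the function and variable names are hypothetical.

```python
from typing import Iterable, List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)


def sufficient_statistic(transitions: Iterable[Transition],
                         n_states: int, n_actions: int
                         ) -> Tuple[List[List[List[int]]], List[List[float]]]:
    """Summarise any amount of experience as two fixed-shape tables.

    counts[s][a][s2]  : how often (s, a) led to s2
    reward_sum[s][a]  : total reward collected after taking a in s
    """
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        counts[s][a][s_next] += 1
        reward_sum[s][a] += r
    return counts, reward_sum


data = [(0, 1, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 2.0, 1)]
counts, reward_sum = sufficient_statistic(data, n_states=2, n_actions=2)
```

The tables have shape S x A x S and S x A no matter how many episodes are observed, which is what makes them table-like inputs for an attention-based model.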

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

2606.18810v1 by Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.

摘要：強化學習與可驗證獎勵（RLVR）在訓練大型語言模型（LLMs）以解決推理任務方面推動了顯著的進展，但代表性的方法如 GRPO 對所有標記分配均勻的信用，浪費了在常規標記上的梯度，同時對關鍵推理步驟的信用評估不足。現有的標記級信用分配方法需要超出模型自身回合的資源。GRPO 的變體依賴於過程獎勵模型或真實答案。知識蒸餾通過每個標記的偏差分配信用，但需要外部教師（在政策蒸餾）或特權信息（在政策自我蒸餾）。然而，這些依賴限制了在純 RLVR 設定中的適用性。我們觀察到，將模型條件化於其自身的驗證軌跡會在原始分佈和條件分佈之間產生可測量的每標記 KL 散度，並證明從由驗證軌跡構建的自我教師中進行蒸餾會導致在存在多個驗證軌跡時無法實現的加權平均解。我們提出了 SC-GRPO（自我條件化 GRPO），它使用前面提到的 KL 散度作為 GRPO 梯度的乘法權重。在跨越數學、代碼和代理任務的五個基準測試中，SC-GRPO 始終比 GRPO 高出 8.1%，比 DAPO 高出 5.9%，並且在 OOD 性能上更強。此外，SC-GRPO 的性能高於 OPD。
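
A minimal sketch of per-token KL weighting follows, assuming a single sequence-level advantage and next-token distributions given as plain probability lists. The direction of the KL term and the lack of any normalisation are assumptions; the abstract states only that the divergence acts as a multiplicative weight.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_advantages(advantage: float,
                        original: Sequence[Sequence[float]],
                        conditioned: Sequence[Sequence[float]]) -> List[float]:
    """Scale one sequence-level advantage by a per-token KL weight.

    Tokens where self-conditioning barely changes the prediction get a
    weight near zero; tokens where it changes a lot get more credit.
    """
    return [advantage * kl_divergence(p, q)
            for p, q in zip(original, conditioned)]


original = [[0.5, 0.5], [0.9, 0.1]]
conditioned = [[0.5, 0.5], [0.1, 0.9]]
weights = weighted_advantages(1.0, original, conditioned)
```

Here the first token is unaffected by conditioning and receives zero credit, while the second, where the two distributions disagree sharply, carries the whole update.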

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

2606.18803v1 by Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

摘要：將大型語言模型（LLMs）引入工業乘車呼叫調度，作為平台規模行為日誌的語義特徵提取器，這是一個引人注目但尚未充分探索的數據系統問題。生產匹配管道仍然以結構化數值特徵為主導，但決定性的行為信號（例如，駕駛員對某些地區的習慣性厭惡）本質上是上下文相關的，並且自然可以表達為LLM生成的用戶檔案。然而，將這種檔案擴展到實時、毫秒延遲的調度器面臨著三個相互交織的約束，這些約束很少同時被解決：在一個每天有數百萬訂單的平台上，日誌的數據量超過任何LLM的上下文窗口幾個數量級；大多數用戶是長尾用戶，與每個用戶的互動次數太少，無法進行個別檔案分析；而表面流暢的檔案不一定能改善下游預測的效用。我們提出了ProfiLLM，一個自主的LLM數據管道，通過兩個模塊實現與效用對齊的用戶檔案分析，以支持生產匹配系統。（1）工具增強的全球知識挖掘為LLM代理配備了27個分析工具，以挖掘平台規模的數據，生成可重用的全球知識、自適應的用戶聚類規則和區域供需先驗。（2）與效用對齊的檔案探索為每個聚類生成多個候選檔案，通過輕量級的下游效用代理進行評估，迭代地精煉最佳候選檔案並構建DPO微調的偏好對。在滴滴的生產調度器上部署的ProfiLLM，在結果預測中實現了高達+6.14%的相對AUC改善，在調度模擬中實現了高達+4.35%的GMV增益，並在為期14天的在線A/B測試中持續改進，包括+0.47%的GMV、+0.33%的完成率和-0.82%的接受前取消率。
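
Building DPO preference pairs from utility-scored candidates could look like the sketch below, assuming each cluster has several candidate profiles already scored by the downstream utility proxy. Pairing the best against the worst candidate, the `margin` parameter, and the `prompt`/`chosen`/`rejected` field names are assumptions, not details from the paper.

```python
from typing import Dict, List, Tuple


def preference_pairs(scored: Dict[str, List[Tuple[str, float]]],
                     margin: float = 0.0) -> List[Dict[str, str]]:
    """Turn (profile, utility) candidates per cluster into preference pairs.

    A pair is kept only if the best profile beats the worst by more than
    `margin`, so near-ties do not produce noisy training signal.
    """
    pairs: List[Dict[str, str]] = []
    for cluster, candidates in scored.items():
        if len(candidates) < 2:
            continue
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        (best, hi), (worst, lo) = ranked[0], ranked[-1]
        if hi - lo > margin:
            pairs.append({"prompt": cluster, "chosen": best,
                          "rejected": worst})
    return pairs


scored = {
    "night-shift drivers": [("avoids airport runs", 0.71),
                            ("prefers short trips", 0.64)],
    "new users": [("no clear pattern", 0.50)],
}
pairs = preference_pairs(scored, margin=0.05)
```

Because the ranking comes from the utility proxy rather than from how fluent a profile reads, the fine-tuning signal points toward profiles that help downstream prediction.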

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

2606.18797v1 by Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

摘要：可靠的放射科報告評估需要嚴格的臨床準確性，因為遺漏關鍵發現或錯誤表徵放射影像觀察會直接影響病人護理。現有的指標通過將報告質量簡化為無醫學根據的標量來掩蓋這一要求。儘管大型語言模型（LLMs）擁有豐富的醫學知識，但它們同樣難以劃定臨床上重要錯誤與無害變異之間的可靠邊界。我們使用ReEvalMed基準作為測試平台來研究這一邊界，並從檢測真實臨床錯誤（“區分”）和容忍不重要變異（“穩健性”）的角度評估指標層級的臨床意義。在單通道和雙通道設置下的8個LLM評估者中，我們識別出廣泛的區分偏見：模型有效地檢測錯誤，但也過度懲罰無害的改述。為了減輕這一問題，我們合成了4k報告對並在Qwen3-8B和MedGemma-4B上訓練輕量級可解釋的指標。我們訓練的指標明確了臨床意義邊界，超越了32B規模的醫學LLMs，並與專有模型保持競爭力。關鍵是，更昂貴的雙通道設置未能持續改善整體性能，主要是在區分和穩健性之間進行了權衡。這些發現表明，單通道訓練的指標是成本敏感部署的實際選擇，而雙通道推斷則保留給D-R平衡至關重要的設置。我們將發布數據集和指標。

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

2606.18780v1 by Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

摘要：多模態信息提取（MIE）涵蓋了多模態命名實體識別（MNER）、關係提取（MRE）和事件提取（MEE）等任務，對於理解多媒體內容至關重要，但仍受到嚴重數據稀缺的限制。儘管數據增強是一種有前景的解決方案，但現有的方法受到粗糙的跨模態對齊和碎片化的任務特定設計的阻礙，未能充分利用共享的語義知識。為了克服這些限制，我們提出了語義錨點對齊的多模態增強（SAMA），這是一個統一的框架，用於生成高保真、任務感知的合成數據。SAMA 從真實標籤中構建結構化的語義錨點，以指導協作多專家多模態大型語言模型（CME-MLLM），該模型將共享語義的通用適配器與任務特定的適配器相結合，生成多樣但符合約束的文本樣本。對於圖像合成，SAMA 採用錨點保留擴散機制，使用錨點加權提示和潛在條件來保持關鍵的語義錨點，同時多樣化視覺上下文。為了消除手動驗證的需要，SAMA 進一步引入了一個雙約束過濾模塊，根據跨模態一致性和錨點保真度選擇合成樣本。在 MNER、MRE 和 MEE 的基準數據集上進行的廣泛實驗表明，SAMA 在完全監督和低資源設置下始終超越了最先進的增強基準，突顯了其多功能性、穩健性和有效性。
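
Dual-constraint filtering can be sketched as a conjunction of two thresholds, assuming each synthetic sample already carries a cross-modal consistency score and an anchor-fidelity score in [0, 1]. How SAMA computes these scores and which thresholds it uses are not stated in the abstract; the names and default values below are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    consistency: float  # text-image agreement score (assumed in [0, 1])
    fidelity: float     # share of semantic anchors preserved (assumed in [0, 1])


def dual_filter(samples: List[Sample],
                min_consistency: float = 0.5,
                min_fidelity: float = 0.5) -> List[Sample]:
    """Keep only samples that pass both constraints at once."""
    return [s for s in samples
            if s.consistency >= min_consistency and s.fidelity >= min_fidelity]


samples = [Sample("a", 0.9, 0.8), Sample("b", 0.9, 0.2), Sample("c", 0.3, 0.9)]
kept = dual_filter(samples)
```

Requiring both scores to pass, rather than averaging them, means a sample with a well-matched image but a lost entity label cannot slip through on the strength of one score.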

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

2606.18624v1 by Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

摘要：自然語言理解通常依賴於隱含的意義，而非明確陳述的內容，這需要實用的推理。儘管在數學和邏輯推理方面表現強勁，大型語言模型（LLMs）在進行實用推理時仍然面臨挑戰，經常選擇字面解釋。為了改善LLM的實用推理，我們引入了PragReST，一個自我監督的框架，該框架構建實用的問答數據，生成反事實推理痕跡，並通過監督微調和強化學習訓練模型內化這些痕跡，而無需人類標註的訓練數據或來自更強教師的蒸餾。在四個實用基準（PragMega、Ludwig、MetoQA和AltPrag）上，PragReST優於基礎模型、特定任務的實用調整基準和相同流程的非反事實變體。在基於準確性的基準上，PragReST在Qwen3-8B和Qwen3-14B上分別提高了5.37%和5.50%（絕對值）相較於指令基礎模型。我們的錯誤分析和消融實驗強調了反事實推理的重要性：PragReST主要減少了因未能將觀察到的話語與合理替代品進行對比而造成的錯誤，並且去除反事實推理會顯著降低性能。此外，我們的訓練保持了在一般知識和數學推理基準上的域外性能。

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

2606.18613v1 by Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

摘要：醫療 LLMs 在近期最可能的角色是協助而非取代醫生，但目前的評估往往測試孤立的能力：臨床知識、EHR 系統互動或病人溝通。醫生的協助需要在同一互動中協調這些能力，在這裡醫生發出不明確的請求，病人模糊地描述症狀，而 EHR 系統則需要精確的工具使用。我們引入了 PhysAssistBench，一個用於互動醫生-病人-EHR 協助的基準。PhysAssistBench 由真實的 MIMIC-IV 案例構建，使用可擴展的管道來構建具主動性的病人：互動的、基於記錄的代理，將靜態的 EHR 記錄轉化為多輪臨床場景，同時保持臨床事實性。PhysAssistBench 提供了一個經過策劃的雙語評估集，包含 1,296 個手動審核和醫生驗證的回合。與領先的 LLMs 進行的實驗顯示，當前模型在這種環境中仍然不可靠，這暴露了臨床 LLMs 的一個關鍵瓶頸：可靠的協助需要在知識、溝通和系統之間進行協調，而不是在任何一個方面的孤立增長。

2606.18566v1 by Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

摘要：人群計數是計算機視覺中的一項基本任務。然而，儘管在現實世界中具有實際重要性，低光環境下的人群計數仍然在很大程度上未被探索。現有的方法主要集中在光線良好的場景上，或依賴單一模態的紅綠藍（RGB）表示，這在極度黑暗和複雜的非均勻照明下往往變得不可靠。為了解決這個問題，我們構建了三個新的低光人群計數基準，這些基準由兩個合成數據集SHA_Dark和SHB_Dark，以及一個現實世界基準LC-Crowd（低光人群數據集）組成。受到基於Retinex的物理建模的啟發，我們引入深度和Canny邊緣線索作為補充幾何和結構先驗，以增強低光條件下的內在反射率表示。我們提出了一個多模態超圖融合模塊，將RGB外觀、深度幾何和邊緣結構線索作為統一超圖中的節點，並通過動態超邊構建和信息傳遞明確捕捉它們的高階互補關係。此外，為了在密集預測中自適應地分配計算，我們提出了一個可變形矩形稀疏注意力（DRSA）模塊，通過錨點感知估計和自適應矩形窗口建模將計算集中在信息豐富的區域。基於這些設計，我們開發了一個統一的低光計數網絡（LCNet），以實現穩健的低光人群計數。在三個基準上的廣泛實驗表明，所提出的方法在現有的最先進（SOTA）方法中實現了最佳的整體性能。代碼在補充材料中。數據集將在接受後公開。

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

2606.18557v1 by Patrick Cooper, Alvaro Velasquez

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

摘要：一個基於規則的邏輯解決器在我們的基準測試中以 50 微秒內解決每個實例，並且準確率達到 100%；最佳的前沿語言模型最多達到 65%，在渲染穩健評估下則降至 23.5%（在四次表面渲染的最壞情況下）。我們介紹 DeFAb（可駁回的歸納基準），這是一個數據集和生成管道，將四十年的公共資助知識庫轉換為可駁回歸納的形式化實例：通過覆蓋預設來構建解釋異常的假設，同時保留無關的期望。因為每個假設必須通過多項式時間檢查以驗證有效推導、保守性和最小性，DeFAb 使邏輯嚴謹成為衡量創造力和理論推理的工具，評分理論修訂的有序構建，而不是流暢但毀滅理論的散文。該管道將分類層次（OpenCyc、YAGO、Wikidata）與行為屬性圖（ConceptNet、UMLS）配對，以從 18 個來源生成 372,648+ 個實例，涵蓋 33.75M 的具體化規則，並設有三個層級和多項式時間可驗證的黃金標準。四個前沿模型並不可靠地內化可駁回推理：渲染穩健的 Level 2 準確率為 7.8-23.5%；思維鏈變異（約 36 pp）超過任何模型間的差距；而匹配的污染控制則隔離出 +19.4 pp 的 Level 3 差距。我們進一步發布 DeFAb-Hard（235 個實例的 Level 3 難度變體；最佳模型 53.3% 對比 100% 符號）和 CONJURE（560 個 Lean 4/Mathlib 實例的核心驗證轉化創造力變體，其黃金答案是證明核心之前未包含的定義，無評判的驗證者；一項試點發現零個新概念）。同一驗證者也作為偏好優化（DPO，RLVR/GRPO）的精確獎勵。根據 MIT 授權發布，網址為 https://huggingface.co/datasets/PatrickAllenCooper/DeFAb。

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

2606.18473v1 by Bo Su, Ankit Shah, Thai Le

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

摘要：機器遺忘對於大型語言模型（LLMs）的目標是移除特定知識，同時保留模型的其他能力。然而，遺忘的知識與保留的知識之間的界限往往不明確，因為相關甚至遙遠的信息可能在模型中交織在一起。在本文中，我們從數據中心的角度研究LLM的遺忘，並測量遺忘效應如何從遺忘集傳播到同域和異域知識。我們發現了一個一致的衰減模式：附帶損害在遺忘集附近最強，隨著語義距離的增加而減弱，但在領域邊界並不會消失。我們進一步詢問這種損害是否可以在執行遺忘之前進行審核。我們將遺忘集審核形式化為一個預遺忘預測任務，並分析哪些數據特徵最能預測下游損害。我們的結果顯示，遺忘集與評估集之間的交互特徵提供了最強的信號，這表明附帶損害在模型更新發生之前部分反映在數據幾何中。這些發現將遺忘集審核定位為識別風險遺忘運行和設計更可靠遺忘程序的早期預警工具。

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

2606.18444v1 by Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). In particular, TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The approach constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to manage severe class imbalance and emphasize discriminative learning, a composite loss function combining an Information Noise Contrastive Estimation (InfoNCE)-based contrastive loss with Focal Loss is introduced. This integration helps improve fraud identification while reducing false negatives.

摘要：近年來，由於數據高度不平衡、詐騙模式不斷演變以及交易實體之間的複雜關係結構，信用卡詐騙檢測面臨重大挑戰。為了解決這些問題，本研究提出了一個名為時間感知多關係引導圖神經網絡（TMR GGNN）的新框架。特別是，所提出的TMR GGNN通過在時間窗口內建模客戶、商家、設備和IP之間的異質互動，擴展了編碼器-解碼器圖神經網絡GNN架構。隨後，所提出的TMR GGNN方法構建了一個動態的多關係圖，並在編碼器內部引入了一種時間感知關係注意機制，以根據時間接近性和語義上下文自適應地加權交易相關性。因此，解碼器採用了對比學習模塊，以區分真實和合成的交易模式，同時提高模型對稀有詐騙案例的泛化能力。此外，為了有效管理嚴重的類別不平衡並強調區分學習，提出了一種結合基於信息噪聲對比估計（InfoNCE）的對比損失與焦點損失的復合損失函數。這一整合有助於改善詐騙識別，同時減少假陰性。
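The composite loss mentioned in the TMR-GGNN abstract (an InfoNCE-based contrastive term combined with Focal Loss) can be illustrated with a minimal sketch. The abstract does not say how the two terms are weighted, so the mixing weight `lam`, the temperature, and the focal parameters `gamma` and `alpha` below are assumptions chosen only for illustration, not the authors' settings.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive similarity."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def focal_loss(p_fraud, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def composite_loss(sim_pos, sim_negs, p_fraud, label, lam=0.5):
    """Weighted sum of both terms; `lam` is a hypothetical mixing weight."""
    contrastive = info_nce(sim_pos, sim_negs)
    classification = focal_loss(p_fraud, label)
    return lam * contrastive + (1.0 - lam) * classification
```

The focal term shrinks the contribution of confidently classified transactions, which is one common way to keep the abundant legitimate class from dominating training on imbalanced data.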

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

2606.18379v1 by Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo, Hong Li, Xian Chen, Li Yu, Ke Pan, Sri Reddy, Mahesh Srinivasan, Nipun Mathur, Haomin Yu, Hong Yan

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems -- graph construction, representation learning, and real-time serving -- yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN -- this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure -- this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.

摘要：圖形基於檢索在十億節點規模上需要共同解決三個緊密耦合的問題——圖形構建、表示學習和實時服務——然而現有的工作各自獨立處理這些問題。我們提出了 RankGraph-2，這是一個在 Meta 部署的框架，為基於相似性的檢索（U2U2I 和 U2I2I）共同設計所有三個生命周期階段，其中每個階段的需求影響其他階段。服務需要共同學習的集群索引，以避免昂貴的在線 KNN——這將索引共同訓練推入訓練目標。訓練受益於這樣的觀察：基於相似性的檢索容忍預計算的鄰域，消除了在線圖形基礎設施——這要求構建產生自包含的數據。構建還必須支持小時級別的刷新以覆蓋項目。根據這些級聯需求，RankGraph-2 通過帶有受歡迎度偏差修正的子採樣將數百萬億的邊減少到數百億，通過個性化 PageRank 預計算多跳鄰域，並共同學習一個殘差量化集群索引，將服務計算成本降低 83%。這種生命周期共同設計使得簡單的架構能夠在二分圖上實現比 GAT + Deep Graph Infomax 模型高 3.8 倍的召回率，並比 PyTorch-BigGraph 在項目檢索上高 2.1 倍。RankGraph-2 提供高達 +0.96% 的 CTR 和 +2.75% 的 CVR，並已支持 20 多個主要平台的檢索啟動。
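The RankGraph-2 abstract mentions pre-computing multi-hop neighborhoods via personalized PageRank. The following is a toy sketch of that general technique using plain power iteration on a small adjacency dictionary. The restart probability `alpha`, the iteration count, and the helper names are assumptions for illustration; a billion-node deployment would rely on approximate, distributed methods that this sketch does not attempt to reproduce.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for PageRank with all restarts going to `source`.

    `adj` maps every node to a list of out-neighbours; each neighbour
    must also appear as a key of `adj`.
    """
    rank = {n: 0.0 for n in adj}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        nxt[source] += alpha  # restart mass
        for n, out in adj.items():
            if not out:
                # dangling node: send its walk mass back to the source
                nxt[source] += (1.0 - alpha) * rank[n]
                continue
            share = (1.0 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

def top_k_neighbors(adj, source, k, **kwargs):
    """Return the k highest-scoring nodes other than the source itself."""
    rank = personalized_pagerank(adj, source, **kwargs)
    scored = [(score, n) for n, score in rank.items() if n != source and score > 0]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [n for _, n in scored[:k]]
```

Because the resulting neighbor lists depend only on the graph snapshot, they can be stored alongside the training data, which matches the abstract's point that pre-computed neighborhoods remove the need for online graph infrastructure.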

2606.18235v1 by Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.

摘要：零-shot 物體目標導航 (ZS-OGN) 要求具身體的代理在沒有任何先前訓練的情況下探索並定位目標物體。為此，最近的方法利用基礎模型。但它們通常依賴於靜態先驗，缺乏適應性，這導致重複錯誤和昂貴的試錯過程。在本文中，我們提出了一個自我演變的 ZS-OGN 框架，能夠實現持續的測試時改進。具體而言，我們通過從過去的軌跡中提取可行的知識來構建一個代理規則記憶。然後，我們提出了一種基於上置信界的檢索策略，通過平衡語義相關性和歷史成功來選擇有效的規則。此外，我們引入了一個記憶引導的預反模塊，預測行動前的潛在結果，減少低效的探索。大量實驗表明，我們的方法在成功率上超越了現有的零-shot 基準，實現了 10.1% 的成功率提升，並且步驟更少。
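The abstract above describes retrieving rules with an upper-confidence-bound criterion that balances semantic relevance against historical success. One way such a score could look is sketched below. The additive combination, the Laplace-smoothed success rate, the exploration constant `c`, and the rule dictionary layout are all assumptions made for illustration; the abstract does not give the actual formula.

```python
import math

def ucb_score(relevance, successes, uses, total_uses, c=1.0):
    """Relevance + smoothed success rate + UCB exploration bonus."""
    mean_success = (successes + 1) / (uses + 2)  # Laplace smoothing
    bonus = c * math.sqrt(math.log(total_uses + 1) / (uses + 1))
    return relevance + mean_success + bonus

def retrieve(rules, k, c=1.0):
    """Return the names of the k rules with the highest UCB score."""
    total = sum(r["uses"] for r in rules)
    ranked = sorted(
        rules,
        key=lambda r: ucb_score(r["relevance"], r["successes"], r["uses"], total, c),
        reverse=True,
    )
    return [r["name"] for r in ranked[:k]]

rules = [
    {"name": "A", "relevance": 0.9, "successes": 1, "uses": 10},
    {"name": "B", "relevance": 0.8, "successes": 9, "uses": 10},
    {"name": "C", "relevance": 0.2, "successes": 0, "uses": 0},
]
```

With `c=0.0` the ranking is purely exploitative and prefers rule B, which is both relevant and historically successful. With `c=1.0` the untried rule C rises to the top because its exploration bonus is largest.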

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

2606.18222v1 by Joy Bose

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

摘要：我們介紹 Darshana Graph，這是一個包含超過 125,000 條文本記錄的語料庫，涵蓋了古典印度教、佛教和耆那教的哲學傳統，資料來源包括《博伽梵歌》、《梵天經》、《主要奧義書》、《巴利經典》以及核心耆那教文本，這些資料均來自公共領域和公開授權的翻譯。其獨特的貢獻在於一個結構上獨特的子集，約有 8,500 條印度教和耆那教的記錄，其中相同的根本經文或經句在代表五個維丹塔學派和其他 darshanas 的十八位歷史評論家之間對齊，使得獨立的詮釋傳統能夠直接比較如何解讀相同的來源材料。據我們所知，沒有任何公開可用的資源能在這個規模上提供可比的跨評論家對齊。我們基於這個語料庫呈現了兩項分析。首先，一個透明的文體計量比較，無需機器學習，通過經文引用密度、明確反駁率和句子複雜性來衡量論證風格。它發現引用密度與反駁率之間存在中等的負相關性，在三位評論家中，相關教義系譜的反駁率顯著增加，並且在巴利經典內部也存在可測量的類別層級差異。其次，我們描述了一個受限的大型語言模型管道，該管道使用預定義的關係詞彙和確定性的事後驗證，提取概念之間的類型哲學關係。結果圖顯示了跨學派的分歧模式，同時揭示了重要的提取限制，包括獨立的嵌入基礎分析與圖派生結果不一致的情況。我們發布了完整的語料庫、提取的關係圖和所有源代碼。

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

2606.18216v1 by Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.

摘要：知識蒸餾將教師的能力轉移到小型學生，但在小型學生的範疇內卻顯得脆弱：迫使學生模仿來自更大教師的邏輯輸出，會使其集中於教師最尖銳的模式，從而損害在訓練語料庫之外的基準家庭上的泛化能力。強化學習（RL）通過在學生自己的回合上進行訓練來避免邏輯模仿。然而，在每個回合都失敗的問題上——產生零優勢並被靜默丟棄——將更強的教師反應注入策略梯度會打破在政策上的假設並引起漂移。我們引入了近端政策優化區域（ZPPO），靈感來自維果茨基的近端發展區，該方法將教師保持在提示內而非策略梯度中。在困難問題上，ZPPO 構建了兩個重新格式化的提示：二元候選人包含問題（BCQ）將一個正確的教師反應與一個不正確的學生反應配對，作為學生必須區分的匿名候選者，而負候選人包含問題（NCQ）將學生的錯誤回合聚合成一個單一提示，以顯示其共同的失敗模式。一個提示重播緩衝區會循環每個困難問題，直到它要麼畢業——學生在該問題上的平均回合準確率達到一半——要麼在有限容量下以先進先出方式驅逐，從而在學生當前的近端發展區內放大 BCQ 和 NCQ。在 Qwen3.5 系列的四個學生規模（0.8B-9B）上，使用一個 27B 的教師，作為視覺-語言模型進行後訓練，並在 31 個基準套件（16 VLM，10 LLM，5 視頻）上進行評估，ZPPO 的表現超過了離線/在線蒸餾和 GRPO，在最小規模下獲得了最大的增益。
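The prompt replay buffer described in the ZPPO abstract keeps a hard question in circulation until the student's mean rollout accuracy on it reaches one half, or until it is evicted first-in-first-out under finite capacity. A minimal sketch of that bookkeeping follows. The class name, method names, and the choice to store only the latest mean accuracy per question are assumptions for illustration, not the authors' implementation.

```python
from collections import OrderedDict

class PromptReplayBuffer:
    """FIFO buffer of hard questions with a graduation threshold."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items = OrderedDict()  # question id -> latest mean rollout accuracy

    def add(self, qid, mean_accuracy=0.0):
        """Insert a hard question; return the evicted id if the buffer was full."""
        if qid in self._items:
            return None  # already tracked; keep its original FIFO position
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # drop the oldest entry
        self._items[qid] = mean_accuracy
        return evicted

    def update(self, qid, mean_accuracy):
        """Record a new accuracy; return True if the question graduated."""
        if qid not in self._items:
            return False
        if mean_accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        self._items[qid] = mean_accuracy  # position in the queue is unchanged
        return False

    def pending(self):
        """Question ids still in circulation, oldest first."""
        return list(self._items)
```

Each training step would draw questions from `pending()`, build the reformulated prompts for them, and call `update` with the freshly measured accuracy.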

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

2606.18205v1 by Diaa Fayed, Laurent Romary

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.

摘要：本文提出了一種穩健的方法論，用於系統化數位化和編碼《阿爾-毛里德阿拉伯語-英語詞典》，將其從傳統印刷資源轉變為標準化的計算詞彙庫。針對阿拉伯語詞彙基礎設施中的一個重大空白，本研究採用了雙標準框架，將ISO詞彙標記框架（LMF）與文本編碼倡議TEI Lex-0指導方針對齊。通過對詞典的宏觀和微觀結構應用編輯視角，研究解決了20世紀雙語詞典中典型的結構模糊性和標點不一致性。該方法論建立在對詞典詞彙知識密度的實證分析之上。研究基於一個代表性樣本（字母Ayn，佔總體積的4.6%），為編碼過程提供了科學依據，顯示出91%的結構解析準確率。對信息提取規則的定量評估顯示出高效能，同義詞的精確度為85%，召回率為98%，而其他形態語義特徵的精確度為88%。除了技術描述外，本文還對現有的阿拉伯語詞彙資源進行了批判性比較，並討論了在建模特定阿拉伯現象（如隱含的“開放集”語義關係和分散的形態線索）時TEI Lex-0的局限性。此外，本研究探討了通過建立可擴展的基於前綴的參考系統來整合語言連結開放數據（LLOD）的潛力，該系統促進了資源在語義網中的納入。最終結果是一個可互操作的、機器可處理的資源，為阿拉伯語自然語言處理和數位人文學科社群內複雜傳統雙語詞典的回溯數位化提供了可重複的工作流程。

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

2606.18192v2 by Nick Bettencourt, Xiaowei Ding, Kay Giesecke

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

摘要：隨著高品質公共網路語料庫逐漸枯竭，乾淨的長上下文文件已成為大型語言模型（LLMs）訓練數據的稀缺且昂貴的來源。現有的長上下文語料庫往往是專有的，獲取成本高昂，或是合成生成的，或者集中於狹窄的領域，如程式設計。我們介紹了斯坦福EDGAR檔案數據集（SEFD），這是一個將SEC檔案重建為佈局忠實的MultiMarkdown的開放數據集，用於金融語言建模和評估。SEFD使經過審計的財務報表、風險披露、所有權報告、會計註釋和市場影響事件檔案可用作長上下文的預訓練數據，以及金融推理、預測、合規性和文件理解的基礎。生成的語料庫在標記效率上表現良好，適合模型使用，並且與Common Crawl衍生的語料庫重疊率低於0.1%。我們發布了SEFD-v1，這是一個152B標記的初始公共快照，並提供了對一個估計為550B標記的更大18.5M檔案存檔的語料庫級分析。我們進一步介紹了兩個基於SEFD的基準：EDGAR-Forecast，評估模型知識截止後的檔案基礎數字預測，以及EDGAR-OCR，評估複雜財務表格的轉錄。

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

2606.18154v1 by Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

摘要：建立個性化心臟電生理（EP）數位雙胞胎需要為每位患者識別適當的模型結構，而不僅僅是擬合參數。傳統方法依賴專家手動指定混合物理-神經架構，這需要深厚的領域專業知識，且無法在患者之間轉移。最近的研究已經應用大型語言模型（LLMs）來生成或作為混合模型。然而，儘管這些基於LLM的方法具有良好的泛化能力，但它們缺乏穩定心臟模擬所需的結構先驗。因此，我們提出LEADS，一個將心臟EP領域知識表述為結構化行動空間的框架，並利用LLM代理來發現混合模型。該代理遵循迭代推理和行動循環來選擇、組合和精煉混合模型，同時使用梯度下降來處理參數擬合。所提出的LEADS旨在使每個候選模型朝向物理基礎、可解釋且數值穩定的方向設計，同時允許開放式的架構發現。我們在具有三個真實反應模型的合成數據和真實心臟EP數據上驗證LEADS，證明其在性能上優於人類設計的混合模型和其他基於LLM的混合建模。

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

2606.18147v1 by Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

摘要：語言模型在醫學問答方面表現出色，在某些情況下超越了一般醫生的準確性。然而，回答有關可穿戴健康數據的問題仍然具有挑戰性且研究不足，因為這些無處不在的傳感器產生連續的、高維度的和長期的數據，這與 LLM 預訓練中的以文本為中心的分佈對齊並非易事。傳感器模態和用戶意圖的多樣性無法通過固定的推理工作流程或單一的預訓練基礎模型有效處理。為了解決這些挑戰，我們提出了 WEQA，一個查詢自適應代理框架，將 LLM 推理與專門的可穿戴分析和建模工具統一。我們使用 LLM 控制器來合成執行計劃，並動態地將每個查詢路由到適當的傳感器分析和預訓練模型的組合，並利用外部知識進行基於事實的回應審核。我們還策劃了一個基準，涵蓋四個開放的可穿戴數據集，包括三個不同健康領域的分析和預測任務。實驗表明，我們的框架比 LLM 和代理基準準確性高出 24%，而一項由 12 位醫學專家和 8 位用戶參與的盲測顯示在實用性和臨床合理性方面有顯著提升。

Knowledge Reutilization in Meta-Reinforcement Learning

2606.18132v1 by Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% to 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.

摘要：Meta-強化學習通過從相關任務中提取共享結構來實現快速適應，但現有的端到端方法通常將任務推斷與具體實體的控制結合在一起。這種結合可能會模糊非參數任務語義，降低樣本效率，並限制跨代理重用。我們提出了一個元知識重用框架，該框架在動力學簡化的代理上學習任務級知識並將其轉移到異質代理。該框架使用貝葉斯非參數先驗來組織潛在任務模式，並使用高級策略生成任務級幅度指導。為了將可重用的任務知識與不同的實體橋接，我們引入了一個語義-幅度接口和一個輕量級的時間適配器，將凍結的元知識轉換為時間對齊的子目標，以供具體實體的低級控制器使用。在多個運動代理上的實驗顯示，我們的框架相比於最近的最先進基準，將最終步驟跟蹤誤差降低了94.75%至99.79%，並且在約23.8%的交互數據下實現了可比的部署性能。

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

摘要：最近涉及用於心理健康支持的LLM事件揭示了一個關鍵的評估缺口：表面上的安全分數無法捕捉模型在現實情境中隨時間推移的情感敏感互動中的行為。現有的基準測量知識、安全性或靜態反應質量，但未能評估LLM互動是否幫助用戶持續反思、應對和自主做出決策。我們將這一缺失的維度正式化為認知萎縮（COGNITIVE ATROPHY），這是一種在AI介導的心理健康支持中與安全性和幫助性不同的過程層面行為測量。為了測量它，我們引入了認知萎縮基準（COGNITIVE ATROPHY BENCH），這是一個基於1,576個完全由人類生成的諮詢對話、15,680次回合和來自五個LLM的42,230個回應的臨床基準。三位臨床和神經心理學專家開發了一個涵蓋用戶背景、回應行為和全球風險標誌的20屬性架構；六位經過培訓的臨床審核員應用該架構並提供基於證據的評估，產生了5,324條審核判斷。我們進一步引入了用戶輸入風險指數（User-Input Risk Index, UIRI）、認知萎縮風險指數（Cognitive Atrophy Risk Index, ARI）和軌跡摘要。在五個LLM中，模型在單回合和多回合設置中顯示出一致的中到高水平的萎縮對齊行為。儘管模型通常對明顯的安全提示作出反應，但當用戶尋求解決方案或決策時，它們的適應性較低。主導的重複模式包括指導性建議、問題解決、推薦回應、主題轉換和可能加強依賴而非反思的驗證形式。我們的工作使認知萎縮可測量，並為審計敏感LLM對話中的模型行為提供了基礎。

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

2606.18114v1 by Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N

State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (from 2,687 MB to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100) -- approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.

摘要：狀態空間模型（SSMs）如 Mamba-2 提供線性時間推斷，但其記憶體佔用限制了邊緣部署。先前的三元 SSM 研究（Slender-Mamba）從 150B 代幣開始訓練；我們顯示預訓練的檢查點已足夠，將邊際代幣預算減少 1,000 倍。使用分組量化感知訓練（QAT）並從凍結的 FP16 教師進行知識蒸餾，我們將 Mamba-2 1.3B 壓縮至 3.61 倍（從 2,687 MB 減少至 744 MB），並在僅 102M 代幣（4 GPU 小時，單個 H100）中達到 48.1% 的零-shot 準確率（7 任務平均）——接近 Bi-Mamba 的 48.4%（在 +/-0.9pp CI 內）。這種從預訓練設定的 QAT 顯示出零比率崩潰，這是一種由可學習量化尺度引起的新穩定性問題，而在從零開始訓練中並未出現。我們進一步顯示，對於Transformer有效的事後修正策略因重複性導致的誤差累積而對 SSMs 失效。這些結果表明，三元 SSMs 不需要昂貴的從零開始訓練：從預訓練檢查點進行的 QAT 結合 KD 是一種數據高效的替代方案。
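
As a rough illustration of the "W1.58" in the title, the sketch below applies group-wise ternary quantization to a flat weight list and reports the fraction of zero codes. It assumes the absmean scaling rule known from BitNet b1.58; the abstract does not say which quantizer or group size the authors use, and in their setting the scales are learnable rather than recomputed from the weights. The names `ternary_quantize_group`, `quantize_grouped`, and `zero_ratio` are hypothetical.

```python
def ternary_quantize_group(weights, eps=1e-8):
    """Quantize one group of weights to codes in {-1, 0, +1} with a shared scale.

    Assumption: absmean scaling (scale = mean |w|), not confirmed by the abstract.
    """
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_grouped(weights, group_size):
    """Split a flat weight list into consecutive groups and quantize each one."""
    return [
        ternary_quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def zero_ratio(groups):
    """Fraction of all ternary codes that are exactly zero."""
    codes = [code for group_codes, _ in groups for code in group_codes]
    return codes.count(0) / len(codes)


if __name__ == "__main__":
    toy_weights = [0.9, -1.1, 0.05, 1.0, 10.0, -0.2, 0.1, -9.7]
    groups = quantize_grouped(toy_weights, group_size=4)
    for codes, scale in groups:
        print(codes, scale)
    print("zero ratio:", zero_ratio(groups))
    # Compression factor reported in the abstract: 2,687 MB -> 744 MB.
    print(round(2687 / 744, 2))  # 3.61
```

The zero ratio is presumably the quantity to watch for the "zero-ratio collapse" the abstract mentions. With a per-group scale, one group's outliers do not force the small weights of another group to zero, which is one plausible motivation for grouping.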

S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices

2606.18096v1 by Marco Deano, Filippo Ziche, Nicola Bombieri

Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and facilitating their deployment in practical, resource-constrained scenarios.

摘要：結構化狀態空間模型（SSMs），包括 S4 和 S4D 架構，最近已成為捕捉序列數據中長距依賴關係的強大替代方案，超越基於注意力的模型。儘管它們在實證性能上表現優異，但由於計算和記憶體需求，將這些模型部署在時間和資源受限的環境中仍然具有挑戰性。在本文中，我們提出了一種新穎的增量運算子級修剪方法，適用於基於 S4 和 S4D 的模型，該方法顯著降低了推理成本，同時保持預測性能。據我們所知，這是第一個系統性研究 SSMs 的結構化運算子修剪的工作。我們的方法通過將結構化遮罩與微調交錯進行，逐步修剪模型運算子，同時共同監控準確性和推理延遲。我們在一個統一的訓練和評估框架內實現了這種方法，使得系統性探索效率與準確性之間的權衡成為可能。在多個基準數據集上的實驗顯示，修剪多達 70% 的模型運算子在大多數情況下保持了原始模型的性能，同時顯著降低了推理延遲。這些結果表明，結構化運算子修剪是一種有效且之前未被探索的策略，用於提高 SSMs 的效率，並促進其在實際資源受限場景中的部署。
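
The loop described in the abstract (mask a few operators, fine-tune, check accuracy and latency, repeat) can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: the importance ranking, the `evaluate` and `fine_tune` callables, and the stopping tolerance are all hypothetical placeholders supplied by the caller.

```python
def incremental_prune(operators, importance, evaluate, fine_tune,
                      step=0.1, max_ratio=0.7, tolerance=0.01):
    """Prune operators in rounds, least important first.

    evaluate(active) must return (accuracy, latency); fine_tune(active) is
    called after each masking step. Pruning stops at max_ratio or when the
    accuracy drop versus the unpruned baseline exceeds tolerance.
    """
    order = sorted(operators, key=lambda op: importance[op])
    active = set(operators)
    baseline_acc, baseline_lat = evaluate(active)
    history = [(len(active), baseline_acc, baseline_lat)]

    per_round = max(1, int(step * len(operators)))
    limit = int(round(max_ratio * len(operators)))
    pruned = 0
    while pruned < limit:
        target = min(limit, pruned + per_round)
        candidate = active - set(order[pruned:target])  # structured mask
        fine_tune(candidate)                            # recover accuracy
        acc, lat = evaluate(candidate)
        if baseline_acc - acc > tolerance:
            break  # reject this round and keep the last accepted mask
        active, pruned = candidate, target
        history.append((len(active), acc, lat))
    return active, history


if __name__ == "__main__":
    ops = [f"op{i}" for i in range(10)]
    scores = {op: i for i, op in enumerate(ops)}
    essential = {"op5", "op6", "op7", "op8", "op9"}

    def toy_evaluate(active):
        # Toy model: accuracy collapses once an essential operator is removed.
        return (0.9 if essential <= active else 0.5), float(len(active))

    kept, log = incremental_prune(ops, scores, toy_evaluate, lambda active: None)
    print(sorted(kept))
    print(log)
```

The returned history pairs each accepted pruning level with its accuracy and latency, which is the kind of record one would need to explore the efficiency-accuracy trade-off.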

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

2606.18092v1 by Wanhao Niu, Qiyan Ke, Yuan Sun, Hao Sun, Jie Xu, Muyuan Ma, Ruiqi Hu, Fuchun Sun

Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.

摘要：跨端執行器抓取生成尋求一個統一模型，能夠在物體和從平行夾具到靈巧端執行器的不同實體之間進行泛化。現有的抓取生成器通常是為固定實體設計，或使用靜態描述符編碼實體身份，這在拓撲、驅動耦合和接觸幾何顯著不同時削弱了轉移能力。我們提出了EAGG，一個與實體對齊的抓取生成器，該生成器用一個拓撲感知的端執行器圖和一個特定實體的低維端執行器控制空間來表示每個實體。一個凍結的端執行器認知骨幹將當前的關節狀態轉換為幾何感知的標記，這些標記作為可重用的形態先驗，而迭代幾何注入在採樣過程中不斷刷新這些標記，以便條件保持與不斷演變的端執行器幾何同步。在MultiGripperGrasp基準測試中，EAGG在六個訓練端執行器上達到56.17%的平均成功率，與專門訓練相差僅1.10個百分點，同時保持對微調和零樣本端執行器的轉移。迭代幾何注入進一步將合併的中位接觸距離從0.239厘米減少到0.189厘米。這些結果顯示，跨端執行器抓取生成通過在共享生成器內對齊實體結構而不是抑制實體差異得到了加強。代碼可在https://github.com/wanhaoniu/EAGG獲得。

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

2606.18075v1 by Haoyang Zhong, Yifei Sun, Antong Zhang, Chunping Wang, Lei Chen, Yang Yang

Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.

摘要：檢索增強生成（RAG）已成為增強大型語言模型（LLMs）與外部知識的範式，但現有的基於圖的方法面臨一個基本限制：以實體為中心和以區塊為中心的方法在原始文本的基礎上運作，卻沒有真正的知識融合。雖然以實體為中心的方法連接邏輯相關的內容，而以區塊為中心的方法保留上下文，但兩者都是通過相似性搜索分別檢索信息，錯過了其綜合所產生的理解。在本文中，我們提出了HyGRAG，一個層次化圖形RAG框架，通過解決三個核心挑戰超越源文檔：構建真正整合上下文和關聯信息的摘要，利用這些綜合表示在檢索過程中訪問新興知識，並有效更新動態語料庫的層次結構。具體而言，我們設計了基於混合圖的層次索引結構，包含區塊和實體節點，然後對它們進行迭代聚類並生成基於LLM的摘要。接著，我們設計了上下文和關係感知的檢索，能夠在所有抽象層次上進行搜索，同時通過社群成員資格進行擴展。此外，我們通過基於附加的算法實現動態知識更新，僅需進行局部重新摘要。實驗結果顯示，HyGRAG將多跳推理任務的平均準確率提高了9.7%，同時保持合理的效率。

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

摘要：醫學影像分類面臨一個根本性的困境：雖然深度學習模型在大規模下表現卓越，但現實世界的臨床情境常常因為標註成本、隱私限制和疾病稀有性而遭遇嚴重的數據匱乏。這一挑戰在病理性疤痕分類中尤為明顯，因為區分凹疤和肥厚性疤痕需要微妙的專家知識，而標註的影像極為有限。我們提出了一種新穎的範式，將大型語言模型（LLMs）重新定位為知識驅動的特徵工程師，而非端到端的分類器。我們稱這一框架為ScaFE（疤痕特徵工程）。我們的關鍵見解是，LLMs編碼了豐富的醫學知識，這些知識可以外部化為可執行的特徵提取代碼，使高維影像轉換為低維且臨床可解釋的表示。具體而言，我們使用既定的疤痕評估標準來提示LLM生成確定性的Python代碼，提取與臨床評分系統（如溫哥華疤痕量表）對齊的特徵。我們的方法提供了三個主要優勢：（1）數據效率，通過將知識獲取與統計學習解耦，實現有限訓練樣本下的穩健性能；（2）隱私保護，因為原始影像在本地處理，未暴露於外部LLMs；以及（3）可解釋性，通過基於臨床推理的明確特徵。對疤痕分類的廣泛實驗表明，我們的方法在有限數據條件下始終優於端到端的深度學習基準或將LLMs用作黑箱分類器，確立了將LLMs整合進數據高效且臨床透明的醫學AI系統中的有前景方向。
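
To make "deterministic feature extraction code" concrete, here is a minimal sketch of the kind of function an LLM might be prompted to write. It computes a crude redness contrast between scar and surrounding skin as a stand-in for the vascularity item of the Vancouver Scar Scale. The feature definition, the function names, and the pixel format are illustrative assumptions, not features taken from the paper.

```python
def redness_index(pixels):
    """Mean share of the red channel over a list of (r, g, b) pixels in 0..255."""
    total = 0.0
    for r, g, b in pixels:
        channel_sum = r + g + b
        total += r / channel_sum if channel_sum else 0.0
    return total / len(pixels)


def vascularity_feature(scar_pixels, skin_pixels):
    """Hypothetical vascularity proxy: scar redness minus surrounding-skin redness.

    Positive values mean the scar region is redder than the reference skin.
    """
    return redness_index(scar_pixels) - redness_index(skin_pixels)


if __name__ == "__main__":
    scar = [(200, 50, 50), (190, 60, 55)]
    skin = [(150, 120, 100), (155, 125, 105)]
    print(vascularity_feature(scar, skin))
```

Because the function is plain, deterministic code, it can run locally on raw images, and its single scalar output can feed a small classifier trained on few labeled samples, which matches the data-efficiency and privacy arguments in the abstract.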

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

2606.18062v1 by Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency by posing each prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

摘要：大型語言模型（LLMs）被廣泛用來滿足用戶的信息需求；用戶向LLMs詢問天氣、提出教育問題，並諮詢法律協助。數位安全和隱私（S&P）是一個特別少有研究的領域，用戶可能會尋求LLMs的幫助，以了解如何保護他們的在線帳戶或防止電腦遭受網絡攻擊。據我們所知，之前沒有研究收集或分析用戶向LLMs提出的S&P問題；先前對LLM回應質量的研究依賴於專家撰寫的S&P誤解或常見問題，而不是用戶查詢。基於WildChat，這是一個收集到的320萬用戶-LLM對話的數據集，我們的研究識別了14,727個S&P提示，並將其分類為九個類別，涵蓋各種S&P主題。從這些S&P提示中，我們抽取了450個並進行了主題分析，以特徵化用戶向LLMs提出的S&P問題。與主題分析分開，我們整理了270個尋求建議的S&P提示，用戶在這些提示中請求建議、指導或具體的S&P信息。我們測量了LLM在向其提出提示時的回應質量和一致性，進行了10次提問。我們發現商業LLMs在性能上優於開放權重模型（GPT 5.5在98%的提示中提供了「足夠好」的回應；Llama 4則為47%）。然而，在平均獲得高質量回應的提示中，商業模型有時會在不同的運行中產生矛盾的回應，這可能會使用戶感到困惑或誤導。

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

2606.18003v1 by Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle

Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.

摘要：集體自適應系統（CAS）越來越依賴機器學習，讓每個節點從本地感知數據中學習，並使其行為與周圍環境對齊。然而，擴展這種智能會帶來根本挑戰：感知數據通常涉及隱私問題，阻止集中收集；節點是移動的，穿越附近節點感知相似現象而遠端節點觀察到截然不同條件的區域，形成自然的空間集群；而且，由於移動性，這些分佈隨時間演變，引入了時間漂移，使得本地模型逐漸過時。這些動態在不同領域中出現——車輛感知、無人機監測、智能手機群眾感知——然而，隱私、空間異質性和時間漂移的相互作用嚴重削弱了傳統學習策略。因此，我們提出了C2FL，一種完全分散的聯邦學習（FL）方法，節點通過空間集群自組織成學習小組，反映環境的地理結構。為了抵消時間漂移，每個節點將經驗重播與考慮停留時間的自適應平均步驟相結合，隨著在同一區域停留時間的增長，逐步納入地區共識，同時在演變的分佈下保留先前獲得的知識。我們在系統性重現空間和時間變化的合成實驗中評估我們的方法，顯示標準的聯邦策略在這些條件下顯著退化，而我們的方法恢復了強健的集體適應能力。
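
The dwell-time-aware averaging step can be pictured as a convex combination of the local model and the regional consensus, with a mixing weight that grows the longer a node stays in one region. The sketch below assumes a saturating schedule `alpha = alpha_max * t / (t + tau)`; the abstract does not give the actual schedule, so the formula, `tau`, and `alpha_max` are hypothetical, and the experience-replay component is omitted.

```python
def mixing_weight(dwell_time, tau=5.0, alpha_max=0.9):
    """Weight given to the regional consensus; 0 on arrival, approaching alpha_max."""
    return alpha_max * dwell_time / (dwell_time + tau)


def adaptive_average(local, consensus, dwell_time, tau=5.0, alpha_max=0.9):
    """Blend local parameters with the regional consensus, element by element."""
    alpha = mixing_weight(dwell_time, tau, alpha_max)
    return [(1.0 - alpha) * l + alpha * c for l, c in zip(local, consensus)]


if __name__ == "__main__":
    local_params = [0.0, 2.0]
    regional_consensus = [1.0, 0.0]
    for t in (0, 5, 50):
        print(t, adaptive_average(local_params, regional_consensus, t))
```

A node that has just entered a region keeps its own parameters, which limits the damage from a consensus it has no evidence for yet, while a long-dwelling node converges toward the regional model.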

Dimensionality Controls When Modularity Helps in Continual Learning

2606.17889v1 by Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

摘要：組合學習系統必須在可塑性，即獲取新知識的能力，與穩定性，即保留先前學習的組件之間取得平衡，特別是在任務共享結構並存在干擾風險的情況下。我們研究模組化架構、任務相似性和表徵維度如何共同塑造序列 A-B-A 範式中的組合持續學習，將任務劃分的遞歸網絡與單一網絡基準進行比較，同時通過權重縮放操作引入高維和低維範疇。在高維的「懶惰」範疇中，兩種架構的性能和內部幾何形狀相似，這表明當表徵受到弱約束時，顯式模組結構的影響不大。在較低維的「豐富」範疇中，模組化變得至關重要：模組網絡發展出分級的任務特定子空間，對於相似任務重疊，對於中等不相似的任務部分對齊，對於不相似的任務則分開，從而產生比單一網絡更具組合性和可解釋性的組織。這些發現確定了由初始化規模引起的表徵範疇，該範疇與表徵維度共同變化，是決定何時組合模組結構在持續學習中功能上有益的關鍵因素，並支持將安全性和穩健性視為表徵子空間的自適應分配問題，而非固定的分離與共享問題。

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

2606.17887v1 by Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos

Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role- and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems.

摘要：生成式人工智慧（GenAI）在工作場所的部署正在迅速加速。然而，誰採用、誰受益、誰被遺留在外以及原因等問題仍然缺乏研究。在本文中，我們調查了這些動態，背景是一家跨國科技公司從傳統人力資源（HR）搜尋系統過渡到GenAI支持的系統，分析了搜尋日誌數據、調查數據（n=25）和十次半結構式訪談。我們的研究結果顯示，採用取決於GenAI系統的設計假設與員工的工作位置（角色、語言、任期）之間的契合度。此外，我們發現員工對GenAI答案的信任是通過檢查來源、系統之間的比較以及在有疑慮時向同事或HR尋求意見來建立的。我們的貢獻有兩個方面。首先，我們提供了在實時組織過渡期間工作場所GenAI採用的實證證據，顯示採用受到情境契合、搜尋素養和信任校準等因素的影響。此外，這也受到系統內容質量、員工培訓和指導等知識條件的進一步影響。其次，我們將這些發現轉化為高風險環境（如HR）中包容性部署和採用的設計考量。我們認為，組織應該設計系統時考慮到它們對不同社會群體所產生的角色和情境敏感的好處。他們還需要將組織知識基礎設施視為人工智慧基礎設施，以提高GenAI系統的問責性和可用性。

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

2606.17856v1 by Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary-query alignment with sentence-level matching to activate relevant entities under paraphrase and abstraction robustly. We then introduce a frequency-aware weighted flow module that routes relevance through entity-passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.

摘要：圖基檢索增強生成（GraphRAG）對於知識密集型和多跳查詢任務是有效的；然而，許多現有的方法主要以實體為基礎構建圖形，並依賴於隱式語義相關性傳播。這通常會在（i）當用戶查詢在實體層面上抽象且語義稀疏時，導致檢索不足，以及（ii）在脆弱的多跳推理中受到影響，噪聲激活可能會干擾實體到實體的轉換並腐蝕推斷的關係鏈，產生不可靠的結論。為此，我們提出了\texttt{FlowRAG}，一個語義感知的檢索框架，改善了語義召回和明確推理。具體而言，\texttt{FlowRAG}在段落、摘要、句子和實體上構建了一個四層異構圖，其中摘要節點作為粗略的語義中心。在檢索時，雙粒度激活模塊將摘要-查詢對齊與句子級匹配相結合，穩健地激活相關實體以應對同義詞和抽象。我們接著引入了一個頻率感知的加權流模塊，通過實體-段落鏈接路由相關性，這些鏈接根據段落內的詞頻加權，修剪噪聲連接並提取高置信度的推理路徑，作為生成的明確邏輯骨架。大量實驗表明，\texttt{FlowRAG}在複雜推理基準上獲得了最先進的性能。
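
The frequency-aware flow step can be approximated in a few lines: weight each entity-passage link by the entity's term frequency within the passage, drop links below a threshold, and push entity activations across the remaining links. The sketch below is only a toy reading of the abstract. Whitespace tokenization, single-token entities, the threshold `min_weight`, and the function names are all assumptions, and the summary and sentence levels of the graph are left out.

```python
from collections import Counter


def build_entity_passage_edges(passages, entities):
    """Map (entity, passage_id) to the entity's within-passage term frequency."""
    edges = {}
    for pid, text in passages.items():
        tokens = text.lower().split()
        counts = Counter(tokens)
        for entity in entities:
            tf = counts[entity.lower()] / len(tokens) if tokens else 0.0
            if tf > 0:
                edges[(entity, pid)] = tf
    return edges


def route_relevance(activation, edges, min_weight=0.0):
    """Send entity activations to passages, skipping links weaker than min_weight."""
    scores = {}
    for (entity, pid), weight in edges.items():
        if weight < min_weight:
            continue  # prune a low-frequency, likely noisy connection
        scores[pid] = scores.get(pid, 0.0) + activation.get(entity, 0.0) * weight
    return scores


if __name__ == "__main__":
    passages = {
        "p1": "paris is the capital of france paris",
        "p2": "france borders spain",
    }
    edges = build_entity_passage_edges(passages, ["paris", "france"])
    activation = {"paris": 1.0, "france": 0.5}
    print(route_relevance(activation, edges))
    print(route_relevance(activation, edges, min_weight=0.2))
```

Raising `min_weight` removes the weak `france`-`p1` link in this toy corpus, which shows how frequency weighting can keep a passing mention from contributing to a reasoning path.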

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

2606.17821v1 by Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.

摘要：大型語言模型（LLMs）在將自然語言翻譯成 SQL 方面展現了卓越的能力，但現有的方法在需要多步驟、數據感知推理的複雜查詢上仍然表現不佳。我們介紹了 DecoSearch，一個無需訓練的框架，通過將每個查詢路由到適當的推理努力水平來解決這個問題。一個輕量級的 Schema Selector 首先將完整的數據庫架構修剪到相關的表和列。然後，LLM Judger 決定問題是否需要分解：簡單的問題遵循直接生成路徑，而複雜的問題則升級到原子子問題的有向無環圖（DAG），每個子問題通過針對性的 SQL 生成步驟解決。一個 RAG 組件用語義相似的訓練範例為分解器提供支持，而 Topology Refiner 在執行失敗信號表明分解存在缺陷而非可修復的 SQL 錯誤時，重構推理計劃。DecoSearch 在 BIRD 上達到 70.53% 的執行準確率，在 Spider 上達到 88.31% 的執行準確率，搭配 DeepSeek 主幹，超越了所有無需訓練的基準，同時消耗的標記數量比競爭方法少一個量級。它還作為一個模型無關的包裝器，持續改善微調的 SQL 生成主幹，而不需要對管道進行任何修改。
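
Executing a DAG of atomic sub-questions amounts to solving each node only after its dependencies are solved. The sketch below uses Python's standard `graphlib.TopologicalSorter` for the ordering. The plan format, the `solve` callable (which would wrap an LLM-based SQL generation step), and the example questions are hypothetical; the Schema Selector, Judger, RAG grounding, and Topology Refiner are not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execute_plan(plan, solve):
    """Solve sub-questions in dependency order.

    plan maps a node id to {"question": str, "deps": [node ids]}.
    solve(question, inputs) receives the results of the node's dependencies.
    """
    graph = {node: spec["deps"] for node, spec in plan.items()}
    results = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in plan[node]["deps"]}
        results[node] = solve(plan[node]["question"], inputs)
    return results


if __name__ == "__main__":
    plan = {
        "q1": {"question": "Which customers are located in Taipei?", "deps": []},
        "q2": {"question": "What is the order total per customer?", "deps": []},
        "q3": {"question": "Which Taipei customer spent the most?", "deps": ["q1", "q2"]},
    }

    def stub_solve(question, inputs):
        # Stand-in for a targeted SQL generation call.
        return f"-- SQL for: {question} (uses {sorted(inputs)})"

    for node, sql in execute_plan(plan, stub_solve).items():
        print(node, sql)
```

`TopologicalSorter` raises `graphlib.CycleError` if the plan is not acyclic, which is a cheap structural check before any SQL is generated.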

A Framework for Evaluating Agentic Skills at Scale

2606.17819v1 by Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

摘要：代理技能——結構化、可重用的知識工件，增強了大型語言模型代理的能力——在業界迅速被採用，然而它們在商業和開源模型中的跨領域影響及使用仍未受到充分研究，且缺乏可重用的方法論來評估單一技能。在這項工作中，我們提出了一個評估框架，讓技能作者構建現實的任務，以嚴格評估對他們最重要的技能方面，並通過解決這些任務來估計技能的實用性。此外，我們將我們的評估方法擴展應用於500個現實世界的技能，生成1,000個從技能內容衍生的任務，以及遵循指令和目標完成的評分標準。使用這些指標，我們評估了19種代理模型配置，包括專有和開源模型，在這些任務上的表現。我們的結果顯示，模型在遵循技能中編碼的指令方面差異很大，導致其性能增益有顯著差異。此外，我們顯示，與無技能設置相比，訪問技能顯著改變了模型行為，提供了一種將有見地的工作流程編碼到大型語言模型代理中的基本機制。我們釋放了我們的評估數據集，以支持未來對代理技能的研究。

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

2606.18310v1 by Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, such as crafting a malicious corpus. However, the synthetic text crafted by such data-centric methods could be detectable, leading to the failure of attacks. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model's parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

摘要：注入惡意知識到檢索增強生成（RAG）系統中可以操縱檢索到的證據並誤導下游生成，對人工智慧應用構成嚴重的安全威脅。現有的 RAG 注入攻擊主要依賴於操縱外部知識庫，例如製作惡意語料。然而，這種數據中心方法製作的合成文本可能是可檢測的，導致攻擊失敗。除了語料操縱之外，開源檢索器越來越多地使 RAG 系統暴露於以模型為中心的攻擊。在本文中，我們提出了衝突感知檢索器編輯，即 CAREATTACK，一個針對 RAG 中惡意知識注入的以模型為中心的檢索器攻擊框架。具體而言，CAREATTACK 包含兩個階段的衝突感知檢索器編輯和攻擊保留的錨點修復。衝突感知檢索器編輯將高效的閉式參數編輯應用於密集檢索模型，促進惡意知識超越良性競爭段落，並通過基於圖的衝突檢測和參數編輯投影解決潛在的參數衝突。然後，攻擊保留的錨點修復對編輯過的檢索器進行輕量級校準，以進一步消除對非目標提示的影響，同時保留對目標提示的攻擊有效性。我們在 Qwen3-Embedding-0.6B 和 BGE-M3 上實現了 CAREATTACK，並在三個基準數據集上進行了評估。實驗結果表明，我們的方法顯著促進了惡意段落進入 RAG 系統的檢索知識中，並能夠對一批目標提示和段落執行攻擊，前提是能夠訪問檢索模型參數。由於大多數 RAG 系統是基於開源檢索模型構建的，這項工作揭示了 RAG 系統中的一個實際攻擊面。代碼可在 https://anonymous.4open.science/r/CareAttack-3F1C 上公開訪問。

LLMs Infer Cultural Context but Fail to Apply It When Responding

2606.17688v1 by Yisong Miao, Jian Zhu, Vered Shwartz

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

摘要：最近的研究顯示，大型語言模型（LLMs）過度代表主導文化，特別是西方文化，同時邊緣化其他文化。我們調查這是否影響模型生成文化適應性回應的能力，通過評估它們根據用戶感知的文化背景使用當地測量單位的情況。我們引入了文化與實用回應推斷（CAPRI），這是一個具有不同文化提示水平的對話數據集。對於最先進的LLMs的實驗顯示，模型可以推斷文化背景並回憶相關的慣例，但通常未能利用這些信息來使其回答適應相關的文化慣例，除非明確提示它們按順序執行任務。我們進一步評估對時間和數量表達的解釋的適應性，這是受文化影響的兩個主觀語言基礎維度。我們發現，隨著文化提示的累積，模型越來越能夠適應其回答，但它們的先驗並不是文化中立的，有時與模型的原產國相符。總體而言，CAPRI為未來旨在縮小文化知識與文化適應性語言生成之間差距的研究提供了一個資源。

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

2606.18309v1 by Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

摘要：大型語言模型（LLM）去學習的目的是在保留已保留能力的同時，去除不必要的知識或行為。當前的去學習方法都涉及去學習與保留之間的權衡。我們發現保留激活偏差也可以用來量化去學習方法對保留造成的損害，而不考慮去學習過程的具體實施。這使我們能夠使用後驗方法恢復任何去學習方法的保留性能。因此，我們提出了一個互補的後驗設置，以清理最終更新向量，而無需重新運行原始的去學習管道。在這個設置中，我們設計了SAGE，光譜激活幾何清理，這是一種與來源無關的最終去學習更新的修正。SAGE從小型保留代理收集真實模塊輸入，提取其主導激活幾何，並以封閉形式解決一個以來源為基礎的優化目標，該目標抑制與高能保留方向對齊的更新組件，同時保留來源方法的遺忘載體。在多種去學習方法、模型規模和基準測試中，SAGE持續減輕保留與遺忘之間的權衡，將最終向量的後驗清理確定為機器去學習的一個實用且未被充分探索的方向。
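
The core idea of suppressing update components aligned with high-energy retained directions can be illustrated for a single linear module `y = W x`. If the retain inputs `x` concentrate along a direction `u`, removing the `u`-component from each row of the update `ΔW` makes `ΔW x` small on retain data. The sketch below uses a hard projection onto one dominant direction found by power iteration. This is a deliberate simplification: SAGE solves a source-anchored objective in closed form, whose exact formulation the abstract does not give, and all names here are hypothetical.

```python
import math


def dominant_direction(activations, iters=50):
    """Unit vector approximating the top eigenvector of X^T X via power iteration."""
    dim = len(activations[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in activations]   # X v
        w = [sum(p * row[j] for p, row in zip(proj, activations))           # X^T (X v)
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def sanitize_update(update_rows, direction):
    """Remove from each row of the update its component along `direction`."""
    cleaned = []
    for row in update_rows:
        coeff = sum(r * d for r, d in zip(row, direction))
        cleaned.append([r - coeff * d for r, d in zip(row, direction)])
    return cleaned


if __name__ == "__main__":
    retain_inputs = [[2.0, 0.0], [-3.0, 0.0], [1.0, 0.0]]  # energy along the first axis
    delta_w = [[3.0, 4.0]]                                  # one-row unlearning update
    u = dominant_direction(retain_inputs)
    print(u)
    print(sanitize_update(delta_w, u))
```

After sanitization the update no longer changes the module's output on inputs along the dominant retain direction, while the component orthogonal to it, which may carry the forgetting signal, is untouched.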

Handling Feature Heterogeneity with Learnable Graph Patches

2606.17667v1 by Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

摘要：近年來，基礎模型和圖形預訓練技術的快速發展引發了對構建通用預訓練圖形模型或圖形基礎模型（Graph Foundation Model, GFM）的日益關注。然而，一個重大挑戰是現有模型無法在沒有文本信息的情況下解決圖形數據中的特徵異質性，這妨礙了圖形模型在不同數據集之間的可轉移性。為了填補這一空白，我們提出了可學習圖形補丁的概念，將其視為任何圖形數據的最小語義單元。我們通過展開節點特徵並分別構建相應的補丁結構，將圖形分解為可學習的圖形補丁。然後，我們設計了一個框架，從不同領域的圖形數據中挖掘可轉移的信息。具體而言，在提取圖形補丁後，我們提出了一個補丁編碼器來從每個單元中提取知識，以及一個補丁聚合器來學習這些單元如何組合成整體。由於其領域無關的特性，該模型可以應用於不同領域的下游數據。此外，我們分析了我們的方法與現有圖形模型之間的聯繫，以及它所生成的節點嵌入的可轉移性。實證結果表明，我們的方法不僅實現了使用多領域圖形進行預訓練的能力，還在各種下游數據集和任務中顯示出增強的性能。此外，我們觀察到，隨著預訓練數據量的增加，下游性能持續改善。

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

2606.17646v1 by Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive -- coherent to user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. Evaluating on face expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

摘要：顯著性圖視覺化通過指向區域來解釋基於圖像的人工智慧預測，但這些通常不直觀且語義不清，留下了解釋性差距。我們認為，人工智慧的解釋應該是直觀的——與用戶知識一致，但又簡單且具選擇性，以加速解釋。受到藝術繪畫的啟發，我們提出了SketchXplain，用於生成基於草圖的視覺解釋，以實現直觀的基於圖像的可解釋人工智慧（XAI）。SketchXplain結合了顯著性圖、概念瓶頸模型和草圖優化技術，整合顯著性以選擇一致的觀察工件、知識一致性的概念、表示它們的提示以及簡單性的抽象。在面部表情識別的評估中，建模和用戶研究顯示，SketchXplain支持比顯著性圖或簡單繪圖更快的解釋，並且視覺化更一致。對皮膚病變診斷的進一步評估發現，SketchXplain更一致地視覺化疾病症狀，更好地支持非專業診斷。因此，這項工作說明了草圖在直觀、簡單、一致和快速的基於圖像的XAI視覺化中的價值。

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

2606.17637v1 by Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.

摘要：建築管理系統（BMS）對於優化現代建築的能源效率和運營性能至關重要。然而，不同製造商的 BMS 點缺乏標準化，造成了整合和數據利用的重大障礙。雖然 Brick 架構提供了建築系統的標準化本體，但將 BMS 點映射到適當的 Brick 類別面臨三個關鍵挑戰：（i）大量的 Brick 類別（最新版本中有 936 個），（ii）大型語言模型（LLMs）中有限的領域專業知識，以及（iii）驗證所需的大量手動工作。為了解決這些挑戰，我們提出了 Brick-DICL，一種兩階段動態上下文學習框架，用於自動化 Brick 架構分類。Brick-DICL 包含兩個主要組件：metadata-RAG，該組件檢索相關示例以增強 LLM 的領域知識，以及 class-RAG，該組件縮小潛在的 Brick 類別以應對龐大的分類空間。此外，我們實施了一種多 LLM 過濾機制，該機制比較多個模型的預測，並標記低信心的分類以供人工審查。因此：（i）一般性：Brick-DICL 適用於任何建築管理系統，無論製造商或元數據格式如何；（ii）新穎且強大：作為第一個動態上下文學習方法用於 Brick 架構分類，Brick-DICL 在建築數據集上實現了顯著的分類準確性提升，超越了現有方法；（iii）高效：我們的多 LLM 過濾策略減少了手動驗證的工作量，使快速的數位建築上線成為可能。廣泛的實驗證明了 Brick-DICL 在多樣化建築數據集上的有效性，加速了邁向標準化、可互操作的建築管理系統的進程。
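The multi-LLM filtering idea in this abstract (compare predictions across models, flag low-confidence ones for human review) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `filter_predictions`, the `min_agreement` threshold, and the use of a simple majority vote as the confidence signal are all assumptions, and the class labels are only example strings.

```python
from collections import Counter

def filter_predictions(predictions, min_agreement=1.0):
    """Majority-vote over per-model labels (hypothetical helper).

    predictions: list of predicted class labels, one per model.
    Returns (majority_label, needs_review), where needs_review is True
    when the share of models voting for the majority label is below
    min_agreement.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    return label, agreement < min_agreement

# Example labels for one BMS point, as predicted by three models.
unanimous = ["Zone_Air_Temperature_Sensor"] * 3
split = [
    "Zone_Air_Temperature_Sensor",
    "Zone_Air_Temperature_Sensor",
    "Supply_Air_Temperature_Sensor",
]
print(filter_predictions(unanimous))
print(filter_predictions(split))
```

With the default threshold of 1.0, any disagreement between models sends the point to review; lowering `min_agreement` trades review effort against the risk of accepting a contested label.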

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

2606.17634v1 by Dong Huang, Jianbo Sun, Pengkun Yang

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.

摘要：評估大型語言模型 (LLMs) 對於理解其能力、比較競爭系統以及支持可靠模型在實踐中的部署至關重要。對於開放式任務，成對評估已成為一種流行的範式，其中比較對同一提示的兩個回應，並將結果判斷匯總成整體排名。這一範式的一個核心挑戰是非傳遞性：所引發的比較結果可能無法支持任何一致的全局排名。例如，可能會觀察到循環偏好，如 $A \succ B \succ C \succ A$，或涉及平局的不一致性，如 $A \equiv B\equiv C\neq A$。這些矛盾使得最終的排行榜不穩定且難以解釋。在本文中，我們提出了一個提示擾動框架，以改善成對 LLM 評估的一致性。我們的方法生成每個提示的擾動變體，使用結果比較圖來識別和過濾結構上不一致的比較模式，然後將標準排名方法應用於過濾後的比較。所提框架的一個關鍵特徵是，在排名匯總之前，圖級結構一致性被明確納入評估流程中。這提供了一種簡單且有原則的方法來減少循環不一致性，並提高 LLM 排名的可靠性。
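To make the intransitivity problem in this abstract concrete, the sketch below finds cyclic preferences of the form $A \succ B \succ C \succ A$ in a set of pairwise outcomes and drops the comparisons involved. It is only an illustration under simplifying assumptions: it looks at 3-cycles only, ignores ties, and the drop-everything-in-a-cycle rule is a stand-in, not the filtering procedure of the paper. The function names are hypothetical.

```python
from itertools import permutations

def find_preference_cycles(wins):
    """Return all 3-cycles (a, b, c) with a > b > c > a.

    wins: set of (winner, loser) pairs from pairwise judgments.
    Each cycle is reported once, rotated so its smallest name comes first.
    """
    models = sorted({m for pair in wins for m in pair})
    cycles = []
    for a, b, c in permutations(models, 3):
        is_canonical = a < b and a < c
        if is_canonical and (a, b) in wins and (b, c) in wins and (c, a) in wins:
            cycles.append((a, b, c))
    return cycles

def drop_cyclic_comparisons(wins):
    """Remove every comparison that takes part in some 3-cycle."""
    cyclic = set()
    for a, b, c in find_preference_cycles(wins):
        cyclic |= {(a, b), (b, c), (c, a)}
    return wins - cyclic

outcomes = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")}
print(find_preference_cycles(outcomes))
print(drop_cyclic_comparisons(outcomes))
```

The comparisons that survive can then be passed to any standard ranking method; the point of the sketch is only that consistency is checked on the graph before aggregation.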

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

2606.17628v1 by Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

摘要：記憶已成為自我演化代理的標準基底，但保留經驗並不等同於學習如何通過這些經驗進行演化。現有的記憶代理可以儲存軌跡、檢索反思或累積技能，但往往缺乏選擇有用經驗、採取行動、撰寫可重用知識和維護不斷增長的資料庫的整體能力。我們介紹了 OPD-Evolver，一個緩慢-快速共同演化框架，通過政策內自我蒸餾培養這樣的代理演化者。在快速迴圈中，OPD-Evolver 與四層記憶層次結構互動，以快速讀取、使用、寫入和維護經驗，以便於快速測試時間的演化。在緩慢迴圈中，結果校準的記憶歸因和特權後見將這四種能力蒸餾成可部署的政策。在多領域基準測試中，OPD-Evolver 超越了記憶系統，如 ReasoningBank，達到高達 11.5% 的提升，以及基於訓練的方法，如 Skill0，約 5.8% 的提升。進一步分析顯示，OPD-Evolver 內化了高價值的經驗和記憶管理，使 OPD-Evolver-9B 能夠挑戰巨型對手，如 Qwen3.5-397B-A17B 和 Step-3.5-Flash，指向超越記憶增強代理的真正合格代理演化者。

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

2606.17591v1 by Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

摘要：訓練自由的口頭強化學習使得大型語言模型（LLM）代理能夠從世界反饋中學習——例如動態任務結果、市場回報或需求預測等客觀信號——通過從經驗中提取口頭規則並將其注入作為背景，更新代理的行為而無需改變參數。然而，在非穩定環境中，這些代理面臨著保留與遺忘的困境：保留過時的見解會導致負面轉移，而丟棄它們則會在條件重現時導致災難性遺忘。我們確定了四個應對這一困境的要求——以結果為驅動的評估、持續的結構證據、非單調的知識生命周期和組合治理——並顯示現有的方法在經驗提取上投入過多，而在見解治理上投入不足。我們提出了一個三層架構——規則、證據和技能——通過一個以反饋為驅動的策展循環連接，來填補治理的空白。規則捕捉來自世界結果的提煉經驗；證據日誌追蹤每個規則在不同情節中的可靠性；技能則管理應用哪些規則、如何解決衝突以及何時應該放棄。以金融預測為案例研究，在這裡世界反饋自然豐富、雜訊多且非穩定，我們顯示相同的累積經驗要麼使性能低於零樣本基準，要麼根據策展循環的存在顯著提高準確性和風險調整回報。

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

2606.17577v1 by Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.

摘要：AI 驅動的工程工作流程在碰撞安全設計方面面臨特定挑戰：與氣動力學不同，碰撞事件涉及高度非線性的接觸動力學、材料非線性以及難以用數據驅動的替代模型捕捉的離散狀態轉換。據我們所知，我們提出了首個基於基礎模型的碰撞安全設計協調工作流程，該流程使得在行人保護方面進行替代輔助探索，將每次 CAE 模擬的評估時間從幾小時縮短到幾秒鐘。該工作流程整合了四個組件：(1) 一個基於 CAE 碰撞模擬訓練的替代模型，用於從設計參數預測行人腿部受傷指標，達到平均 $R^2=0.87$ 並提供無分佈的符合預測區間；(2) 多目標進化搜索 (NSGA-II) 用於在用戶指定的約束下發現多樣的可行參數集；(3) 一個基於變形的幾何生成器，將參數映射到保持拓撲的 3D 形狀；以及 (4) 一個自然語言界面，其中 LLM 協調工作流程，視覺-語言模型支持生成設計的語義比較。在一個汽車前保險杠的案例研究中，該工作流程從單次探索中產生了 35 種不同的安全合規替代方案，這一過程在傳統 CAE 迭代中需要數週時間。這些結果表明，基礎模型可以作為機器學習替代模型與基於物理的模擬之間的整合層，幫助將 AI 能力引入安全關鍵的工程領域。
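The abstract mentions distribution-free conformal prediction intervals around the surrogate's outputs. One common way to obtain such intervals is split conformal prediction, sketched below. Whether the paper uses this exact variant is not stated, so treat it as an assumption; the function names and the calibration numbers are made up for illustration.

```python
import math

def conformal_halfwidth(calib_true, calib_pred, alpha=0.1):
    """Split conformal prediction: half-width q of the interval.

    Uses absolute residuals on a held-out calibration set and returns
    the ceil((n + 1) * (1 - alpha))-th smallest residual. If the
    calibration set is too small for the requested coverage, the
    interval is unbounded.
    """
    scores = sorted(abs(y - p) for y, p in zip(calib_true, calib_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return scores[k - 1]

def predict_interval(pred, q):
    """Symmetric interval around a new surrogate prediction."""
    return pred - q, pred + q

# Made-up calibration data: simulated metric vs. surrogate prediction.
simulated = [10, 20, 30, 40, 50, 60, 70, 80, 90]
predicted = [9, 18, 27, 36, 45, 54, 63, 72, 81]
q = conformal_halfwidth(simulated, predicted, alpha=0.5)
print(predict_interval(100, q))
```

Under exchangeability of calibration and test points, the interval covers the true value with probability at least `1 - alpha`, without any assumption about the error distribution of the surrogate.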

An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts

2606.17555v1 by Joseph Walusimbi, Joshua Benjamin Ssentongo

Banks simultaneously face signature-based fraud (card-not-present attacks, account takeover, ATM cloning) and behavioural financial crime (structuring, layering, mule networks, business email compromise) -- two threat families with fundamentally different detection requirements. Static rule engines that reliably catch brute-force and high-velocity events are structurally blind to business-email-compromise (BEC) payment redirection, session hijacking, and money-laundering layering, which are engineered to appear indistinguishable from legitimate activity at the individual transaction or session level. This paper presents an AI security agent for retail and corporate banking that addresses this gap through a three-component fusion architecture operating on two parallel event streams: a transaction stream (card fraud, ACH/wire fraud, AML categories) and a session stream (account takeover, session hijacking, SIM-swap, insider abuse). Each stream combines an LSTM sequence model capturing per-account behavioural history, a statistical velocity/threshold monitor, and a graph/network module capturing account-counterparty relationship patterns (fan-in, fan-out, pass-through ratio) for money-laundering detection. Experiments on a synthetic event log of 237,669 transactions and 113,508 sessions across 13 threat categories and 3,470 simulated accounts demonstrate overall F1 of 0.787 (transaction stream) and 0.867 (session stream) for the proposed model, versus 0.562/0.733 for a rule-based baseline and 0.655/0.713 for an LSTM-only baseline. The agent includes a customer-facing transaction-verification chatbot (96.6% identity verification accuracy, 86.8% mass-reset attack detection) and an analyst case-summary assistant (99.3% action-recommendation F1), with Critical-tier automated response latency under 0.43 ms at the 95th percentile.

摘要：銀行同時面臨基於簽名的詐騙（非持卡人攻擊、帳戶接管、ATM克隆）和行為金融犯罪（結構化、分層、騙子網絡、商業電子郵件妥協）——這兩類威脅具有根本不同的檢測需求。靜態規則引擎可靠地捕捉暴力破解和高速度事件，但在結構上對商業電子郵件妥協（BEC）支付重定向、會話劫持和洗錢分層等行為視而不見，這些行為被設計得在個別交易或會話層面上與合法活動無法區分。本文提出了一種針對零售和企業銀行的人工智慧安全代理，通過在兩個平行事件流上運行的三組件融合架構來填補這一空白：交易流（卡片詐騙、ACH/電匯詐騙、反洗錢類別）和會話流（帳戶接管、會話劫持、SIM交換、內部濫用）。每個流結合了一個捕捉每個帳戶行為歷史的LSTM序列模型、一個統計速度/閾值監控器，以及一個捕捉帳戶-對方關係模式（進入、退出、通過比例）的圖形/網絡模組，用於洗錢檢測。在237,669筆交易和113,508個會話的合成事件日誌上進行的實驗顯示，該模型的整體F1為0.787（交易流）和0.867（會話流），而基於規則的基線為0.562/0.733，僅LSTM基線為0.655/0.713。該代理包括一個面向客戶的交易驗證聊天機器人（96.6%的身份驗證準確率，86.8%的大規模重置攻擊檢測）和一個分析師案例摘要助手（99.3%的行動建議F1），其關鍵層級自動響應延遲在第95百分位數下低於0.43毫秒。
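The graph features named in this abstract (fan-in, fan-out, pass-through ratio) can be computed from a plain transaction list along the following lines. The abstract does not define the pass-through ratio, so the definition used here (the smaller of total inflow and total outflow divided by the larger) is an assumption, as are the function name and the tuple layout of a transaction.

```python
from collections import defaultdict

def account_flow_features(transactions):
    """Per-account counterparty features (illustrative definitions).

    transactions: iterable of (sender, receiver, amount).
    fan_in  = number of distinct accounts that paid this account
    fan_out = number of distinct accounts this account paid
    pass_through = min(inflow, outflow) / max(inflow, outflow), 0 if both are 0
    """
    payers = defaultdict(set)
    payees = defaultdict(set)
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for src, dst, amount in transactions:
        payees[src].add(dst)
        payers[dst].add(src)
        outflow[src] += amount
        inflow[dst] += amount

    features = {}
    for acct in set(inflow) | set(outflow):
        total_in, total_out = inflow[acct], outflow[acct]
        larger = max(total_in, total_out)
        ratio = min(total_in, total_out) / larger if larger > 0 else 0.0
        features[acct] = {
            "fan_in": len(payers[acct]),
            "fan_out": len(payees[acct]),
            "pass_through": ratio,
        }
    return features

# Two sources pay a middle account that forwards everything onward.
txs = [("a", "m", 100.0), ("b", "m", 100.0), ("m", "x", 200.0)]
print(account_flow_features(txs)["m"])
```

An account with high fan-in and a pass-through ratio close to 1 forwards nearly everything it receives, which is the kind of pattern a layering or mule-network detector would want to surface.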

FoundCause: Causal Discovery with Latent Confounders from Observational Data

2606.17516v1 by Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.

摘要：因果發現從觀察數據中仍然具有挑戰性，因為需要在沒有干預的情況下恢復有向結構和潛在的混淆。我們提出了FoundCause，這是一種完全基於合成數據訓練的攤銷因果發現模型，能夠在單次前向傳遞中將數據集直接映射到因果圖。通過從大量模擬結構因果模型中學習，FoundCause 捕捉可轉移的統計模式，這些模式超越了單個數據集的範疇。該架構結合了幾個關鍵的歸納偏見以進行因果發現。它使用了一種對置換不變的Transformer編碼器，並在樣本和變數之間交替注意，以共同建模跨變數依賴和每個變數的分佈。通過統計條件注意注入的成對統計特徵源自於經典的不對稱性測量，指導模型朝向已知的因果信號。一個因子化解碼器將邊的存在與方向分開，而一個三角形細化模塊則使得對於更高階因果圖形（如鏈和碰撞器）進行推理成為可能。此外，基於可學習潛在標記的專用混淆模塊明確建模隱藏的共同原因，並且模型通過其掩蔽輸入表示明確處理缺失數據。據我們所知，FoundCause 是第一個明確建模潛在混淆的攤銷因果發現方法。FoundCause 在 15 個真實世界數據集上超越了 11 種經典的非攤銷方法（例如，PC、GES、NOTEARS 風格優化）和 4 種攤銷因果發現方法，實現了 $F_1$ 提升 +9.6%、AUROC 提升 +1.2%，以及相對於最強的非攤銷方法結構漢明距離減少 18.9%，同時在單次前向傳遞中進行推理。
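One of the reported metrics, structural Hamming distance (SHD), counts how many edge edits separate a predicted graph from the true one. Conventions differ between papers; the sketch below assumes directed acyclic graphs without self-loops and counts a reversed edge as a single error, which may not match the exact convention used in the evaluation above.

```python
def structural_hamming_distance(true_edges, pred_edges):
    """SHD between two directed graphs given as collections of (u, v) edges.

    Counts one error for each edge that is missing, extra, or reversed.
    Assumes no self-loops and at most one direction per node pair.
    """
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    true_skeleton = {frozenset(e) for e in true_edges}
    pred_skeleton = {frozenset(e) for e in pred_edges}

    # Node pairs connected in exactly one of the two graphs.
    missing_or_extra = len(true_skeleton ^ pred_skeleton)
    # Node pairs connected in both graphs but with opposite direction.
    reversed_count = sum(
        1
        for u, v in true_edges
        if frozenset((u, v)) in pred_skeleton and (u, v) not in pred_edges
    )
    return missing_or_extra + reversed_count

truth = {("A", "B"), ("B", "C")}
guess = {("B", "A"), ("B", "C"), ("A", "C")}
print(structural_hamming_distance(truth, guess))
```

In the example, the predicted graph has one reversed edge (between A and B) and one extra edge (A to C), so lower values mean a closer structural match.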

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

2606.17506v1 by Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

摘要：社會偏見在大型語言模型（LLMs）中的評估主要集中在模型是否生成或暗示偏見內容。然而，隨著 LLMs 越來越多地被用作偏見的評判者，它們可能在評估偏見內容的方式上以更微妙的方式表現出社會偏見，而目前的方法並未系統性地捕捉到這一點。我們稱之為二階偏見：LLM 對社會偏見的判斷中的社會偏見，我們通過一項新穎的、以哲學為基礎的推理任務來進行評估。基於權利認識論，我們將偏見概念化為錯位的基礎知識，這種知識塑造了代理人的理性探究，並推導出一項邏輯推理任務，讓 LLM 判斷誰對偏見文本是可接受或不可接受的。我們開發了兩個簡單的指標來衡量 LLM 評判者在推斷可接受性的人口統計時的偏見程度，這些推斷在缺乏充分支持的情況下是如何變化的，以及這些推斷在受到偏見文本針對的群體之間的變化。評估開放和封閉模型時，我們發現我們的任務通過顯示模型判斷中的偏見而逃避了安全防護措施。它在目標群體之間系統性地變化，反映了隱含的社會地圖，並顯示模型仍然會受到人口標籤的觸發。我們的工作指出了在判斷任務中對 LLM 偏見評估的必要性，以及在自然語言處理中對偏見評估的更理論基礎的方法的廣泛需求。我們在 https://github.com/uofthcdslab/second-order-bias 上發布了我們的代碼和模型響應。

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

2606.17474v1 by Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

摘要：大型語言模型（LLMs）越來越被考慮用於臨床諮詢任務，然而大多數醫療評估仍然是靜態的、單回合的或狹隘的結果導向，限制了它們反映現實世界護理的連續性、不確定性和互動性的能力。在此，我們提出了AIPatient Arena，一個基於電子健康紀錄（EHRs）的評估框架，用於評估LLMs在八個臨床能力維度上的臨床效用。該框架將EHR數據整合到患者特定的知識圖譜中，使多回合的醫生-患者互動成為可能。我們在一個由437名患者組成的主要隊列以及兩個分佈外的驗證隊列（119名和67名患者）上應用AIPatient Arena。我們觀察到LLMs在醫療面試提問技能（QS；平均分數，4.43-4.99/5）、倫理和專業行為（ET；4.38-4.93/5）以及臨床解釋的清晰性和透明性（EX；3.80-4.72/5）方面表現良好。在信息整合（II；3.19-4.21/5）和藥物安全性與合理性（MS；3.13-3.78/5）方面表現中等，但在處理模糊患者反應（HR；2.57-3.32/5）、信息覆蓋（IC；2.08-3.02/5）以及診斷準確性和推理（Dx；2.63-3.55/5）方面持續存在弱點。基於過程的評估揭示了重複的互動失敗，包括重複提問、遺漏過去病史以及對不確定性的處理不足。更豐富的對話上下文改善了診斷推理，但在治療計劃方面的增益有限。這些發現表明，僅僅依賴最終答案的準確性不足以評估臨床準備情況，並突顯了評估模型在諮詢過程中如何收集、解釋和傳達信息的重要性。AIPatient Arena提供了一個基於EHR的框架，用於針對工作流程的醫療LLMs預部署評估。

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

2606.17459v1 by Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

摘要：評估大型語言模型（LLMs）的決策能力正成為一項日益重要的研究優先事項，但現有的基準主要集中在孤立的認知任務上，例如推理、知識檢索和在典型情境中的經濟理性。這些評估忽略了真實執行決策的定義挑戰：在信息不對稱、組織約束和時間依賴的情況下，整合來自專業利益相關者的相互矛盾建議。我們介紹了 \textsc{CEO-Bench}，這是一個多代理基準，評估 LLM 在 CEO 級別的戰略資源重新分配上的表現——在一個多輪、約束豐富的組織環境中重新指引資本的過程。在 \textsc{CEO-Bench} 中，LLM 代理從四位角色條件的高管顧問（CFO、CTO、COO、CMO）那裡接收相互矛盾的建議，每位顧問都有私有信號和不同的優先事項，並必須將這些建議綜合成一個具體的分配計劃，該計劃將在四個維度上進行評估：角色整合、條件勇敢、歷史敏感判斷和計劃有效性。對五個前沿模型在 13 個場景中的實驗顯示，所有模型都達到了高結構有效性，但在戰略校準上卻有明顯的分歧——這是最難的能力層級。我們識別了系統性的失敗模式，包括單一顧問捕獲、在模糊情況下的保守默認和歷史健忘，並發現了一個結構整合-勇敢的權衡：更深入地參與相互矛盾觀點的模型往往會產生不那麼果斷的行動。這些發現描繪了 LLM 作為組織決策者的當前能力邊界，並為未來 AI 輔助的高管系統的設計提供了信息。

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

2606.17437v1 by Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

摘要：自動化的標準超聲心動圖視圖分類對於高效的臨床工作流程至關重要，但面臨三個主要挑戰。首先，公開可用的數據集稀缺，且在規模和視圖覆蓋方面有限。其次，一些現代視頻級架構在超聲心動圖視圖分類中的性能仍未被充分探索。第三，一些視圖類別表現出高度相似的空間外觀，使得單幀特徵不足以進行區分，而異質幀質量則使得穩健的時間信息融合變得複雜。為了解決這些挑戰，我們發布了九個視圖的超聲心動圖視頻（EV9V）數據集，包含5,138個視頻、910,579幀和9個標準視圖，據我們所知，這是目前最大的公開可用超聲心動圖視頻數據集。使用EV9V，我們系統性地基準測試了代表性的視頻分類架構，包括卷積神經網絡（CNNs）、遞歸神經網絡（RNNs）和Transformer。此外，我們提出了一個時空融合模型（STFM），這是一個高效的雙流CNN-LSTM（長短期記憶）框架，能夠共同捕捉空間解剖結構和時間心臟動力學。所提出的框架利用不確定性感知學習，在訓練期間優先抽樣代表性視頻片段，並在推斷期間進行基於證據的融合，從而提高對超聲心動圖視頻中幀質量變化的穩健性。大量實驗表明，我們的方法在各種視頻分類模型中達到了競爭性能，驗證了不確定性感知時空學習在超聲心動圖視圖分類中的有效性。代碼可在 https://github.com/bgx666/stfm 獲得。

SoK: AI-Augmented Binary Reversing

2606.17398v1 by Yujeong Kwon, Yiyue Zhang, Shakhzod Yuldoshkhujaev, Kexin Pei, Dokyung Song, Hyungjoon Koo

Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field's evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems.

摘要：二進位反向工程對於軟體理解、漏洞發現、惡意程式調查和韌體審計是基本的。然而，由於編譯過程中語義資訊的不可逆損失，它仍然固有地具有挑戰性。最近在機器學習、大型語言模型（LLMs）和自主人工智慧系統方面的進展，加速了AI增強的二進位反向工程的採用。然而，隨之而來的研究成果在反向工程領域、工件表示、學習方法和評估實踐上變得越來越支離破碎。本文呈現了對AI增強的二進位反向工程的首個全面知識系統化。我們分析了自2015年以來發表的144篇研究論文，並根據推理任務將其組織為22個二進位反向工程領域。我們進一步介紹了一個統一的分類法，涵蓋傳統和AI增強的反向工程流程。我們的分類法連接了傳統分析技術、二進位衍生工件、表示策略、學習範式和下游推理任務，同時澄清了LLMs和自主人工智慧系統的新興角色。通過建立共同的詞彙和結構化框架，我們提供了對過去十年該領域演變的整體觀點。我們的研究揭示了看似不同的方法之間的共同結構，突顯了持續存在的技術挑戰和評估空白，並識別了未來研究的有前景機會。總體而言，這些見解澄清了該領域的當前狀態，並為下一代可靠且可擴展的AI增強二進位反向工程系統提供了基礎。

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

2606.17379v1 by Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Accurate intraoperative liver registration is challenging because soft-tissue deformation is substantial while intraoperative measurements are sparse. Biomechanical models regularize this ill-posed problem with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we treat sparse intraoperative measurements as context samples, for which input-output pairs of the residual deformation function are fully observed, and cast the problem as learning to learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

摘要：準確的術中肝臟註冊因為顯著的軟組織變形以及稀疏的術中測量而具有挑戰性。生物力學模型利用先驗知識來正則化這種不適定性，但由於簡化假設而表現出持續的預測偏差，而基於數據的學習解決方案在數據效率、泛化性和物理合理性方面面臨挑戰。我們提出了一個混合註冊框架，利用稀疏的術中對應關係來調整生物力學先驗。與其學習完整的變形場，我們學習一個殘差變形函數，該函數修正線性生物力學預測，並建模為具有幾何感知注意力的圖神經擴散函數，作用於3D肝臟網格。為了實現稀疏觀測的長距離信息傳遞，我們以新穎的視角看待稀疏術中測量，將其視為\textit{上下文}樣本，其中殘差變形函數的輸入-輸出對完全可觀察，將問題轉化為從術中上下文樣本中學習這個殘差函數的學習過程，使用前饋元學習器。在可變形肝臟幻影數據集上的實驗顯示，與剛性、生物力學和基於數據的基準相比，註冊準確性和泛化性得到了改善，特別是在分佈外幾何形狀和變形的情況下。

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

2606.17328v1 by Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.

摘要：LLM 代理越來越能夠在會話之間維持用戶事實的長期記憶。然而，這種記憶通常是通過對問題行或情節的準確性進行聚合來評估的。因為這種方法獨立地評分問題行，即使幾個問題探查同一事實，它也無法顯示該事實在條件變化時的行為。我們引入了 MemTrace，一個基準，其測量單位是知識點：關於用戶的單一鍵入事實，而不是單獨的問題。MemTrace 沿著三個受控維度探查每個事實：記憶年齡，定義為該事實出現在歷史中的會話數；問題類型，涵蓋當前狀態、早期狀態和變化的軌跡；以及證據條件，涵蓋當前、缺失和被虛假前提矛盾的設置。在四個範式中評估 13 種記憶系統配置後，我們發現相似的聚合準確性隱藏了不同的失敗：恢復事實的當前和早期狀態並不意味著追蹤其變化，而安全的放棄並不意味著糾正虛假前提。主要的瓶頸是證據使用，而不是檢索：當系統失敗時，證據的可檢索性比缺失的情況多出 10 倍。這些結果表明，改善長期記憶需要更好地利用可達證據，而不僅僅是增加存儲或檢索。

Nothing from Something: Can a Language Model Discover 0?

2606.17289v1 by Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) GPT-2-sized language models are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

摘要：AI 系統基於人工神經網絡的發展旨在推動人類數學知識的邊界。這些系統的一個關鍵問題是它們能在多大程度上超越其訓練數據。數學發現需要一種強形式的分佈外泛化能力；即假設真正新穎的 - 並且可能在邏輯上更強大的 - 數學結構的能力。有人假設語言能力支持人類認知中的這種泛化。在這項工作中，我們使用簡單的算術作為案例研究，以檢驗現代 AI 模型如何擴展其數學視野，評估這些模型是否能獨立發現「零」的概念。我們顯示 (1) 大小為 GPT-2 的語言模型在測試時無法進行這種泛化，無論語言預訓練如何，但 (2) 模型在訓練數十或數百個零的例子後可以顯著改善。此外，我們發現語言預訓練將所需的例子數量減少了約 $50\%$，顯示語言能力可以支持神經模型中的數學發現。

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

2606.17257v1 by Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.

摘要：開放權重的視頻擴散模型可以生成逼真的不安全內容，從暴力到錯誤信息，但現有的防禦措施要麼需要昂貴的安全微調，這會降低一般能力，要麼應用外部過濾器，這些過濾器可以輕易被對抗性提示繞過。我們提出了REINS（REpresentation-space INference-time Safety steering），這是一種無需訓練的方法，通過在推理時引導其內部表示朝向安全生成來對齊視頻擴散模型。我們的關鍵發現是，安全相關的結構在線性編碼於視頻擴散Transformer的隱藏狀態激活中，並且通過對二元安全標籤進行監督式主成分分析發現的單一方向足以將安全和不安全的生成軌跡分開。在推理時，將這一方向添加到中間Transformer層的隱藏狀態中，可以將生成從有害內容重定向到語義相關的安全替代品，無需權重更新、無需概念枚舉，且計算開銷微不足道。通過機械分析，我們揭示了雖然安全信息隨著Transformer深度單調累積，但引導效果在中間層達到峰值（約50%深度），暴露了信息可用性與下游傳播能力之間的基本權衡。我們在9個視頻擴散模型、多個參數範圍（1.3B-5B）以及文本到視頻和圖像到視頻生成中評估了REINS，據我們所知，這是視頻生成文獻中最廣泛的安全評估套件。

Rift: A Conflict Signature for Deception in Language Models

2606.17229v1 by Petr Nyoma

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

摘要：一個在知道真相的情況下撒謊的模型是ELK無法僅通過行為評估處理的核心案例。我們詢問這種欺騙是否留下了內部特徵，使其與誠實錯誤區分開來。我們的關鍵舉措是對錯誤的控制：我們將一個睡眠特工（知道真相，在觸發時撒謊）與一個天真的撒謊者（經過微調以發出相同的錯誤答案，沒有誠實的訓練）進行對比。兩者產生相同的錯誤輸出；任何差異都是關於知識衝突，而不是不正確性。我們發現欺騙性的前向傳遞具有衝突特徵——相較於天真的撒謊者在相同錯誤答案上的傳遞，其殘餘排名高出2.1-2.3倍——足夠強大以100%準確率識別哪兩個回應是謊言，且無需標籤，涵蓋GPT-2小型/中型（三個種子）和三個指令模型。在Qwen2.5-1.5B/7B和Phi-3-mini上，指令欺騙在每個測試的事實上提高了殘餘排名（18/18，40/40，34/34）；在Phi-3上，謊言與誠實答案和幻覺完全分開（AUC 1.0，Wilcoxon p~6e-11）。這一特徵能夠抵抗戰略性自我構建的欺騙（模型自創謊言，AUC 1.0）、主動隱瞞嘗試（AUC 1.0）以及長度控制的複製（20/20，AUC 1.0，p~1e-6）。使用無基礎相對表示法，對一個模型家族訓練的探針能夠在零樣本情況下檢測到兩個其他家族的欺騙（平均AUC 0.933），並能夠在同時架構和格式變更中存活（AUC 0.821），並在五種語言中轉移（AUC 1.000，長度控制）。這一特徵是只讀的：可檢測但不可注入（0/8雙向）。誠實的限制和六個負面實驗已完整記錄。

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

2606.17220v1 by Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

摘要：法律案件檢索仍然充滿挑戰，因為法律語言的複雜性以及查詢與相關案件之間需要精確的詞彙對齊。儘管密集檢索模型已取得顯著進展，但實證研究顯示，BM25 在這個領域仍然是強有力的基準。這促使我們提出一個自我演化的框架，用於基於規則的查詢重寫，該框架在不進行任何參數訓練的情況下增強了 BM25。該框架為基於 LLM 的代理提供了一個自動評估環境，使其能夠迭代地創建重寫規則，規劃規則組合的驗證實驗，並根據歷史反饋消除無效規則。我們在中國法律案件檢索基準 LeCaRD-v2 上評估我們的方法。實驗結果顯示，所提出的框架在性能上超越了非演化基準，包括人類設計的規則和貪婪規則選擇，特別是在高容量核心 LLM 的支持下。我們還進行了詳細分析，以研究自我演化的機制。我們的發現揭示了 LLM 利用先前實驗結果的能力以及其對規則消除的內在知識在通過自我演化精煉規則集方面發揮了關鍵作用。

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

2606.18293v1 by Callum Barbour

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed 'vibe coding.' It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

摘要：由於生成性人工智慧的快速發展，我們正處於一場可能永遠改變我們與電腦互動方式的範式轉變之中。我們觀察到使用自然語言提示來構建應用程序和編碼基礎設施的增長，而無需對該領域有深入了解，這種做法被稱為 vibe coding。可以說，這代表了編程領域自始至今所追求的目標，隨著每一個新概念的抽象層次提高。Vibe coding 在輸入方法上承諾成為高級編程的終點：完全消除人類對代碼語法的使用，轉而使用母語進行編程。本文旨在評估 vibe coding 在綠地軟體工程任務中的可行性，以及分析用於衡量其軟體工程能力的基準。為此，我們開發了一個評估套件，用於分析 LLM 在執行 Python 中簡單、孤立的綠地編程任務的能力，以提供對此問題的具體見解。

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

2606.17203v1 by Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib

Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.

摘要：多代理人工智慧系統越來越多地用於自動化軟體工程任務，包括需求分析、架構設計、測試生成和可追溯性鏈接。當這些代理作為一個順序管道在共享軟體工件上運作時，上游代理所做的錯誤和低信心決策會傳播到下游階段，產生孤立的需求、矛盾的鏈接和合規差距，這在安全關鍵領域中構成了重大風險。我們提出了一個信任感知的協調框架，其中共享知識圖譜既作為集中式語義記憶，又作為協調表面，通過它，代理可以使用經過校準的信心分數來評估和建立彼此的貢獻。我們的方法引入了一個兩階段的可追溯性鏈接預測管道，結合了基於嵌入的檢索與基於大型語言模型的多標準分析、一個可追溯性播種機制，該機制使得可以比較推導時間和驗證時間的信心，以及一個一致性協議，通過信心閾值門控、信心分歧檢測和衝突解決來管理管道交互。我們在一個汽車軟體工程案例研究中進行評估，測量鏈接預測的校準、協議的有效性、閾值的敏感性以及可追溯性播種的影響。消融研究確認信心校準對於有效的管道協調至關重要。

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

2606.17188v2 by Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.

摘要：目前對於視覺語言模型（VLMs）的多語言評估假設語言與正字法之間存在一對一的映射，忽視了數十億使用多種文字語言的用戶。我們引入了PuMVR（旁遮普多模態視覺推理），這是一個包含1,000個嚴格平行的圖像-文本實例的基準，涵蓋旁遮普的三種活躍文字：古爾穆基、沙穆基和羅馬字。通過評估10個最先進的VLM，我們揭示了一個顯著且系統性的文字差距。模型經常在一種文字中解決視覺任務，而在另一種文字中對相同任務失敗，準確率差異高達16%。關鍵是，視覺輸入均勻地提升了絕對性能，但並未縮小正字法差距。此外，跨文字的上下文轉移非常脆弱，暴露了被文字鎖定的知識表徵。通過對所有文字對進行的McNemar測試，我們的發現表明，目前的“多語言”VLM並不是真正的多文字。我們提出了文字一致性率（SCR），在我們的基準上低至24.8%，作為無文字偏見評估的必要指標，以確保公平的AI訪問。數據和代碼可在以下網址獲得：https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR。

RepSelect: Robust LLM Unlearning via Representation Selectivity

2606.17168v1 by Filip Sondej, Yushi Yang, Adam Mahdi

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

摘要：使大型語言模型（LLMs）深度忘記特定知識和價值觀，而不犧牲一般能力，仍然是去學習中的一個核心挑戰。然而，當前的方法很容易被微調或少量提示逆轉，這表明它們的遺忘只是淺層的。我們確定了根本原因。現有的方法針對與保留集和微調攻擊者恢復的子空間共享的表示，這使得去學習對一般能力造成了干擾，並且容易被逆轉。我們提出了RepSelect（表示選擇性），通過在每次更新之前壓縮權重梯度的前幾個主成分來隔離忘記集特定的表示，從而保持一般能力不變，同時限制微調可以恢復的內容。我們在兩個忘記類別（生物危害知識和虐待傾向）和四個模型系列（包括密集和專家混合架構的Llama 3、Qwen 3.5、Gemma 4 E4B、DeepSeek V2 Lite）中進行評估。與五個流行的基準（GradDiff、NPO、SimNPO、RMU、UNDIAL）相比，RepSelect在後學習答案準確性上實現了比最強基準高出4-50倍的減少，並且對少量提示攻擊幾乎完美穩健。因此，針對選擇性表示是一個邁向深度和穩健的LLM遺忘的重要步驟。

A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

2606.16944v1 by Nikolos Gurney

Theory of mind (ToM), the capacity to ascribe mental states to others and use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address how to mentalize, but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model formalized as a directed acyclic graph (DAG), treating ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node producing engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes across social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.

摘要：心智理論（ToM），即將心理狀態歸因於他人並利用這些歸因進行預測和推理的能力，被廣泛認為對於有效的人機整合至關重要。現有的AI-ToM模型主要探討\emph{如何}進行心智化，但對於何時進行心智化的問題則大多未予以解答。核心問題是：在什麼情境和代理層級的條件下，心智理論的參與在衝突中是因果上合理的？本文提出了一個結構性因果模型，形式化為一個有向無循環圖（DAG），將心智理論視為一種由情境和代理層級條件激活的機制，而非一種始終開啟的能力。該模型指定了四個外生變數，捕捉情境和代理層級的條件，五個內生中介變數，以及一個機械性心智理論節點，通過三條不同的因果途徑產生參與狀態：可處理性途徑、推理深度途徑和促成原因途徑。主要結果是認識準確性，這將社會推理與行為政策解耦，並在衝突以外的社會現象中進行概括。該框架為AI系統提供了一種原則性、資源理性的決策程序，用於心智化，對效率、信任以及穩健的人工社會智能的發展具有重要意義。文中討論了模擬驗證、實證人機協作研究以及由於衝突優化心智化而產生的倫理考量。

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

2606.16925v1 by Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar

Time-series foundation models show strong transfer performance when given a non-empty history window. However, true cold-start scenarios, where a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion), a framework that replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID maps textual metadata into a shared semantic space using a frozen multilingual embedding model and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines this forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.

摘要：時間序列基礎模型在給定非空歷史窗口時顯示出強大的轉移性能。然而，真正的冷啟動場景，即新項目沒有先前觀察，違反了這一假設。我們提出了RAID（檢索增強迭代擴散）框架，該框架用元數據驅動的語義檢索和圖條件擴散替代基於歷史的相關性學習。RAID使用凍結的多語言嵌入模型將文本元數據映射到共享語義空間，並構建一個自然擴展到未見項目的歸納檢索圖。它首先通過聚合語義相關鄰居的信息來形成基礎預測，然後用門控擴散模塊來細化這一預測，以建模殘餘不確定性。在嚴格的真正冷啟動協議下，RAID在預測準確性和預測區間覆蓋率上超越了強大的基礎模型和競爭基準，同時通過非自回歸解碼將推理延遲降低了一個量級。共享語義空間還使零樣本跨語言轉移成為可能，允許在英語描述上訓練的模型在沒有直接監督的情況下對用其他語言描述的項目進行泛化。
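
The RAID abstract above describes a base forecast formed by aggregating information from semantically related neighbors. The abstract does not give the aggregation rule, so the following is only a minimal sketch under stated assumptions: cosine similarity over metadata embeddings, a top-k neighbor cut, and a similarity-weighted mean. The names `cosine`, `base_forecast`, and the parameter `k` are hypothetical, and the gated diffusion refinement is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def base_forecast(new_embedding, catalog, k=2):
    """Cold-start base forecast for an item that has metadata but no history.

    catalog: list of (embedding, series) pairs for items that do have history;
             all series are assumed to share the same forecast horizon.
    Returns the similarity-weighted mean of the k most similar neighbors' series.
    This weighting scheme is an assumption, not the paper's stated rule.
    """
    scored = sorted(
        ((cosine(new_embedding, emb), series) for emb, series in catalog),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Negative similarities are clipped so dissimilar items never get weight.
    weights = [max(sim, 0.0) for sim, _ in scored]
    total = sum(weights)
    if total == 0:
        # No positively similar neighbor: fall back to a plain average.
        weights = [1.0] * len(scored)
        total = float(len(scored))
    horizon = len(scored[0][1])
    return [
        sum(w * series[t] for w, (_, series) in zip(weights, scored)) / total
        for t in range(horizon)
    ]
```

Because the forecast depends only on metadata embeddings, the same function applies to items described in any language the embedding model covers, which is the property the abstract relies on for cross-lingual transfer.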

LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

2606.16908v1 by Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen-Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.

摘要：擴散大型語言模型（dLLMs）提供了一種有前景的替代自回歸解碼的方法，通過迭代地精煉遮蔽序列，實現並行的標記更新和雙向條件化。然後，它們的實際效率受到限制，因為取樣程序在解碼之前執行固定數量的反向去噪步驟，將計算花費在已經穩定的位置上，有時也會過早地承諾不穩定的位置。我們提出了\textsc{LESS}，一種無需訓練的模型無關自適應取樣器，將標記承諾視為一個在線停止問題。\textsc{LESS}通過一個聯合穩定性規則實現互穩定取樣，該規則僅在其 top-1 預測具有高信心、其 top-1 標記在最近的反向步驟中持續存在以及其預測分佈在 top-$K$ 交互步驟的詹森-香農散度下穩定時，使得遮蔽位置有資格被解遮蔽。我們在 Dream-7B、LLaDA-8B 和 LLaDA-1.5-8B 上評估\textsc{LESS}，涵蓋全序列擴散和半自回歸區塊取樣範疇，涉及七個基準，涵蓋一般知識、數學和代碼。\textsc{LESS}在使用$72.1\%$ 更少的反向步驟的情況下，提高了強大的無訓練自適應取樣器的平均準確性。由於每個反向步驟都需要一次 Transformer 前向傳遞，這些步驟數量的減少轉化為更少的前向評估、更低的實際牆鐘延遲和更低的估計推理計算。
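
The joint stability rule in the LESS abstract above has three conditions: high top-1 confidence, a persistent top-1 token, and low top-K inter-step Jensen-Shannon divergence. The sketch below shows one way those three checks could be combined for a single masked position. It is not the authors' implementation: the thresholds (`conf_min`, `persist`, `k`, `jsd_max`), the renormalization over the current top-K tokens, and all function names are assumptions made for illustration.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def restrict_and_renormalize(dist, indices):
    """Keep only the given token indices and rescale so the result sums to 1."""
    mass = [dist[i] for i in indices]
    total = sum(mass)
    if total == 0:
        return [1.0 / len(mass)] * len(mass)
    return [m / total for m in mass]


def eligible_for_unmasking(history, conf_min=0.9, persist=3, k=5, jsd_max=0.01):
    """Decide whether one masked position may be committed.

    history: per-step predictive distributions for this position, oldest first.
    persist: number of recent steps inspected (must be at least 2).
    All threshold values here are hypothetical defaults.
    """
    if len(history) < persist:
        return False
    recent = history[-persist:]
    top1 = [max(range(len(d)), key=d.__getitem__) for d in recent]
    # Condition 1: the latest top-1 prediction is confident.
    if recent[-1][top1[-1]] < conf_min:
        return False
    # Condition 2: the same top-1 token held across all recent steps.
    if len(set(top1)) != 1:
        return False
    # Condition 3: the top-K distribution barely moved between the last two steps.
    previous, current = recent[-2], recent[-1]
    top_k = sorted(range(len(current)), key=current.__getitem__, reverse=True)[:k]
    divergence = js_divergence(
        restrict_and_renormalize(previous, top_k),
        restrict_and_renormalize(current, top_k),
    )
    return divergence <= jsd_max
```

A sampler built on this check would call it per masked position after each reverse step and stop once no masked positions remain, which is where the step savings described in the abstract would come from.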

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

2606.16811v1 by Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi

In the development of large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. However, they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.

摘要：大型語言模型（LLMs）的發展中，最近生成偽中介推理的方法顯示出顯著的進展。但這些方法通常依賴大量正確標註的答案來評估推理質量。本文提出了一個半監督框架，從最小的監督中擴展推理學習，將推理驗證本身轉變為一種數據創建機制。我們在僅有幾個標註樣本的情況下訓練了一個輕量級的推理正確性分類器，用以判斷LLM生成的中介推理痕跡是否有效。此外，基於熵的置信度閾值過濾掉不可靠的樣本，剩下的高置信度推理痕跡用於微調模型。在可驗證數學問題（Orca-Math子集）和圖像場景圖的問題回答（GQA）與視覺編程的實驗中，我們的方法達到了與使用10-15倍更多標註數據相當的準確性。消融分析確認分類器和熵過濾對於可擴展和抗噪聲的偽標註至關重要。通過用輕量級的推理驗證取代昂貴的答案級監督，我們的方法提供了一條實用的路徑，用於構建大規模的推理資源，並為未來從最小人類輸入學習的自主推理系統鋪平道路。
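
The abstract above combines a reasoning-correctness classifier with an entropy-based confidence threshold to pick pseudo-labels. As a rough sketch of that filtering step, assuming the classifier emits a single probability that a trace is valid and that binary entropy in bits is the confidence measure: the cutoff `max_entropy=0.5` and the function names are hypothetical, and the paper may define the threshold differently.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p=0 or p=1, 1 at p=0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_pseudo_labels(traces, max_entropy=0.5):
    """Keep reasoning traces the verifier judges valid and is confident about.

    traces: iterable of (trace_text, p_valid), where p_valid is the classifier's
            probability that the trace is valid.
    A trace is kept only if p_valid >= 0.5 (judged valid) and the prediction's
    entropy is at most max_entropy (judged confidently).
    """
    kept = []
    for text, p_valid in traces:
        if p_valid >= 0.5 and binary_entropy(p_valid) <= max_entropy:
            kept.append(text)
    return kept
```

The two conditions are separate on purpose: a trace scored at 0.02 has low entropy but is confidently invalid, so an entropy check alone would wrongly admit it.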

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

2606.16774v1 by Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang

Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs a structured, diverse, and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify, and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and to avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use, and generalization over challenging benchmarks.

摘要：為大型語言模型（LLM）代理配備有效技能對於解決像 OpenClaw 這樣的現實系統中的複雜任務至關重要。在這項工作中，我們旨在開發一個框架，自動構建可重用的技能，以增強 LLM 在工具使用、多步推理和動態環境互動方面的能力。為此，我們提出了集體技能樹搜尋（CSTS），這是一個基於樹搜尋的創新技能構建框架，構建結構化、多樣化和可泛化的技能樹。CSTS 的核心思想是利用集體智慧通過兩個迭代階段共同搜尋、識別和組合有效技能：集體技能節點生成（CSN-Gen）和集體技能節點評估（CSN-Assess）。CSN-Gen 利用來自多個模型的集體知識探索每個子任務的多樣候選技能，從而實現全面的技能探索。CSN-Assess 則利用多個模型作為評審，通過兩種評分機制來評估和選擇技能節點：（1）集體質量評分，聚合獨立評估以產生技能有效性的穩健估計，以及（2）集體可轉移性評分，明確驗證一項技能是否能在不同模型之間良好泛化。通過 CSTS，我們構建了一組全面的技能樹以及增強技能的訓練數據，使模型能夠有效地學習和利用技能。此外，我們引入了集體技能強化學習，主動從樹中選擇多個相關技能，以擴展解決方案空間的探索，避免被單一技能及其導致的同質或次優解所困住。因此，我們訓練的模型 OpenClaw-Skill 在長期規劃、工具使用和在挑戰性基準上的泛化能力上展現了卓越的代理能力。

Misinformation Propagation in Benign Multi-Agent Systems

2606.16710v1 by Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

摘要：多代理系統（其中多個大型語言模型代理通過回合制互動解決問題）越來越多地應用於高風險環境，如醫療診斷、法律分析和法醫決策。當單一代理從不正確或誤導性的上下文中推理時，例如來自工具調用，其可靠性可能會面臨風險，因為錯誤可能會通過代理互動傳播。本文通過在無害的單一代理和多代理系統中注入基於意圖的錯誤信息，研究這一風險，涵蓋推理、知識和對齊任務。我們發現，錯誤信息會降低單一代理的表現，並在多代理辯論中持續存在，代理通常會保留被誤導的同伴引入的答案。儘管如此，與單一代理提示相比，多代理辯論減少了由此產生的表現下降，尤其是在大多數代理未接觸錯誤信息的情況下。魯棒性取決於群體組成和決策協議。在同伴壓力下，達成共識可能比投票更穩定，而多數派通常可以將被誤導的代理引導回正確答案。我們的結果表明，多代理系統中的錯誤信息魯棒性依賴於基礎模型，以及代理如何交換信息和聚合決策。

User as Code: Executable Memory for Personalized Agents

2606.16707v1 by Bojie Li

A personalized AI agent needs a user memory: a persistent model of who the user is, built across many conversations and consulted on each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat store of facts, and consulted by retrieval -- fetching the entries most similar to the current request. Such "bag-of-facts" memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it struggles to resolve contradictions, aggregate over many records, or enforce rules. We argue that user memory should instead be executable. We introduce User as Code (UaC), a paradigm in which an agent's model of a user is a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules that govern it, so representing and reasoning about the user happen in one medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code. This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage emerges where representation matters most. On aggregate questions over a user's history -- "how many international trips did I take last year?" -- retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can surface unsolicited, safety-critical alerts -- such as a newly prescribed drug that conflicts with an allergy recorded months earlier -- a capability query-driven memory cannot provide.

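Summary: A personalized AI agent needs a user memory: a persistent model of who the user is, built up over many conversations and consulted in each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat fact store, and consulted by retrieval, that is, by fetching the entries most similar to the current request. Such "bag-of-facts" memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it has difficulty resolving contradictions, aggregating over many records, or enforcing rules. We argue that user memory should be executable. We introduce User as Code (UaC), a paradigm in which the agent's model of the user is a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules governing that state, so representing and reasoning about the user happen in a medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code.
This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both the full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage appears where representation matters most. On aggregate questions over a user's history, such as "how many international trips did I take last year?", retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can raise unsolicited, safety-critical alerts, for example a newly prescribed drug that conflicts with an allergy recorded months earlier, a capability that query-driven memory cannot provide.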
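
The idea of holding user state in typed objects and encoding rules as ordinary functions can be illustrated with a minimal sketch. Everything below is hypothetical and is not taken from the UaC implementation: the class names (`Trip`, `Prescription`, `UserState`), the allergy rule, and the example drug name are assumptions made for illustration, and the append-only log and checkpointing phase described in the abstract are not modeled.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Trip:
    destination: str
    start: date
    international: bool


@dataclass
class Prescription:
    drug: str
    ingredients: frozenset[str]
    prescribed_on: date


@dataclass
class UserState:
    """Hypothetical typed user model; not the paper's actual schema."""
    allergies: set[str] = field(default_factory=set)
    trips: list[Trip] = field(default_factory=list)
    prescriptions: list[Prescription] = field(default_factory=list)
    alerts: list[str] = field(default_factory=list)

    def add_prescription(self, rx: Prescription) -> None:
        # The rule runs on every state change, so no query is needed to surface it.
        self.prescriptions.append(rx)
        self._check_allergy_conflicts(rx)

    def _check_allergy_conflicts(self, rx: Prescription) -> None:
        # Record one alert per ingredient that matches a stored allergy.
        for allergen in sorted(self.allergies & rx.ingredients):
            self.alerts.append(f"{rx.drug} contains {allergen}, a recorded allergy")


def international_trips_in(user: UserState, year: int) -> int:
    # An aggregate question becomes a one-line computation over typed state.
    return sum(1 for t in user.trips if t.international and t.start.year == year)


user = UserState(allergies={"penicillin"})
user.trips.append(Trip("Tokyo", date(2025, 3, 2), international=True))
user.trips.append(Trip("Osaka", date(2025, 9, 14), international=True))
user.trips.append(Trip("Denver", date(2025, 6, 1), international=False))
user.add_prescription(
    Prescription("ExampleCillin", frozenset({"penicillin"}), date(2026, 1, 5))
)
print(international_trips_in(user, 2025))
print(user.alerts)
```

In this sketch the count of trips is computed rather than retrieved, and the allergy alert is produced as a side effect of the state change, which is the contrast with query-driven memory that the abstract draws.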

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

2606.16684v1 by Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.

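Summary: Vibration-based bearing fault diagnosis must resolve three interrelated measurement challenges: the trade-off between the efficiency of global statistical features and the fidelity of local transient signals, insufficient traceability of measurement features to the underlying fault physics, and ineffective fusion of multi-source measurement information across diagnostic scales. This paper proposes a progressive, physics-guided, multi-scale vibration signal processing framework that addresses all three challenges in a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space and enables real-time fault screening at about 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs the analysis toward fault-relevant waveform regions, guided by physics-based priors and without manual feature engineering. Structured fault-mechanism knowledge is further encoded implicitly in the model parameters during training, which enables autonomous multi-scale measurement fusion with no external knowledge dependency at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework reaches 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.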

The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

2606.16649v1 by Christopner Koch, Joshua A. Wellbrock

Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near-term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium-complexity business processes. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human-centered capability with responsibility and accountability retained by people.

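Summary: Agentic AI marks a new stage of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies, this offers a way to reduce administrative burden, speed up routine processes, and make better use of organizational knowledge. The paper argues that the near-term value of agentic AI lies not in full autonomy or workforce reduction but in controlled partial autonomy for business processes of simple and medium complexity. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that agentic AI can become a productivity lever when it is implemented as a human-centered capability in which people retain responsibility and accountability.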

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

2606.16633v1 by Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

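Summary: Recent vision token pruning methods preserve model performance well under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget shrinks, accuracy loss is usually accompanied by larger shifts in the feature distribution, and the size of the shift correlates strongly with the performance drop. To characterize this phenomenon, we introduce a lightweight distribution consistency metric that estimates the distribution shift between the retained tokens and the full token set. Motivated by these observations, we propose a two-stage pruning framework composed of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). ACGR transfers contextual information before tokens are removed, while TATCS dynamically re-selects representative tokens when a severe distribution shift is detected. Extensive experiments show that the method achieves better and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance of LLaVA-1.5-7B using only 16 visual tokens.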

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

2606.16629v1 by Mohammed Amine Mouhoub

Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

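Summary: Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is an especially demanding setting: answers must be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of the classical sources, and legitimate jurisprudential disagreement must be represented rather than reduced to a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems need curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, evaluation by human experts, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey closes with a research agenda for hallucination-resistant Islamic AI systems.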

VeriGraph: Towards Verifiable Data-Analytic Agents

2606.16603v1 by Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.

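Summary: LLM-based agents show strong capabilities on data-intensive analytical tasks, but their outputs are rarely verifiable: reliance on linear text trajectories makes their reasoning hard to audit. In particular, deterministic computation over raw data and semantic deduction over natural-language claims are usually interleaved in one unstructured stream, so numerical conclusions are hard to reproduce and qualitative judgments are hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework in which the agent builds an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives (computational, grounding, and derivational expansion) that connect raw data, interpreter variables, computed results, and natural-language claims in a single graph. Under this formulation, structural traceability reduces to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we also design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, reaching an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising route to verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.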
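
The statement that structural traceability reduces to graph reachability can be made concrete with a short sketch. This is not the VeriGraph code: the edge orientation (evidence points to what it supports), the node names, and the helper names `reachable_from` and `untraceable_claims` are assumptions chosen for illustration, and the semantic, claim-level evaluation is not modeled.

```python
from collections import deque


def reachable_from(edges, sources):
    """Return every node reachable from `sources` by following directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def untraceable_claims(edges, raw_sources, claims):
    """List the claims that no path connects back to a raw data source."""
    reached = reachable_from(edges, raw_sources)
    return [claim for claim in claims if claim not in reached]


# Hypothetical evidence graph: raw file -> interpreter variable -> result -> claim.
edges = [
    ("sales.csv", "df"),
    ("df", "q1_mean"),
    ("q1_mean", "claim_growth"),
    ("claim_growth", "claim_summary"),
    ("claim_market", "claim_summary"),
]
print(untraceable_claims(edges, ["sales.csv"], ["claim_growth", "claim_market"]))
```

Here `claim_market` has no incoming path from `sales.csv`, so it is reported as structurally untraceable even though it feeds a later claim.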

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

2606.16591v2 by Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.

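Summary: Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, which makes tools the central interface for acting in realistic digital environments. As harness-connected tool ecosystems grow to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that restricts agents to a predefined static inventory. Retrieval-augmented tool selection is a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where the required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and that retrieves tools dynamically as the task state evolves. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines while reducing full-corpus tool-schema exposure by 99.8%, showing that an intention-aware graph structure enables more accurate and more context-efficient tool discovery in large-scale agentic ecosystems.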

Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict

2606.17119v1 by Sozan Sulaiman Maghdid, Tarik Ahmed Rashid, Shavan Askar

Cyber-physical systems have introduced new threats and new challenges for detection and immediate response. This study examines how Graph Neural Networks (GNNs) can support cybersecurity and drone management in a cyber-physical system that involves cyber intrusions and unmanned aerial vehicles (UAVs). Building on the structural understanding that graph neural networks provide, this work presents an integrated procedure that allows intrusion detection systems to learn the underlying network structure, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were created to provoke drone responses, which showed that graph-based learning can assist with situational awareness, swarm coordination, and adaptive maneuvering. According to the performance evaluation, the method achieves a detection rate of 94.2, an average area under the receiver operating characteristic (ROC) curve of 0.955, and an average response time of 1.4 seconds. Comparative experiments reveal that the proposed GraphSAGE network is more effective than Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) in the identical setting. These findings show that graph neural networks can be used for intrusion prevention and response in dynamic cyber-physical systems.

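Summary: Cyber-physical systems bring new threats and challenges, particularly for detection and immediate response. This study explores how graph neural networks (GNNs) can support cybersecurity and drone management in a setting that involves cyber intrusions and unmanned aerial vehicles (UAVs). Drawing on the structural understanding that graph neural networks provide, it presents an integrated procedure that lets intrusion detection systems learn the underlying network structure, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were built to trigger drone responses, showing that graph-based learning can assist situational awareness, swarm coordination, and adaptive maneuvering. According to the performance evaluation, the method has a detection rate of 94.2, an average area under the receiver operating characteristic (ROC) curve of 0.955, and an average response time of 1.4 seconds. Comparative experiments show that the proposed GraphSAGE network is more effective than graph convolutional networks (GCNs) and graph attention networks (GATs) under the same conditions. These findings show that graph neural networks can be used for intrusion prevention and response in dynamic cyber-physical systems.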

Kairos: A Native World Model Stack for Physical AI

2606.16533v2 by Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unified world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

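Summary: World models are shifting from passive visual generators to foundational operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent state over long horizons, and run efficiently under real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world through a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction in a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, in which sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds showing that this temporal factorization strictly limits error accumulation, giving a mathematical guarantee of state propagation over extended horizons. (3) Kairos runs the world through a Deployment-Aware System Co-Design that supports low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos reaches top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.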

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

2606.16523v1 by Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.

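Summary: Knowledge is managed through Wikipedia and software through GitHub, but agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by turning heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration shows the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki points to a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.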

Model Graph Inductive Learning for Knowledge Graph Completion

2606.16509v1 by Mohommad Esmaei Khani, Mahdieh Hasheminejad, Ali Taherkhani, Hossein Hajiabolhassan

Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (MGIL), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.

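Summary: Link prediction in knowledge graphs depends fundamentally on the quality of the embeddings learned for entities and relations. Most existing methods, however, derive these embeddings by aggregating only each entity's local neighborhood and ignore the global structure of the knowledge graph. This limited view keeps models from capturing the higher-level structural patterns that accurate and generalizable link prediction requires. To address these limitations, we propose Model Graph Inductive Learning (MGIL), a framework that builds a model graph by clustering entities according to the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture a global view of the knowledge graph. These embeddings then serve as high-quality initial features for the original knowledge graph, replacing random initialization and yielding more stable and more expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks show that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, which highlights its effectiveness across diverse graph settings.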
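
The model-graph construction step can be sketched in a few lines. This is a simplified illustration, not the MGIL method: it groups entities whose sets of incoming and outgoing relation types are identical, whereas the abstract describes clustering by similarity (or by entity type), and the function name `build_model_graph` and the toy triples are assumptions. The GNN that would run on the resulting model graph is omitted.

```python
from collections import defaultdict


def build_model_graph(triples):
    """Cluster entities by exact relation signature and lift edges to clusters.

    Assumption: exact signature match stands in for the similarity-based
    clustering described in the abstract.
    """
    out_rels = defaultdict(set)
    in_rels = defaultdict(set)
    for head, rel, tail in triples:
        out_rels[head].add(rel)
        in_rels[tail].add(rel)

    entities = sorted(set(out_rels) | set(in_rels))
    signature_to_cluster = {}
    cluster_of = {}
    for entity in entities:
        signature = (frozenset(in_rels[entity]), frozenset(out_rels[entity]))
        # Assign the next free cluster id the first time a signature is seen.
        cluster_of[entity] = signature_to_cluster.setdefault(
            signature, len(signature_to_cluster)
        )

    # Each knowledge-graph edge becomes an edge between the clusters of its endpoints.
    model_edges = {(cluster_of[h], r, cluster_of[t]) for h, r, t in triples}
    return cluster_of, model_edges


triples = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "globex"),
    ("acme", "located_in", "paris"),
    ("globex", "located_in", "berlin"),
]
cluster_of, model_edges = build_model_graph(triples)
print(cluster_of)
print(sorted(model_edges))
```

In the toy graph the two people, the two companies, and the two cities each collapse into one cluster, so four knowledge-graph edges become two model-graph edges.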

REFLEX: Reflective Evolution from LLM Experience

2606.16496v1 by Pan Wang

Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

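Summary: Large multimodal language models (LLMs) have become powerful tools for guiding evolutionary search toward interpretable programmatic policies. Existing frameworks, however, rely on a single monolithic model call that must both interpret visual behavioral evidence and synthesize corrective code. This entanglement of diagnosis and repair creates an opaque feedback loop, obscures the rationale behind mutations, and prevents algorithmic insights from being retained across independent runs. To make policy search auditable and efficient, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a training-free evolutionary framework that implements this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. A text-optimized Actor then synthesizes child policies from these diagnoses together with a persistent, self-evolving Skill Memory of reusable code snippets. This architecture provides transparent mutation traces and also enables programmatic knowledge transfer across runs. Extensive evaluations on control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task show exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.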

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

2606.16494v1 by Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of the context is used, while the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within a question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

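Summary: Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions beyond their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In text-only long-context LLMs, the use of retrieved context follows the U-shaped "lost-in-the-middle" effect described by Liu et al. (2024): information at the start and end of the context is used, while information in the middle is lost. Whether this carries over to deployed multimodal KB-VQA is an open question. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the prompt slot of the gold passage varies within a question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks with k up to 20. The shape flips from a U to primacy: placing the gold passage first beats placing it last by 16 to 26 points in every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow down the cause: a text-only control shows that the multimodal setting amplifies an already-present text-mode primacy by a factor of 2.2 to 4.5, and the image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. With a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact, with no separable improvement. Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.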
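
The core of a gold-position protocol, in which only the gold passage's slot changes while everything else stays fixed, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the released protocol: the function names `place_gold` and `gold_position_variants` are hypothetical, and prompt formatting, image placement, and scoring are left out.

```python
def place_gold(distractors, gold, slot):
    """Insert the gold passage at `slot`, keeping the distractor order unchanged."""
    if not 0 <= slot <= len(distractors):
        raise ValueError(f"slot must be in [0, {len(distractors)}], got {slot}")
    return distractors[:slot] + [gold] + distractors[slot:]


def gold_position_variants(distractors, gold):
    """Build one passage list per possible gold slot for a single question."""
    return [place_gold(distractors, gold, slot) for slot in range(len(distractors) + 1)]


distractors = ["d1", "d2", "d3"]
for passages in gold_position_variants(distractors, "GOLD"):
    print(passages)
```

Because the distractors and their order are identical across variants, any accuracy difference between the first and last variant can be attributed to the gold passage's position rather than to retrieval quality.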

Unified Multimodal Model for Brain MRI Imputation and Understanding

2606.16484v1 by Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

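Summary: Multimodal large language models (MLLMs) have great potential in medicine because they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed, and interpreted in natural language. The field of medical MLLMs is nonetheless held back by non-trivial challenges, in particular the scarcity of high-quality training data and the frequent occurrence of missing data in real clinical settings. We propose UniBrain, a novel unified multimodal model for brain magnetic resonance image (MRI) analysis. To handle potentially missing brain MRI modalities, we adopt a unified training strategy that performs imaging modality imputation and brain image understanding jointly. During training, an interleaved, description-enriched data flow is constructed to train the model autoregressively, so that it can carry out medical reasoning with generated multimodal data. A self-alignment strategy uses dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. We also propose a dynamic hidden state mechanism to reduce exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset show that UniBrain achieves high performance in brain image imputation, understanding, and disease diagnosis under varying degrees of modality incompleteness.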

Medical

Publish Date	Title	Authors	Homepage	Code
2026-06-17	Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA	Ikram Belmadani et.al.	2606.19266v1	null
2026-06-17	A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers	Keran Wang et.al.	2606.19247v1	null
2026-06-17	Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis	Soheyl Bateni et.al.	2606.19183v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et.al.	2606.19174v1	null
2026-06-17	A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI	Syed Mujtaba Haider et.al.	2606.18970v1	null
2026-06-17	Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems	Bernardo Feijó Junqueira et.al.	2606.18882v1	null
2026-06-17	RedactionBench	Sean Brynjólfsson et.al.	2606.18782v1	null
2026-06-17	Augmenting Dysarthric Speech Severity Assessment with MOS Supervision	Kaimeng Jia et.al.	2606.18645v1	null
2026-06-17	Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance	Tianming Du et.al.	2606.18613v1	null
2026-06-17	Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep	Amama Mahmood et.al.	2606.18596v1	null
2026-06-16	PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization	Arshia Ilaty et.al.	2606.18518v1	null
2026-06-16	From Specification to Execution: AI Assisted Scientific Workflow Management	Komal Thareja et.al.	2606.18425v1	null
2026-06-16	RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills	Weizhi Zhang et.al.	2606.18203v1	null
2026-06-16	WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning	Yuwei Zhang et.al.	2606.18147v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications	Divyansh Srivastava et.al.	2606.18068v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents	Ander Alvarez et.al.	2606.18037v1	null
2026-06-16	Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis	Yonghao Chen et.al.	2606.17989v1	null
2026-06-16	STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training	Jinjie Shen et.al.	2606.17979v1	null
2026-06-16	Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation	Andrea Santomauro et.al.	2606.17961v1	null
2026-06-16	A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease	Antonio Scardace et.al.	2606.17867v1	null
2026-06-16	When Multiple Scripts Matter: Evaluating ASR in Clinical Settings	Jean Seo et.al.	2606.17826v1	null
2026-06-16	Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection	Nikola Kovacevic et.al.	2606.17767v1	null
2026-06-16	Vision-language models for chest radiography do not always need the image	Mahshad Lotfinia et.al.	2606.17710v1	null
2026-06-16	SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology	Wan Siti Halimatul Munirah Wan Ahmad et.al.	2606.17702v1	null
2026-06-16	AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows	Jiahui Niu et.al.	2606.17474v1	null
2026-06-16	A Machine-Learned Comorbidity Index	Suleman Baloch et.al.	2606.17450v1	null
2026-06-16	Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems	Xi Chu et.al.	2606.17443v1	null
2026-06-16	Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos	Bo Gou et.al.	2606.17437v1	null
2026-06-16	Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization	Jianwei Zhang et.al.	2606.17420v1	null
2026-06-16	Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation	Xinyu Qin et.al.	2606.17405v1	null
2026-06-15	Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation	Hongchao Shu et.al.	2606.17340v1	null
2026-06-15	SpeechDx: A Multi-Task Benchmark for Clinical Speech AI	Sejal Bhalla et.al.	2606.17339v1	null
2026-06-15	Symbolic Informalization: Fluent, Productive, Multilingual	Aarne Ranta et.al.	2606.16893v1	null
2026-06-15	Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering	Sanjay Basu et.al.	2606.16890v1	null
2026-06-15	Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection	Markus Bujotzek et.al.	2606.16868v1	null
2026-06-15	GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents	Rahul Suresh Babu et.al.	2606.16813v1	null
2026-06-15	AgentFairBench: Do LLM Agents Discriminate When They Act?	Triveni Morla et.al.	2606.16723v1	null
2026-06-15	Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies	Ke Liu et.al.	2606.16721v1	null
2026-06-15	Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation	Rutherford A. Patamia et.al.	2606.16568v1	null
2026-06-15	Unified Multimodal Model for Brain MRI Imputation and Understanding	Zhiyun Song et.al.	2606.16484v1	null
2026-06-15	Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis	Jingyu Hu et.al.	2606.17115v1	null
2026-06-15	Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning	Junting Wen et.al.	2606.16434v1	null
2026-06-15	Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers	Sourya Sengupta, Mark A. Anastasio et.al.	2606.16362v1	null
2026-06-15	Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules	Wei Xu et.al.	2606.16337v2	null
2026-06-15	Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans	Tengfei Ma et.al.	2606.16234v1	null
2026-06-15	Embedded Arena: Iterative Optimization via Hardware Feedback	Zhihan Zhang et.al.	2606.16190v1	null
2026-06-15	A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond	Pengyu Zhu et.al.	2606.16153v1	null
2026-06-15	LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis	Minh-Ha Nguyen et.al.	2606.16149v1	null
2026-06-15	PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization	Samah Fodeh et.al.	2606.16074v1	null
2026-06-14	DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts	Zijian Carl Ma et.al.	2606.15931v1	null
2026-06-14	Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot	Yuyang Dai et.al.	2606.15810v1	null
2026-06-14	EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries	Jiyoun Kim et.al.	2606.15735v2	null
2026-06-14	Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning	Zhenyu Yu et.al.	2606.15733v1	null
2026-06-14	AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan	Mohammed Fasha et.al.	2606.15709v1	null
2026-06-14	LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science	Eyup Engin Kucuk et.al.	2606.15566v1	null
2026-06-13	Hierarchical Modeling of ICD Codes in EHR Foundation Models	Megha Thukral et.al.	2606.15447v1	null
2026-06-13	Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models	Mayur Sanap et.al.	2606.15436v1	null
2026-06-13	Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering	Zaifu Zhan et.al.	2606.15419v1	null
2026-06-13	APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents	Ya-Chuan Chen et.al.	2606.15363v1	null
2026-06-13	CAP: Towards PPG Universal Representation Learning with Patient-level Supervision	Chenyang He et.al.	2606.15284v1	null
2026-06-13	RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning	Jinhan Liu et.al.	2606.15278v1	null
2026-06-13	Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs	Zhisen Hu et.al.	2606.15250v1	null
2026-06-13	Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings	Weihao Gao et.al.	2606.15176v1	null
2026-06-13	Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting	Xinze Zhang et.al.	2606.15055v1	null
2026-06-13	Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling	Zhemin Zhang et.al.	2606.15038v1	null
2026-06-12	Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability	Alyssa Unell et.al.	2606.15029v1	null
2026-06-12	ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning	Sicheng Yang et.al.	2606.14697v1	null
2026-06-12	Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts	Farica Zhuang et.al.	2606.14608v1	null
2026-06-12	A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health	Pavlos Nicolaou et.al.	2606.14604v1	null
2026-06-12	CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation	Guanyu Liu et.al.	2606.14581v1	null
2026-06-12	Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach	Taym Alshoghri et.al.	2606.14515v1	null
2026-06-12	Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport	Paula Joy B. Martinez et.al.	2606.14157v1	null
2026-06-12	Applicability Condition Extraction for Therapeutic Drug-Disease Relations	Guanting Luo et.al.	2606.14031v1	null
2026-06-11	Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography	Louis Chen et.al.	2606.13839v1	null
2026-06-11	ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages	Tanmoy Kanti Halder et.al.	2606.13572v1	null
2026-06-11	Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation	Aruna Dey et.al.	2606.13556v2	null
2026-06-11	MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment	Minlin Zeng et.al.	2606.13258v2	null
2026-06-11	Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints	Omar Alshahrani et.al.	2606.13211v1	null
2026-06-11	Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework	Abhishek H S et.al.	2606.13188v1	null
2026-06-11	Mental-R1: Aligning LLM Reasoning for Mental Health Assessment	Xin Wang et.al.	2606.13176v1	null
2026-06-11	Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation	Elena S. Kozachok et.al.	2606.13135v1	null
2026-06-11	AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction	Fabien Maury et.al.	2606.13051v1	null
2026-06-11	A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis	Manex Atxa et.al.	2606.12988v1	null
2026-06-11	OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models	Ibrahim Gulluk et.al.	2606.12953v1	null
2026-06-11	Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata	Daniel Soliman et.al.	2606.12824v1	null
2026-06-10	Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System	Alyssa Unell et.al.	2606.12702v1	null
2026-06-10	LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data	Yifan Gao et.al.	2606.12699v1	null
2026-06-10	CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents	Siyu Shen et.al.	2606.12666v2	null
2026-06-10	Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs	Shayan Mohammadizadehsamakosh et.al.	2606.12590v1	null
2026-06-10	EDEN: A Large-Scale Corpus of Clinical Notes for Italian	Tiziano Labruna et.al.	2606.12569v1	null
2026-06-10	Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy	Kai Standvoss et.al.	2606.12346v1	null
2026-06-10	Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification	Veerendhra Kumar Dangeti et.al.	2606.12252v1	null
2026-06-10	OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models	Negin Baghbanzadeh et.al.	2606.12169v1	null
2026-06-10	Towards Responsibly Non-Compliant Machines	Marija Slavkovik et.al.	2606.12147v1	null
2026-06-10	Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation	Minh-Khoi Pham et.al.	2606.12006v1	null
2026-06-10	Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability	Kuo-En Hung et.al.	2606.11930v2	null
2026-06-10	Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task	Qianyu Yao et.al.	2606.11830v1	null
2026-06-10	Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data	Boris-Stephan Rauchmann et.al.	2606.11794v1	null

Abstracts

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2606.19266v1 by Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

摘要：大型語言模型（LLMs）的發展使得對其在專業領域和語言中的適應性更加關注，但領域適應策略的有效性仍然不明確。我們以法語醫療問答（QA）為案例，展示了一項醫療領域適應的研究。我們比較了持續預訓練（CPT）、監督微調（SFT）及其組合，在三個模型家族、多個尺寸和三種初始化類型之間，明確區分適應效果與基礎模型選擇的影響。我們在貪婪和受限解碼下，使用自動指標和LLM作為評估者的評價，評估了多選擇題（MCQA）和開放式問答（OEQA）。對於MCQA，CPT+SFT最常達到最佳分數，但相較於SFT的增益較小且經常不具統計顯著性，使得SFT成為一個強大且具成本效益的預設選擇。對於OEQA，CPT始終改善基於重疊的指標，而SFT則常常降低生成質量；指令調整和CPT+SFT在LLM基礎的評估中更受青睞。跨語言實驗進一步顯示了從法語適應到英語基準的有效轉移。總體而言，我們提供了在計算限制下選擇適應策略的實用指導方針。

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

2606.19247v1 by Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers -- also referred to as the "invisible second patients" -- experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

摘要：家庭成員照顧阿茲海默症及相關癡呆症（AD/ADRD）患者，為全球長期照護提供了基礎。在2023年，超過1100萬名美國家庭成員和朋友貢獻了180億小時的無償照護，這往往以他們自身的身心健康為代價。這些非正式的照護者——也被稱為「隱形的第二患者」——經歷著較高的心理健康問題發生率。然而，研究通常將他們複雜的心理社會經驗簡化為單一的照護者負擔構念，模糊了哪些特定需求未被滿足或有效支持。同時，數位和人工智慧技術正在迅速擴展，從智慧型手機應用程式和視訊會議到感應平台和人工智慧聊天機器人。然而，醫學、心理學和技術研究之間缺乏共享框架，限制了累積進展。本研究介紹了一個照護者心理健康與技術分類法，系統性地將AD/ADRD照護者需求與相應的技術干預類別聯繫起來。該分類法基於跨學科的文獻回顧和兩項與照護者的質性研究，識別出照護者優先事項與現有技術支持之間的不匹配，突顯了如關係緊張和同情疲勞等被忽視的領域，並提出了適應性、響應性系統的設計方向。該框架提供了一個共享的詞彙，以指導臨床醫生、研究人員和技術設計師在癡呆症照護中開發更以人為中心且臨床基礎的創新。

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

2606.19183v1 by Soheyl Bateni, Maryam Abdolali

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

摘要：大型語言模型（LLMs）可以通過解釋自由文本文檔，使臨床決策支持變得更加可及，但它們作為診斷引擎的直接使用受到提示、信息順序和合理但不正確的輸出敏感性的限制。結構化機器學習模型提供了更穩定的風險預測，但它們需要難以與敘述性臨床工作流程集成的表格輸入。我們提出了ClaMPAPP（臨床語言輔助機器學習管道，用於闌尾炎），這是一個混合系統，將LLM用作接口，而不是最終決策者。ClaMPAPP從類似筆記的敘述中提取受架構約束的臨床特徵，應用確定性的合理性檢查，並將經過驗證的特徵傳遞給基於臨床、實驗室和超聲變量訓練的XGBoost分類器。我們在來自德國醫院的兩個獨立兒科闌尾炎隊列上評估了ClaMPAPP，並將其與端到端LLM基準進行比較，包括開源和專有模型。為了在測試自由文本輸入時保留真實情況，敘述是通過模板渲染和受限的LLM重寫從結構化電子健康記錄生成的，並進行了額外的句子順序置換以評估位置穩健性。ClaMPAPP在內部和外部驗證中都實現了最強的整體診斷性能，同時最小化漏診的闌尾炎病例，這是急性分診中的關鍵安全問題。端到端LLM顯示出不穩定的敏感性-特異性權衡，並在敘述重新排序下出現更大的降級。這些結果支持LLM作為接口、機器學習作為預測者的設計，將自然語言的可用性與預測推斷分開，並提供了一個更可審計的臨床決策支持途徑。

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $\tau$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub (https://github.com/13204942/SonoRate).

摘要：臨床醫師中心的評估對於驗證醫療人工智慧系統至關重要，尤其是在超聲影像中，定量指標並不總是能夠捕捉臨床可用性。現有的醫療影像平台主要集中於數據集標註。它們缺乏對盲測模型比較和可重複評估工作流程的綜合支持。我們提出了一個臨床醫師中心的管道，用於遠程標註和超聲人工智慧研究中的評估。所提議的管道使用集中式伺服器和輕量級瀏覽器介面，使臨床醫師能夠在不下載本地數據集的情況下進行標註、盲測排名和審查。該管道還支持多評審者參與、集中結果聚合和自動統計分析。我們在一項涉及六位評審者的胎兒超聲分割研究中驗證了該管道，這些評審者的經驗水平涵蓋了專家、通才和非專家。系統自動生成了斯皮爾曼相關係數、肯德爾的 $\tau$ 和前一選擇統計數據。結果顯示專家和其他組別之間的協議程度從中等到強。盲測評估結果顯示後期主動學習模型更受青睞。這些結果表明該管道可以支持臨床醫師中心的標註和可重複的人類-AI 評估研究在超聲影像中。所提議的管道可在 GitHub（https://github.com/13204942/SonoRate）上獲得。
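
The agreement statistics named in the abstract above (Spearman correlation, Kendall's tau, top-1 selection) can be illustrated with a minimal Python sketch. This is not the SonoRate implementation: the function names and the tie-free ranking assumption are hypothetical, chosen only to show how two raters' blinded rankings of the same models might be compared.

```python
from collections import Counter
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / number of item pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top1_counts(rankings):
    """Count how often each item index is ranked first (rank 1) across raters."""
    return Counter(r.index(1) for r in rankings)


# Hypothetical rankings of four models (1 = most preferred) by two raters.
rater_expert = [1, 2, 3, 4]
rater_novice = [2, 1, 3, 4]
rho = spearman_rho(rater_expert, rater_novice)
tau = kendall_tau(rater_expert, rater_novice)
firsts = top1_counts([rater_expert, rater_novice])
```

A production pipeline would more likely call a statistics library that handles tied ranks; the hand-written versions here only make the definitions explicit.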

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

2606.18970v1 by Syed Mujtaba Haider, Silvia Figini

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

摘要：醫學影像分類常受到有限標記數據的限制，這促使了生成增強的需求；最近，為此目的提出了量子生成模型，並經常報告準確性提升。然而，這些說法通常基於單次訓練運行，未能匹配量子和經典生成器的參數預算，且未能描述任何好處出現的數據範疇。我們提出了一個受控基準，隔離量子生成器對腦部MRI增強的貢獻。影像被編碼進入一個KL正則化的潛在空間，在該空間中，使用變分量子生成器或參數數量幾乎相同的經典生成器（1648對1632）訓練一個帶有梯度懲罰的條件Wasserstein GAN。合成樣本被解碼並用於增強一個預訓練的分類器，涵蓋從5%到100%的標記數據比例，並在八個隨機種子上進行評估，使用配對顯著性測試（帶有多重比較修正）以及內部集多樣性和潛在分佈分析。在所有比例中，沒有增強變體顯著超越僅使用真實數據的訓練，且量子和經典生成器在統計上無法區分。任何低數據的好處表現為正則化，而非忠實的數據擴展：合成樣本在數據稀缺的地方偏離分佈並嚴重模式崩潰，且量子生成器的多樣性不比其經典對應物更高。我們釋放該協議作為醫學影像中量子生成增強嚴格評估的測試平台。
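
The abstract above mentions paired significance testing over eight seeds with multiple-comparison correction but does not name the specific test or correction. As a hedged illustration only, the sketch below assumes an exact sign-flip permutation test on per-seed differences and a Holm step-down correction; both choices and all function names are assumptions, not the paper's protocol.

```python
from itertools import product


def paired_signflip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-seed differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

With eight seeds the permutation test enumerates only 2^8 = 256 sign patterns, so the smallest attainable two-sided p-value is 2/256; one p-value per labeled-data fraction would then be passed through `holm_adjust`.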

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

2606.18882v1 by Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.

摘要：這項工作探討了針對回歸任務應用領域轉移感知神經網絡，以估算在變化操作條件下旋轉軸上的不平衡質量。實驗數據是從一個測試裝置中收集的，該裝置中一根主軸配備有承載不平衡質量的法蘭，以不同的轉速驅動，而一根次軸則可以選擇性啟動以引入領域差異。不平衡質量被定位在固定的徑向距離，系統的動態響應是使用三軸加速度計記錄的。質量估算的逆問題是在一個領域適應框架內進行公式化的，其中網絡使用最大均值差異策略進行訓練，以對齊源和目標分佈之間的特徵表示。結果顯示，明確處理領域轉移在提高預測準確性方面的有效性，特別是在系統的物理行為和領域差異的來源未完全了解且超出訓練條件時。這些發現突顯了領域轉移感知模型在結構健康監測中進行回歸任務的潛力。
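
The maximum mean discrepancy (MMD) strategy mentioned in the abstract above aligns source and target feature distributions by penalizing a kernel-based distance between them. The sketch below is a minimal pure-Python illustration, assuming a Gaussian (RBF) kernel, the biased estimator, and a simple weighted sum with the regression loss; the kernel, bandwidth, weighting, and function names are assumptions rather than details taken from the paper.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


def domain_aware_loss(regression_loss, source_features, target_features, weight=1.0):
    """Hypothetical training objective: task loss plus a weighted MMD penalty."""
    return regression_loss + weight * mmd_squared(source_features, target_features)
```

The penalty is zero when the two feature sets coincide and grows as the target features drift away from the source features, which is what pushes the network toward domain-invariant representations.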

RedactionBench

2606.18782v1 by Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

摘要：大型語言模型越來越多地應用於需要刪除個人可識別信息（PII）的敏感領域。雖然刪除PII是數據清理的前提，但現有基準將提取機制與隱私語義混為一談。公共電話號碼並不等同於醫療記錄中的電話號碼。信息是否構成違規在很大程度上取決於誰持有它、為什麼以及在什麼上下文中，這根本上區分了刪除和簡單的實體識別。基於上下文完整性，我們引入了RedactionBench，一個手動註釋的基準，包含來自11個領域的200份多樣化文件，大多數來源於現實世界。我們還引入了R-Score，一種新穎的字符級指標，平等對待語義相似的刪除，並消除淺顯的格式選擇，例如對電話號碼使用不同的掩碼樣式。在命名實體識別模型、實體提取小型語言模型和配備代理工具的前沿模型的評估中，顯示上下文刪除仍然是一個未解決的問題。對RedactionBench進行的超過80名用戶的人類評估顯示出隱私感知的明顯二元性。註釋者對於強制刪除的目標標籤（89.4%）和安全文本保留（94.1%）顯示出共識，但對於上下文刪除（47.7%）則未能達成一致。這種變異顯示了上下文隱私的主觀性，並促進了R-Score的發展，該指標將上下文模糊性與嚴格精確性解耦。我們比較了35個模型的不同類別，並報告了它們在刪除PII方面的表現。最後，我們發布了RedactionBench，以建立未來隱私保護系統的基準，希望能激發高效的模型設計和標準化評估。

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

2606.18645v1 by Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang, Chao Zhang

Dysarthria is a speech disorder marked by reduced intelligibility and communicative effectiveness. Automatic utterance-level assessment of dysarthric speech can support scalable speech monitoring and therapy-related analysis. Yet training such systems is bottlenecked by the scarcity of clinically annotated dysarthric speech. This work proposes to augment dysarthric speech assessment using data from speech synthesis evaluations, specifically human-annotated utterances with Mean Opinion Score (MOS) labels from the QualiSpeech corpus. Experiments show that fine-tuning on speech synthesis assessment data consistently improves performance on both intelligibility and naturalness prediction, while joint training yields gains primarily on naturalness. These results suggest that synthesis artifacts and dysarthric speech share perceptual commonalities, and speech synthesis evaluation corpora offer a practical augmentation source that reduces reliance on scarce clinical annotations.

摘要：構音障礙是一種以可理解性和交際效果降低為特徵的言語障礙。自動化的語句層級構音障礙語音評估可以支持可擴展的語音監測和治療相關分析。然而，訓練這類系統的瓶頸在於臨床註釋的構音障礙語音稀缺。本研究提議利用語音合成評估中的數據來增強構音障礙語音評估，特別是來自QualiSpeech語料庫的人類註釋語句，並附有平均意見分數（MOS）標籤。實驗顯示，在語音合成評估數據上進行微調能夠持續改善可理解性和自然性預測的表現，而聯合訓練主要在自然性上帶來增益。這些結果表明，合成工件和構音障礙語音在感知上存在共通性，而語音合成評估語料庫則提供了一個實用的增強來源，減少對稀缺臨床註釋的依賴。

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

2606.18613v1 by Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

摘要：醫療 LLMs 在近期最可能的角色是協助而非取代醫生，但目前的評估往往測試孤立的能力：臨床知識、EHR 系統互動或病人溝通。醫生的協助需要在同一互動中協調這些能力，在這裡醫生發出不明確的請求，病人模糊地描述症狀，而 EHR 系統則需要精確的工具使用。我們引入了 PhysAssistBench，一個用於互動醫生-病人-EHR 協助的基準。PhysAssistBench 由真實的 MIMIC-IV 案例構建，使用可擴展的管道來構建具主動性的病人：互動的、基於記錄的代理，將靜態的 EHR 記錄轉化為多輪臨床場景，同時保持臨床事實性。PhysAssistBench 提供了一個經過策劃的雙語評估集，包含 1,296 個手動審核和醫生驗證的回合。與領先的 LLMs 進行的實驗顯示，當前模型在這種環境中仍然不可靠，這暴露了臨床 LLMs 的一個關鍵瓶頸：可靠的協助需要在知識、溝通和系統之間進行協調，而不是在任何一個方面的孤立增長。

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

2606.18596v1 by Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

摘要：睡眠日記是行為睡眠醫學和失眠認知行為療法的核心，但每日填寫難以持續，靜態形式往往提供有限的背景來解釋夜間睡眠變化。我們設計了一個基於大型語言模型的對話式語音日記，通過主動的智能音箱提示、結構化的對話收集和自適應的後續對話，提供臨床基礎的早晚睡眠日記問題。我們在一項為期四週的受試者間田野研究中評估了該系統，參與者為30名大學生，並將其與使用匹配日記項目、報告窗口和提醒間隔的文本基礎移動日記進行比較。與文本基礎日記相比，對話式語音日記顯示出更高的依從性，並引發了更詳細的上下文自我報告，涉及日常作息、壓力源、環境條件和其他與睡眠相關的因素。參與者還描述語音日記更容易融入日常作息，儘管感知的完成時間較長。然而，基於語音的對話收集在某些結構化日記字段中產生了較低的完整性，顯示出表達豐富性與結構精確性之間的權衡。這些發現顯示了使用基於大型語言模型的對話式語音助理進行長期健康自我報告的潛力與挑戰。

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

2606.18518v1 by Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.

摘要：醫療AI的發展受到高品質臨床數據獲取有限的限制，這是由於機構孤島和嚴格的隱私法規，例如HIPAA和GDPR。合成數據生成提供了一個潛在的解決方案，但現有的方法缺乏原則性機制來明確管理隱私與效用之間的權衡，這往往會降低臨床上有意義的模式或危及患者的重新識別。我們提出了PSyGenTAB，一個保護隱私的生成框架，將合成醫療數據生成公式化為一個約束優化問題，並使用增強拉格朗日方法解決。通過將可配置的隱私約束直接嵌入模型訓練中，PSyGenTAB在最大化臨床數據效用的同時，強制執行最低隱私閾值。在多個臨床動機的基準測試中，PSyGenTAB保留了臨床特徵之間的關係和對可靠健康AI至關重要的少數類別診斷模式。使用“在合成數據上訓練，在真實數據上測試”和“在真實數據上訓練，在合成數據上測試”的下游評估顯示，基於合成數據訓練的模型達到了與基於真實患者記錄訓練的模型相當的性能。隱私審計進一步顯示出精確記錄再現的減少和對會員推斷攻擊的強大抵抗力。這些結果確立了PSyGenTAB作為一個原則性框架，在合成醫療數據中平衡隱私保護和臨床效用，支持安全的跨機構AI開發。
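
The Augmented Lagrangian Method named in the abstract above turns a constrained problem into a sequence of unconstrained ones, updating a multiplier after each inner solve. The sketch below shows the idea on a one-dimensional toy problem and is not PSyGenTAB itself: the objective stands in for a utility loss, the inequality constraint stands in for a privacy threshold, and the penalty weight, step size, and iteration counts are arbitrary assumptions.

```python
def augmented_lagrangian(f_grad, g, g_grad, x0, rho=10.0,
                         outer_iters=30, inner_iters=200, lr=0.05):
    """Minimize f(x) subject to g(x) <= 0 for scalar x.

    Inner loop: gradient descent on the augmented Lagrangian in x.
    Outer loop: multiplier update lam <- max(0, lam + rho * g(x)).
    """
    x, lam = x0, 0.0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            # The constraint contributes only when the shifted violation is positive.
            active = max(0.0, lam + rho * g(x))
            x -= lr * (f_grad(x) + active * g_grad(x))
        lam = max(0.0, lam + rho * g(x))
    return x, lam


# Toy stand-ins: "utility loss" (x - 2)^2 with "privacy constraint" x - 1 <= 0.
x_opt, multiplier = augmented_lagrangian(
    f_grad=lambda x: 2.0 * (x - 2.0),
    g=lambda x: x - 1.0,
    g_grad=lambda x: 1.0,
    x0=0.0,
)
```

In this toy setting the unconstrained optimum (x = 2) violates the constraint, so the method settles on the boundary x = 1 and the multiplier converges to 2, reflecting how much utility is being traded for the constraint.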

From Specification to Execution: AI Assisted Scientific Workflow Management

2606.18425v1 by Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

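Summary: Scientific workflow management systems (WMS) support scalable, reproducible execution of complex pipelines, but designing, implementing, and debugging workflows is still mostly manual and demands considerable expertise. Recent LLM-based approaches show promise for generating workflows from natural language, yet they usually rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We propose an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, so that validation can happen before any code is generated. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer that offers a single interface for workflow submission, monitoring, and control. We evaluate the approach on a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-heavy structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and let non-expert users build workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible and point toward AI-driven platforms for managing the scientific workflow lifecycle.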

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

2606.18203v1 by Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

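Summary: LLM-driven personal health agents that use user health (sensor) metrics offer a promising way to reduce global inequality in healthcare access. Large-scale clinical deployment, however, is held back by an open-ended evaluation bottleneck: physician annotation is reliable but expensive and does not scale, whereas LLM-as-a-judge evaluators scale but are subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework built on an expert-aligned hierarchical taxonomy of more than 100 atomic, clinically verifiable Boolean rubrics. The rubrics grew out of insights from 4,000 real user queries through an iterative human-in-the-loop curation protocol run by an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant, automatically weighted subset of rubrics for each query, which provides the throughput that scalable evaluation needs while keeping expert-aligned quality. A systematic meta-evaluation shows that RubricsTree (i) clearly exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields relative gains of up to about 66% on HealthBench for the Gemini, GPT, and Qwen model families. RubricsTree therefore provides a scalable, auditable, and continually evolving evaluation infrastructure for the ongoing optimization of product-level personal health AI.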

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

2606.18147v1 by Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

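Summary: Language models perform remarkably well at medical question answering, in some cases exceeding the accuracy of general physicians. Answering questions about wearable health data, however, remains difficult and understudied: these ubiquitous sensors produce continuous, high-dimensional, longitudinal data that is hard to align with the text-centric distributions seen in LLM pretraining. The diversity of sensor modalities and user intents cannot be handled effectively by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytics and modeling tools. An LLM controller synthesizes execution plans, dynamically routes each query to a suitable combination of sensor analysis and pretrained models, and audits responses against external knowledge so that they stay grounded. We also curate a benchmark covering four open wearable datasets, with analytic and predictive tasks in three health domains. Experiments show that the framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.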

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

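Summary: Recent incidents involving LLMs used for mental-health support expose a critical evaluation gap: surface-level safety scores do not capture how models behave over time across realistic, emotionally sensitive interactions. Existing benchmarks measure knowledge, safety, or static response quality, but not whether LLM interactions help users keep reflecting, coping, and making their own decisions. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure for AI-mediated mental-health support that is distinct from safety and helpfulness. To measure it we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema covering user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We also introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across the five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour in both single-turn and multi-turn settings. Models generally respond to overt safety cues but adapt less reliably when users ask for solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. The work makes COGNITIVE ATROPHY measurable and provides a basis for auditing model behaviour in sensitive LLM conversations.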

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

2606.18068v1 by Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.

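Summary: Recent progress in large language models (LLMs) and multi-agent systems has driven the rise of agentic AI, which shows promise for medical reasoning. Open-ended conversational agents nevertheless remain prone to two critical failure modes: premature diagnostic handoff, and silent clinical hallucinations that may go undetected before they reach the patient. We propose a multi-agent framework that addresses both by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework has two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until every required dimension has been collected. Second, an epistemic uncertainty quantification (UQ) gate computes the semantic entropy (H) of K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system on 150 test cases using simulated patient agents driven by the llama-3.1-70b-instruct model. The full architecture reaches 49.3% diagnostic precision, an absolute improvement of 11.3 percentage points over an unconstrained baseline. We also observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), which suggests that structured information gathering is associated with lower diagnostic uncertainty.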
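
The two gates described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: it assumes the K sampled diagnoses have already been grouped into semantic-equivalence clusters (here simply string labels), and the field names in `OLDCARTS` and the `max_entropy` threshold are hypothetical.

```python
import math
from collections import Counter

# Hypothetical field names for the eight OLDCARTS dimensions.
OLDCARTS = (
    "onset", "location", "duration", "character",
    "aggravating_alleviating", "radiation", "timing", "severity",
)


def missing_dimensions(collected):
    """Return the OLDCARTS dimensions that have not been collected yet."""
    return [d for d in OLDCARTS if not collected.get(d)]


def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def may_deliver(collected, cluster_labels, max_entropy=0.7):
    """Allow a diagnosis only if both gates pass (threshold is an assumption)."""
    if missing_dimensions(collected):
        return False  # completeness gate: keep asking questions
    return semantic_entropy(cluster_labels) <= max_entropy  # uncertainty gate


complete = {d: "recorded" for d in OLDCARTS}
agreeing = ["migraine"] * 4 + ["tension headache"]            # K = 5 samples
divergent = ["migraine"] * 3 + ["sinusitis", "tension headache"]
```

With these toy samples, `may_deliver(complete, agreeing)` passes, while the divergent sample set or an incomplete history is blocked. In practice the clustering step (deciding which sampled diagnoses mean the same thing) carries most of the difficulty and is left out here.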

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that our method consistently outperforms end-to-end deep learning baselines or using LLMs as black-box classifiers under limited data conditions, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

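Summary: Medical image classification faces a fundamental dilemma: deep learning models perform remarkably well at scale, but real-world clinical settings often suffer from severe data scarcity because of annotation costs, privacy constraints, and disease rarity. The problem is especially acute in pathological scar classification, where telling keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a new paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers, and call the framework ScaFE (Scar Feature Engineering). The key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, turning high-dimensional images into low-dimensional, clinically interpretable representations. Concretely, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. The approach has three main advantages: (1) data efficiency, with robust performance from limited training samples because knowledge acquisition is decoupled from statistical learning; (2) privacy preservation, because raw images are processed locally and never exposed to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive scar classification experiments show that, under limited data, the method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, which points to a promising direction for integrating LLMs into data-efficient, clinically transparent medical AI systems.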

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

2606.18037v1 by Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.

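Summary: Tool-using LLM agents increasingly rely on the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence and so miss a provenance-sensitive failure mode: a claim may be supported somewhere yet attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares the stated attribution with the routed source; and returns per-claim verdicts together with an answer-level allow/block decision. Blocked answers can be repaired through retrieval-augmented answer revision and then re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels, split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard reaches block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy falls to 0.229, which shows that exact source ownership remains difficult when sources are semantically close. Repair-and-reverify resolves every blocked answer in the full trace set, often through conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps and retains no wrong attribution. These results show that source attribution is an independent axis of factuality verification for MCP-based agents.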

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

2606.17989v1 by Yonghao Chen, Sicheng Yang, Rui Tang, Lei Zhu

Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at https://github.com/script-Yang/RSF.

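Summary: Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis, but acquiring every MRI sequence is often slow and costly. Recent generative models address this through cross-contrast synthesis, inferring missing contrasts from the available ones. Synthesizing 3D MRI is nonetheless very challenging. Because the volumes are so large, operating directly in pixel space is computationally infeasible, so a common approach is to compress the 3D volumes into a latent space first and then train generative models there. We observe that existing compression architectures have several critical problems: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. These shortcomings ultimately hurt the downstream generative models. We propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) that captures global anatomical dependencies to keep volumetric representations coherent. To reduce semantic degradation during latent compression, we design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher and improves contrast-aware separability in the latent space. We also propose an Anatomy-aware Frequency Loss (AFL) that adaptively preserves diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets show consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. The code is available at https://github.com/script-Yang/RSF.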

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

2606.17979v1 by Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

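Summary: Existing RL post-training methods for text-to-image generation usually turn the final-image reward into a single scalar advantage and apply it with equal strength along the whole generative trajectory. Text-to-image generation, however, has natural temporal and spatial structure: different denoising steps handle different generation stages, and the content that actually determines text alignment often occupies only part of the image. This granularity mismatch makes it hard for policy updates to focus on the generative components that really affect the reward. To address this we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses the text-image attention inside the generative model and starts from the core content in the prompt that the user actually cares about. It builds spatial allocation maps that vary dynamically across denoising steps and rollouts, and allocates the same group-relative advantage to the more relevant latent regions at almost no extra computational cost. STAR then applies stronger policy updates to those regions through a spatially resolved policy objective. With Stable Diffusion 3.5 Medium as the base model, we evaluate on three tasks: GenEval, OCR text rendering, and PickScore. The results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, reaching 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore respectively.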

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

2606.17961v1 by Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

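Summary: Positional encoding is a fundamental component of Transformer architectures because it injects information about the spatial or sequential arrangement of the inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework that represents positional structure through pairwise relations. simPE was originally designed for medical imaging, where geometric robustness matters in particular: small rotations arise naturally during acquisition because of the imaging instruments, patient positioning, or slight misalignment. Despite its empirical promise, the theoretical behaviour of simPE under geometric perturbations has not been fully characterized. This paper studies the robustness of simPE to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is in general not rotation-invariant. We then prove that, under mild Lipschitz assumptions on its elementary components, simPE is stable under rotational perturbations, and we derive explicit perturbation bounds in the Frobenius norm. We validate the findings on four controlled datasets (a synthetic Arrow dataset, a synthetic Shapes dataset with four geometric shape categories, a synthetic Digits dataset, and the benchmark image classification dataset FashionMNIST), in which training and validation images keep a fixed canonical orientation while test images are rotated by increasing angles. On all datasets simPE consistently outperforms standard learned positional encoding in accuracy, F1 score, precision, and recall under rotation, especially in the small-to-moderate angle range, which corroborates the theoretical stability guarantees.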

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

2606.17867v1 by Antonio Scardace, Daniele Ravì

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

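Summary: Multimodal approaches are increasingly adopted in Alzheimer's Disease (AD) research to integrate molecular, structural, clinical, and genetic biomarkers and improve disease characterization, yet the relationships among these modalities remain poorly understood. A systematic analysis of how they interact is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. We present a quantitative analysis of multimodal AD biomarkers that integrates tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects in the ADNI dataset. In the analyses we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) statistically decompose the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. The study provides a systematic characterization of cross-modal relationships and improves the interpretability and selection of biomarkers in AD. Code is publicly available at https://github.com/antonioscardace/Multimodal-AD.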
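
As a rough illustration of step (A), mutual information between two discretized biomarkers can be estimated with a plug-in (count-based) estimator. The sketch below is an assumption about one simple way to do this, not the paper's estimator, and the toy values are invented for the example rather than taken from ADNI.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Invented toy data: a binary carrier flag and a binned cognitive score.
carrier = [0, 0, 1, 1]
score_bin_redundant = [0, 0, 1, 1]    # fully determined by the flag
score_bin_unrelated = [0, 1, 0, 1]    # independent of the flag

mi_redundant = mutual_information(carrier, score_bin_redundant)
mi_unrelated = mutual_information(carrier, score_bin_unrelated)
```

A value near the entropy of one variable suggests the two measurements are largely redundant, while a value near zero suggests they carry independent information. Continuous measures such as regional tau uptake would need binning or a different estimator, and plug-in estimates are biased upward on small samples.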

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

2606.17826v1 by Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.

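Summary: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term can appear in several valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance because they count orthographic variants as errors. To address this we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across a range of ASR models show that multiscript-aware evaluation gives a fairer assessment of recognition quality than conventional single-reference evaluation. We also study the effect of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. Script unification, by contrast, consistently yields the best ASR performance. The dataset and code are publicly available at https://github.com/aitrics-ronaldo/Interspeech_MultiClin.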
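
One simple way to picture multiscript-aware scoring is a multi-reference character error rate: score the hypothesis against every valid orthographic form and keep the best match. The sketch below is an assumption about how such a metric could look, not MultiClin's actual metric, and the Latin/Hangul example pair is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate against a single reference."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def multi_reference_cer(refs, hyp):
    """Character error rate against the closest valid reference."""
    return min(cer(ref, hyp) for ref in refs)


# Invented example: the same term written in Latin script and in Hangul.
references = ["MRI 촬영", "엠알아이 촬영"]
hypothesis = "엠알아이 촬영"

single_ref_score = cer(references[0], hypothesis)
multi_ref_score = multi_reference_cer(references, hypothesis)
```

Here the single-reference score penalizes a transcription that a clinician would accept, while the multi-reference score does not. Enumerating every valid variant of a long utterance grows combinatorially, so a practical metric would more likely normalize or align term by term.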

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

2606.17767v1 by Nikola Kovacevic, Bastien Husler, Di Zhuang, Rafael Wampfler, Barbara Solenthaler

Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

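Summary: Personal health data from wearables is usually presented through dashboards of charts and summary statistics, which leaves users to interpret patterns and implications on their own. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that supports objective reflection on the data in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally the system follows a dual-agent design: an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates the findings as "spoken statistics," deliberately avoiding clinical advice so that the effect of the interaction modality can be isolated. We evaluate the approach in a simulated-self user study (N=5) with a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset and compared traditional dashboard exploration with embodied conversational reflection. The evaluation focuses on perceived understanding, the specificity of the actions participants generated, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for generating objective health data narratives, and early empirical insights into how embodiment affects the interpretation of personal health metrics.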

Vision-language models for chest radiography do not always need the image

2606.17710v1 by Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

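Summary: Medical vision-language models report strong accuracy on chest radiographs, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model that exploits finding-name priors scores like one that reads the scan, and no standard benchmark tells them apart. We introduce a causal audit that intervenes on the image by occluding the relevant region, occluding an irrelevant region, and swapping in another patient's scan with the same label, and that combines three behavioural metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access comes within 5.7 accuracy points of the best multimodal model, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion-parameter text-only baseline. The audit divides the cohort into three models that ignore the image, one that is unstable, and five that use the image selectively, for a subset of findings; these categories hold across a second dataset, image resolution, and prompt phrasing. Compared with board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist in accuracy while its grounding is zero, whereas the image-using models are grounded at rates comparable to radiologists. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.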

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

2606.17702v1 by Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, so improvement is driven entirely by increased pseudo-label quality. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

摘要：對於常規 H&E 染色組織學影像來說，對腫瘤微環境 (TME) 的特徵描述需要同時進行細胞分割、特徵提取和可解釋的臨床報告。我們提出了 SEGTME-UNI2，一個統一的框架來滿足這些需求。其核心是 UNI2-UPERHOVER，一個雙頭分割模型，將 UNI2-H 病理基礎模型（ViT-Giant，預訓練於超過 1 億個來自 10 萬張幻燈片的切片）與兩個平行的 UperNet 解碼器配對：一個用於六類語義分割，另一個用於水平-垂直梯度回歸，以實現基於分水嶺的核實例分離。為了解決大型現實世界數據庫中缺乏像素級標註的問題，UNI2-UPERHOVER 進行了三階段的漸進式偽標籤課程。每個階段訓練一個全新的模型，沒有權重轉移，完全通過提高偽標籤質量來驅動改進：階段 1：使用人類標註的 PanNuke（7,901 張影像，189,744 個細胞核，0.25 微米/像素）。階段 2：使用來自階段 1 模型的熵過濾偽標籤，針對 271,711 個 TCGA-UT 標度 0 補丁（0.5 微米/像素）。階段 3：使用來自階段 2 模型的偽標籤，針對所有 1,608,060 個 TCGA-UT 補丁，涵蓋六個解析度尺度（0.5-1.0 微米/像素）。分割輸出進入一個結構化的 TME 特徵提取管道，計算 20 多個每個補丁的組成、形態學、空間熵和細胞間距度量。這些數據被編碼為 JSON 並傳遞給經過微調的 NVIDIA BioNeMo GPT 模型，以生成臨床可解釋的 TME 敘述。在保留的 PanNuke 和 TCGA-UT 部分上進行的初步驗證顯示了框架的可行性和內部一致性。偽標註的 TCGA-UT 數據集和 UNI2-UPERHOVER 檢查點已公開發布，以支持大規模 TME 輪廓和空間生物學研究。
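
The entropy filtering mentioned for Stage 2 can be sketched as follows. This is an illustration under assumptions: the abstract does not say whether filtering happens per pixel or per patch, nor which threshold is used, so the per-pixel granularity, the function names, and the `max_entropy` value are all hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_pseudo_labels(pixel_probs, max_entropy):
    """Keep the argmax class where the model is confident, else mark the pixel ignored.

    pixel_probs: list of per-pixel probability vectors (e.g. six classes).
    Returns a list with a class index per pixel, or None for filtered pixels.
    """
    labels = []
    for probs in pixel_probs:
        if shannon_entropy(probs) <= max_entropy:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
        else:
            labels.append(None)  # too uncertain to serve as a training target
    return labels
```

A confident prediction such as `[0.97, 0.01, 0.01, 0.01, 0.0, 0.0]` has low entropy and keeps its label, while a uniform six-class vector has entropy ln 6 and is dropped for any threshold below that.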

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

2606.17474v1 by Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observe that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

摘要：大型語言模型（LLMs）越來越被考慮用於臨床諮詢任務，然而大多數醫療評估仍然是靜態的、單回合的或狹隘的結果導向，限制了它們反映現實世界護理的連續性、不確定性和互動性的能力。在此，我們提出了AIPatient Arena，一個基於電子健康紀錄（EHRs）的評估框架，用於評估LLMs在八個臨床能力維度上的臨床效用。該框架將EHR數據整合到患者特定的知識圖譜中，使多回合的醫生-患者互動成為可能。我們在一個由437名患者組成的主要隊列以及兩個分佈外的驗證隊列（119名和67名患者）上應用AIPatient Arena。我們觀察到LLMs在醫療面試提問技能（QS；平均分數，4.43-4.99/5）、倫理和專業行為（ET；4.38-4.93/5）以及臨床解釋的清晰性和透明性（EX；3.80-4.72/5）方面表現良好。在信息整合（II；3.19-4.21/5）和藥物安全性與合理性（MS；3.13-3.78/5）方面表現中等，但在處理模糊患者反應（HR；2.57-3.32/5）、信息覆蓋（IC；2.08-3.02/5）以及診斷準確性和推理（Dx；2.63-3.55/5）方面持續存在弱點。基於過程的評估揭示了重複的互動失敗，包括重複提問、遺漏過去病史以及對不確定性的處理不足。更豐富的對話上下文改善了診斷推理，但在治療計劃方面的增益有限。這些發現表明，僅僅依賴最終答案的準確性不足以評估臨床準備情況，並突顯了評估模型在諮詢過程中如何收集、解釋和傳達信息的重要性。AIPatient Arena提供了一個基於EHR的框架，用於針對工作流程的醫療LLMs預部署評估。

A Machine-Learned Comorbidity Index

2606.17450v1 by Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari

Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.

摘要：傳統的共病指數（例如，Charlson 和 Elixhauser）廣泛用於風險調整和病人分層，但它們有兩個主要限制：（i）它們主要以死亡率為中心，與其他臨床結果不太一致，並且（ii）它們的線性、基於規則的結構無法捕捉非線性、特定結果的風險關係。我們提出了一種機器學習共病指數（MLCI），通過最大化學習分數與多個臨床結果之間的標準化 Hilbert-Schmidt 獨立性標準（nHSIC），將診斷代碼映射到單一標量。MLCI 捕捉非線性風險-結果依賴性，並有理論支持，該理論描述了何時可以在結果之間實現統一的、信息豐富的入院級別排序。對多個基準電子健康記錄（EHR）數據集的實證結果顯示，MLCI 在多個評估指標上超越了強基準。
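
To make the nHSIC objective concrete, here is a minimal sketch of one common normalization of the empirical Hilbert-Schmidt Independence Criterion for two scalar variables. The abstract does not give the exact estimator, kernel, or normalization used by MLCI, so the Gaussian kernel, the `bandwidth` parameter, and the normalization HSIC(X, Y) / sqrt(HSIC(X, X) · HSIC(Y, Y)) are assumptions made for illustration.

```python
import math


def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over scalar values."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


def nhsic(xs, ys, bandwidth=1.0):
    """Normalized HSIC in [0, 1]; returns 0.0 if either variable is constant."""
    K, L = gaussian_gram(xs, bandwidth), gaussian_gram(ys, bandwidth)
    denom = math.sqrt(hsic(K, K) * hsic(L, L))
    return hsic(K, L) / denom if denom > 0 else 0.0
```

Unlike a linear correlation, this quantity is nonzero for a purely nonlinear relationship such as y = x² on symmetric inputs, which is the kind of dependence the abstract says rule-based scores miss.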

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

2606.17443v1 by Xi Chu, Yupeng Hou

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

摘要：大型語言模型（LLMs）正成為消費者尋找產品的主要方式，但我們尚未了解品牌在這一新渠道中的競爭方式。我們研究了在LLM推薦中護膚產品的品牌動態——這是一個消費者在購買前無法輕易判斷質量，必須依賴品牌聲譽的類別——跨越三個商業LLM（GPT-4o-mini、Claude Sonnet、Gemini 3 Flash），並對搜索商品進行了穩健性檢查。在三個實驗中，我們發現：（1）當所有產品具有相同規格時，知名品牌在推薦中獲得100%的機會（IAI = 10.0），但這種主導地位在競爭對手的評分優勢低於+0.1顆星時消失；（2）權威風格的營銷語言，包括虛構的臨床證據聲明，在偏見超額價值等於+0.17評分點時打破了這一壟斷，每個模型的反應不同；（3）在多品牌GEO競爭中的社會困境：當所有品牌採用相同的優化策略時，我們的收益代理從+0.802下降到+0.007，且在我們的測試中未參與的品牌獲得零推薦。我們的結果表明，生成引擎優化（GEO）不僅應被研究為安全風險，還應作為一種新興的營銷實踐來塑造市場競爭。

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

2606.17437v1 by Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

摘要：自動化的標準超聲心動圖視圖分類對於高效的臨床工作流程至關重要，但面臨三個主要挑戰。首先，公開可用的數據集稀缺，且在規模和視圖覆蓋方面有限。其次，一些現代視頻級架構在超聲心動圖視圖分類中的性能仍未被充分探索。第三，一些視圖類別表現出高度相似的空間外觀，使得單幀特徵不足以進行區分，而異質幀質量則使得穩健的時間信息融合變得複雜。為了解決這些挑戰，我們發布了九個視圖的超聲心動圖視頻（EV9V）數據集，包含5,138個視頻、910,579幀和9個標準視圖，據我們所知，這是目前最大的公開可用超聲心動圖視頻數據集。使用EV9V，我們系統性地基準測試了代表性的視頻分類架構，包括卷積神經網絡（CNNs）、遞歸神經網絡（RNNs）和Transformer。此外，我們提出了一個時空融合模型（STFM），這是一個高效的雙流CNN-LSTM（長短期記憶）框架，能夠共同捕捉空間解剖結構和時間心臟動力學。所提出的框架利用不確定性感知學習，在訓練期間優先抽樣代表性視頻片段，並在推斷期間進行基於證據的融合，從而提高對超聲心動圖視頻中幀質量變化的穩健性。大量實驗表明，我們的方法在各種視頻分類模型中達到了競爭性能，驗證了不確定性感知時空學習在超聲心動圖視圖分類中的有效性。代碼可在 https://github.com/bgx666/stfm 獲得。

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

2606.17420v1 by Jianwei Zhang, Xinyu Nie, Jiaxin Yue, Yonggang Shi

Tau PET imaging is central to tracking Alzheimer's disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.

摘要：Tau PET 成像在追蹤阿茲海默症進展中至關重要，但不同掃描儀、協議和放射性示蹤劑之間的系統性差異會引入非生物變異，這會增加生物標記的變異性，降低對疾病影響的敏感性，並可能偏倚後續的臨床評估。調和方法旨在消除這些由地點引起的變化，同時保留生物學上有意義的信號，然而現有方法在來源和目標群體的亞組組成不同時面臨挑戰，可能會將地點效應與生物變異（如 tau 陽性狀態）混淆。我們提出了費曼-卡茨重加權薛丁格橋匹配（FKRSBM）模型來解決這個問題。FKRSBM 通過熵正則化的最優傳輸學習來源和目標分佈之間的直接隨機傳輸過程，而不是像擴散基方法那樣通過高斯噪聲先驗來路由數據。為了強化生物學一致的傳輸，FKRSBM 結合了一個基於亞組的端點提議，該提議源自於對參考橋度量的費曼-卡茨重加權，完全通過在數據層級的分層重要性抽樣來實現，並不需要對基礎的橋匹配求解器或網絡架構進行任何更改。對於基於表面的神經影像學，FKRSBM 使用一個在皮質網格上運作的球形卷積骨幹來執行頂點級的調和。我們在 tau PET SUVR 地圖上評估該方法，將 HABS-HD 群體的 PI-2620 數據調和到 ADNI 的 AV-1451 領域。與 ComBat、CycleGAN、擴散基方法（DF）和未正則化的擴散薛丁格橋匹配（DSBM）相比，FKRSBM 實現了更優的分佈對齊、降低的 tau 陽性標誌不匹配、更強的 APOE 亞組對齊，以及改善的下游疾病分類性能。

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

2606.17405v1 by Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

摘要：臨床決策支持人工智慧系統 (CDSASs) 必須在遵循嚴格安全限制的同時，實時適應不斷變化的患者狀況。我們提出了一個在線自適應框架，該框架整合了治療效果 (TE) 估算以量化臨床效益、患者數位雙胞胎 (DT) 以模擬治療軌跡，以及強化學習 (RL) 用於序列決策。該人工智慧系統最初在歷史醫療記錄上進行訓練，並在持續學習循環中運作。為了確保安全，一個基於規則的模組監測生命體徵並阻止禁忌治療。內部模型存在強烈不一致的案例會被標記以供臨床醫生審查，這在我們的實驗中是通過預訓練的結果模型來模擬的。我們使用合成臨床模擬器和來自癌症基因組圖譜 (TCGA) 的真實卵巢癌數據集來驗證我們的框架。在模擬和臨床環境中，我們的方法在推薦治療方面展示了優越的有效性和穩定性，相較於標準計算基準。此外，該人工智慧系統保持低延遲，並且在我們的實驗驗證中只有少數案例需要專家諮詢，顯示其作為一個安全的、由臨床醫生監督的個性化醫療工具的潛力，並能通過實際使用不斷改進。
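
The safety gating described above (a rule-based block on contraindicated treatments plus a clinician-review flag on strong model disagreement) might look roughly like the sketch below. Everything in it is hypothetical: the `recommend` function, the rule signature, the use of the standard deviation across model estimates as the disagreement measure, and the `max_disagreement` threshold are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def recommend(candidates, vitals, rules, max_disagreement=0.2):
    """Pick the best allowed treatment and decide whether a clinician should review it.

    candidates: dict mapping treatment name -> list of benefit estimates,
                one per internal model (hypothetical ensemble).
    rules:      callables rule(treatment, vitals) -> True if contraindicated.
    """
    allowed = {t: est for t, est in candidates.items()
               if not any(rule(t, vitals) for rule in rules)}
    if not allowed:
        # Every option is blocked: defer entirely to the clinician.
        return {"action": None, "review": True}
    best = max(allowed, key=lambda t: mean(allowed[t]))
    # Flag the case when the internal models disagree strongly on the chosen option.
    return {"action": best, "review": pstdev(allowed[best]) > max_disagreement}
```

For example, a rule such as `lambda t, v: t == "drug_a" and v["systolic_bp"] < 90` removes `drug_a` for a hypotensive patient before any ranking happens, so the learned components never get to override the safety constraint.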

2606.17340v1 by Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

摘要：在單眼內窺鏡中，基於視覺的精確導航因為深度線索有限、組織紋理弱、非剛性變形以及跨領域的外觀變化而變得困難，這些因素都使得姿勢估計、深度預測和影像與解剖對齊變得複雜。儘管最近的視覺基礎模型顯示出潛力，但它們學習到的表示往往在幾何一致性方面仍然不足，這妨礙了穩定的特徵對應，並限制了它們在下游導航任務中的可靠性。我們提出了一個統一框架，用於學習幾何一致且對領域穩健的影像表示，專為單眼內窺鏡設計。該框架結合了一個提供準確幾何監督的合成數據管道，與層次感知幾何-語義適應，這是一種結構化的替代方案，用於標準LoRA，選擇性地在Transformer層次中插入低秩適配器，並將其與層級訓練目標結合，以促進中間特徵中的幾何對應和深層特徵中的語義一致性。對公共和專有數據集的實驗顯示幾何和語義表示質量有所改善，從而在下游導航任務中，包括姿勢估計和單眼深度估計，表現更佳。學習到的表示在臨床支氣管鏡檢查中顯示出有利的合成到實際轉移，並為在有限監督下適應鼻竇內窺鏡和結腸鏡檢查提供了有用的初始化。該框架在模型大小和訓練數據方面也顯示出良好的擴展性。這些結果支持層次感知、幾何引導的適應作為內窺鏡表示學習的一種實用方法。

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

2606.17339v1 by Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

摘要：語音提供了一個獨特的資訊窗口，通過同時參與神經系統、運動系統、呼吸系統和聲音系統來了解健康。當前的臨床語音AI方法主要通過孤立的特定條件研究進展，使得結果難以比較，且難以評估其普遍性。我們介紹了SpeechDx，這是一個大規模的臨床語音AI基準，涵蓋12個數據集和27個任務，涉及多種健康狀況。為了能夠在共享的臨床機制中進行評估，SpeechDx根據它們干擾的語音產生階段來結構任務：概念化、形成和表達。該基準通過包括有限標記數據的任務來測試普遍性，並在多個數據集中評估相同的健康狀況，以區分臨床上有意義的模式與數據集的假象。我們系統地評估了12種最先進的音頻編碼器在所有任務中的表現，以及在零樣本跨條件轉移下的表現。結果顯示，大規模語音模型代表了最強的整體基準，特定領域模型僅在密切匹配的任務上提高性能，而目前沒有任何表示能夠在臨床語音領域中可靠地普遍化。SpeechDx建立了一個共享的評估框架，以追踪朝向通用臨床語音表示的進展。

Symbolic Informalization: Fluent, Productive, Multilingual

2606.16893v1 by Aarne Ranta

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In traditional proof-system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

摘要：符號非正式化使得正式數學能夠可靠地轉換為自然語言。它有潛力使機器檢查的內容在人類可讀的情況下不失精確性。在傳統的證明系統使用中，符號非正式化將有限的語法糖機制概括為數學的普通語言。在由人工智慧和自動形式化構建證明的環境中，符號非正式化可以解釋究竟構建了什麼。本文概述了項目Informath，旨在展示符號非正式化如何在合理的開發努力下產生流暢的文本，並處理多種正式和自然語言。Informath基於一種跨語言架構，其中Dedukti作為不同證明系統（Agda、Lean、Rocq）之間的樞紐，而語法框架（GF）則負責不同自然語言中的語言正確性和變化。

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

2606.16890v1 by Sanjay Basu

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

摘要：聚合準確性基準隱藏了大型語言模型在電子健康紀錄（EHR）問題回答上失敗的系統性結構：需要更多推理步驟的問題產生不成比例的錯誤。基於對Transformer組合性限制的理論結果，我們引入了一個預先指定的跳數分類法——回答來自EHR的臨床問題所需的不同推理步驟數量——作為模型失敗的原則性預測指標。我們對313個臨床醫生生成的MedAlign EHR問題-回答對進行了標註，涵蓋了四個跳數級別，並在模型內部消融（claude-sonnet-4-6，零樣本與擴展思考）和跨架構複製（gpt-4o和gpt-5.4-2026-03-05，零樣本）中評估了301個問題。所有三個模型，涵蓋了兩個提供者和兩個OpenAI世代（GPT-4和GPT-5），都顯示出隨著跳數的增加準確性單調下降：Claude Sonnet零樣本從30.6%（跳=1）下降到17.6%（跳=4）（Cochran-Armitage z=-2.30，p=0.011；每跳的OR 0.72，95% CI [0.56,0.92]，p=0.008）；GPT-4o複製了這一點（37.8%下降至14.7%；OR 0.58 [0.45,0.75]，p<0.001）；而gpt-5.4-2026-03-05確認了這一點（37.8%下降至23.5%；OR 0.80 [0.66,0.98]，p=0.027）。一項預先指定的上下文充分性審計顯示，高跳數問題並未因EHR截斷而受到差異性劣勢（在跳數2-4時可回答率為93-95%，而在跳=1時為79%），因此下降反映了組合推理的困難。擴展思考並未顯著平坦化三種推理條件下的準確性-深度曲線，且思考標記的使用隨著跳數的增加而增長（r=0.31，p<0.0001），與預測的O(k)計算需求一致。因此，跳數成為一個理論驅動的、跨架構的預測指標，用於大型語言模型在EHR問題回答上的錯誤，對臨床AI的部署風險分層具有直接的影響。
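
The trend statistic cited in the abstract can be sketched with the standard Cochran-Armitage formula. The per-hop counts used in the usage note are hypothetical (the abstract reports only percentages and test results, not the number of questions per hop level), so this does not reproduce the reported z = -2.30.

```python
import math


def cochran_armitage(scores, totals, successes):
    """Cochran-Armitage test for a linear trend in proportions.

    scores:    ordinal score per group (here, hop count).
    totals:    number of questions per group.
    successes: number of correct answers per group.
    Returns (z, one-sided p-value for a decreasing trend).
    Raises ZeroDivisionError if the pooled accuracy is 0 or 1, or all scores are equal.
    """
    N = sum(totals)
    p = sum(successes) / N  # pooled accuracy under the null of no trend
    t_stat = sum(s * (r - n * p) for s, n, r in zip(scores, totals, successes))
    variance = p * (1 - p) * (
        sum(n * s * s for s, n in zip(scores, totals))
        - sum(n * s for s, n in zip(scores, totals)) ** 2 / N
    )
    z = t_stat / math.sqrt(variance)
    p_lower = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return z, p_lower
```

With made-up counts of 100 questions per hop level and 40, 30, 25, and 15 correct answers at hops 1 to 4, `cochran_armitage([1, 2, 3, 4], [100] * 4, [40, 30, 25, 15])` gives a negative z, indicating accuracy that declines with hop count; identical accuracy at every level gives z = 0.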

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

2606.16868v1 by Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

摘要：雖然聯邦學習 (FL) 使得在不集中敏感數據的情況下進行協作醫學影像分割成為可能，但實際部署常常因為跨站點標籤的不完善而變得複雜，例如輪廓不一致、缺失或多餘的結構以及混淆的標籤。聯邦噪聲標籤學習 (FNLL) 旨在減輕這些影響，但在實踐中仍然使用不足，因為現有的證據主要基於合成噪聲、簡化的設置和有限的實際噪聲評估。我們通過引入一個基準套件來解決這一空白，該套件結合了多樣的實際噪聲數據集、與部署相關的客戶端噪聲場景以及針對標籤噪聲的評估，以支持系統性的 FNLL 評估和知情的方法選擇。該套件結合了來自多個來源的策劃實際噪聲醫學影像分割數據集，以及包括各種客戶端噪聲場景和針對噪聲的評估的綜合聯邦分割框架。所呈現的套件為醫學影像分割中的 FNLL 評估提供了現實且具辨別力的基礎，並為公平基準測試、數據集特定的標籤噪聲特徵化以及在現實聯邦設置下的未來方法開發建立了可重用的基礎。代碼可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 獲得。

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

2606.16813v1 by Rahul Suresh Babu, Rohit Shukla

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as "handle my appointment" or "take care of this email" may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.

摘要：工具增強的LLM代理依賴於運行時過濾來決定在每個步驟中哪些工具應該可見。因果最小工具過濾（CMTF）通過僅暴露下一個因果必要的工具邊界來減少工具選擇的混淆，但它假設用戶請求已經映射到一個符號目標狀態。實際上，像「處理我的約會」或「處理這封電子郵件」這樣的請求可能對應於多個可能的目標。這會導致錯誤目標執行，其中代理遵循有效的因果工具路徑以達成非預期的目標。我們引入GIST-CMTF，一個目標狀態推斷層，該層預測候選符號目標，使用與CMTF相同的狀態轉換詞彙，估計歧義，並根據需要應用CMTF或將澄清作為因果行動來暴露，從而產生缺失的目標或狀態變量。我們在七個模型後端、六種過濾方法和120個受控工具使用任務中評估GIST-CMTF。GIST-CMTF的任務成功率達到97.0%，而頂級目標CMTF為80.1%，語義目標CMTF為82.9%。它將錯誤目標執行率從頂級目標CMTF的19.4%降低到2.5%，同時保持因果過濾的單一工具暴露，並使用的標記數量明顯少於所有工具暴露。這些結果表明，可靠的工具增強代理在暴露外部行動之前應驗證目標狀態，而不僅僅是工具的相關性。
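
The control flow described above (infer candidate goals, estimate ambiguity, then either expose one causally necessary tool or expose clarification) can be sketched as follows. This is an illustration only: the `Tool` schema, the backward-chaining `next_tool` helper, the top-two probability `margin` as the ambiguity estimate, and the `ask_clarification` action name are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset  # state variables that must already be known
    produces: str        # state variable this tool makes available


def next_tool(goal, state, tools):
    """Backward-chain from the goal to the single tool that is runnable right now."""
    needed, seen = goal, set()
    while needed not in state and needed not in seen:
        seen.add(needed)
        producer = next((t for t in tools if t.produces == needed), None)
        if producer is None:
            return None  # nothing can produce this variable
        missing = [v for v in sorted(producer.requires) if v not in state]
        if not missing:
            return producer.name
        needed = missing[0]  # first resolve an unmet precondition
    return None  # goal already satisfied, or a dependency cycle was hit


def expose(goal_probs, state, tools, margin=0.3):
    """Expose clarification if the top two goals are too close, else one causal tool."""
    ranked = sorted(goal_probs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or (len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin):
        return "ask_clarification"
    return next_tool(ranked[0][0], state, tools)
```

For a request like "handle my appointment", goal probabilities of 0.5 for cancelling and 0.45 for rescheduling would trigger clarification instead of silently following the top goal, which is the wrong-goal failure the abstract describes.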

AgentFairBench: Do LLM Agents Discriminate When They Act?

2606.16723v1 by Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand and Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by about 2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, claude-haiku-4-5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.

摘要：大型語言模型（LLM）代理人越來越多地採取行動（篩選申請人、推薦信用、分診病人），然而LLM的公平性仍然是通過評分答案來衡量的。我們介紹AgentFairBench，一個便宜、可重複的多領域基準，用於衡量LLM代理人行動中的人口統計差異。這一基準基於一個伴隨框架，即偏見傳導框架（BCF，此處重述），涵蓋三個以監管者為基礎的領域：招聘、貸款和醫療分診。合成的人口統計中立檔案在反事實匹配集中的評估中，僅變化一個名稱編碼的種族 x 性別信號（遵循Bertrand Mullainathan的傳統），在四種逐漸增加代理權的代理架構下（直接、思考鏈、多代理協商、工具增強）。一個僅使用NumPy的工具計算反事實翻轉率、平均絕對分數差（MASD）、行動率差異和工具調用差異，並提供自助信心區間、配對測試和假發現率控制，每個模型的成本僅為單位數美元。一個實時排行榜帶有保留的私有拆分和污染警示，通過提交允許外部模型參與。我們的試點（864個決策加上測試重測複製）帶來了一個方法論教訓：將六組分數差與兩次運行的噪音差進行比較，僅通過統計的相同性就過度強調了差異約2.4倍。在一個相同性匹配的噪音基準和一個綜合組測試中，claude haiku 4 5顯示沒有超過抽樣噪音的人口統計效應（120對比中0個和9個綜合對比中0個在校正後存活）；一個植入偏見的測試確認該工具在存在差異時能夠檢測到差異。這一貢獻是一個可靠、敏感、可採用的工具，具有相同性匹配的零假設方法論，以及可擴展的開放文物。代碼、數據和工具在開放許可下發布，並附有匿名的審查文物。
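
Two of the metrics named above can be sketched in a few lines. The abstract does not spell out their definitions, so the ones below are assumptions: the flip rate is taken as the share of matched sets in which the decision is not identical across demographic variants, and MASD as the mean absolute score difference over all variant pairs within a set. The paper describes a NumPy-only harness; this sketch uses only the Python standard library, and the data layout is hypothetical.

```python
import random
from itertools import combinations


def flip_rate(matched_sets):
    """Share of matched sets whose decision differs across demographic variants.

    Each matched set is a dict: group name -> (decision, score).
    """
    flips = sum(1 for s in matched_sets if len({d for d, _ in s.values()}) > 1)
    return flips / len(matched_sets)


def masd(matched_sets):
    """Mean absolute score difference over all variant pairs within each set.

    Assumes every matched set contains at least two variants.
    """
    diffs = []
    for s in matched_sets:
        scores = [score for _, score in s.values()]
        diffs.extend(abs(a - b) for a, b in combinations(scores, 2))
    return sum(diffs) / len(diffs)


def bootstrap_ci(matched_sets, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole matched sets with replacement."""
    rng = random.Random(seed)
    n = len(matched_sets)
    values = sorted(
        stat([matched_sets[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return values[int(alpha / 2 * n_boot)], values[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling whole matched sets rather than individual decisions keeps the counterfactual pairing intact inside each bootstrap replicate.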

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

2606.16721v1 by Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception-dynamics-planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

摘要：醫療診斷和治療是動態過程，患者狀態隨時間演變，臨床干預改變未來結果。儘管當前的醫療人工智慧可以檢測疾病、估算風險並生成報告，但許多系統仍然返回靜態標籤或分數，對於疾病如何進展或替代干預如何改變其軌跡提供有限的洞察。醫療世界模型通過學習患者狀態動態的內部模擬器，將人工智慧中的世界模型概念應用於醫療保健。它們的長期目標是幫助臨床醫生預測惡化、比較不同治療條件下的未來走向，並為個別患者量身定制護理。然而，相關工作仍然分散在基礎模型、縱向建模、疾病模擬、治療效果估算、強化學習和數位孿生之間。為了彌補這一差距，本綜述概述了一個推進醫療人工智慧的路線圖，從孤立的診斷和預測轉向模擬疾病演變並支持干預決策的醫療世界模型。這個路線圖圍繞三個相互關聯的能力組織：患者狀態構建、臨床動態建模和干預決策支持。通過對代表性系統的比較，本文突顯了每個能力的貢獻，以及如何將部分組件整合到更成熟的感知--動態--規劃系統中。最後，我們指出了將看似合理的推演（rollouts）轉變為臨床有用模擬器所面臨的挑戰。相關文獻可在 https://github.com/1999kevin/awesome_medical_world_models 取得。

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

2606.16568v1 by Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide \textsc{Hold} or \textsc{Shift} and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

摘要：可靠的輪流發言對於口語對話系統至關重要。然而，大多數現有方法是為兩位講者的互動而設計，並且在處理包含重疊和快速講者變換的現實多方音頻時表現不佳。我們在VoxConverse數據集上研究多方輪流發言，並提出一種僅使用音頻的兩階段管道，將觸發輪轉邊界的時機與實際是否轉移發言權分開。快速觸發器掃描音頻並提出候選的結束輪轉時間，而輕量級驗證器僅在這些時間運行，以決定\textsc{Hold}或\textsc{Shift}並支持下一位講者的預測。我們在完整的多方設置和受控的雙人前兩名投影中報告結果，以便進行比較。我們還研究了基於擴散的、保持標籤的背景音頻混合作為數據增強策略。結果顯示，相較於基線，輪轉檢測有所改善，並且擴散增強進一步提升了效果。

Unified Multimodal Model for Brain MRI Imputation and Understanding

2606.16484v1 by Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance imaging (MRI) analysis. To address potentially missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

摘要：多模態大型語言模型（MLLMs）在醫學領域具有巨大的潛力，因為它們繼承了LLM的知識並允許多種數據模態的整合、分析和用自然語言解釋。然而，醫學MLLM的領域受到非平凡挑戰的限制，尤其是高質量訓練數據的稀缺以及在現實臨床環境中經常出現的缺失數據。在此，我們提出了一種新穎的統一多模態模型UniBrain，用於腦部磁共振影像（MRI）分析。為了解決潛在的缺失腦部MRI模態，我們採用統一的訓練策略來執行聯合影像模態插補和腦影像理解。在訓練過程中，構建了一個交錯且描述豐富的數據流，以自回歸的方式訓練模型，使其能夠利用生成的多模態數據進行醫學推理。引入了一種自對齊策略，以利用密集的影像嵌入來學習細緻的解剖特徵，而無需詳細的影像標題。此外，我們提出了一種動態隱藏狀態機制，以減輕長上下文多模態推理過程中的曝光偏差。在多疾病腦部MRI數據集上的廣泛實驗顯示，UniBrain在不同程度的模態不完整性下，實現了腦影像插補、理解和疾病診斷的高性能。

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

2606.17115v1 by Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.

摘要：基礎模型（FMs）已經成為醫療數據強大的表徵提取器，但它們在分佈偏移下對數據集的可泛化性仍然未被充分探討。這項工作系統性地評估了基於FM的表徵在兩個來自授權內部（IH）腫瘤數據集的真實商業隊列IH-BC和IH-NSCLC上的一系列計算病理任務。分析重點集中在來自IH多模態數據的兩種模態：全切片圖像和轉錄組圖譜。我們首先在八個下游分類任務上基準測試五個FMs的單模態探測性能，發現圖像和組學表徵具有互補的預測信號。然後，我們通過比較三種基於配對表徵的圖像-組學融合策略，調查多模態融合是否能在單模態基線之上獲得額外的增益。所選單模態和多模態管道的可信度進一步通過保形預測（conformal prediction）進行評估。我們的結果顯示，FM表徵在分佈外數據上達到具競爭力的性能，並且多模態融合主要在沒有單一模態主導信號時才有幫助。保形預測顯示，在大多數點預測失敗的情況下，真實診斷仍然包含在預測集內，強調了具不確定性意識的推理對臨床支持的價值。
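
The conformal-prediction step mentioned above can be sketched with the standard split-conformal recipe. This is a generic illustration, not the authors' pipeline: the nonconformity score `1 - p(true class)` and both function names are assumptions.

```python
import math
from typing import List, Sequence


def conformal_threshold(
    cal_probs: Sequence[Sequence[float]],
    cal_labels: Sequence[int],
    alpha: float = 0.1,
) -> float:
    """Calibrate a score threshold targeting (1 - alpha) marginal coverage."""
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: fall back to including every class.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], threshold: float) -> List[int]:
    """Return every class whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]
```

When the top-1 prediction is wrong but the true class still falls inside `prediction_set(...)`, the diagnosis is "recoverable" in the sense the abstract describes.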

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

2606.16434v1 by Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by 1.91 times and RMSE by 2.13 times.

摘要：準確的健康狀態（SOH）估計是鋰離子電池管理的一項關鍵診斷服務。然而，依賴勞動密集型的人工特徵工程和不透明的黑箱模型妨礙了可擴展的工業部署。為了解決這個問題，我們介紹了TC-SOH：一種模組化、即插即用的服務架構，用於自主的端到端SOH預測。TC-SOH採用時間對比機制和跨窗口預測的前置任務，直接從原始操作數據中提取與退化相關的表示。為了提高透明度，我們將模型效能與表示診斷相連接：可視化、敏感性分析、冗餘分析、雙向探測、未來SOH探測和時間洗牌顯示，學習到的特徵與選定的專家描述符重疊，同時保留額外的SOH相關變異，並且有序的時間上下文改善了後續的SOH預測。在四個公共數據集上，TC-SOH超越了考慮的物理知識驅動和數據驅動的基準，將MAPE降低了1.91倍，RMSE降低了2.13倍。
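
For readers less familiar with the two error metrics reported above, here is a minimal sketch of MAPE and RMSE. The helper `reduction_factor` assumes that "reducing MAPE by 1.91 times" means baseline error divided by model error, which the abstract does not state explicitly.

```python
import math
from typing import Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    terms = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the targets."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def reduction_factor(baseline_error: float, model_error: float) -> float:
    """Assumed reading of 'reduced by N times': baseline error / model error."""
    return baseline_error / model_error
```

Under this reading, a factor of 1.91 would mean the model's MAPE is roughly 52% of the baseline's.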

Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

2606.16362v1 by Sourya Sengupta, Mark A. Anastasio

Deep neural networks have achieved strong performance in medical image classification, but they often behave as black boxes. Commonly used post-hoc interpretation methods typically provide heuristic visualizations whose relationship to the classifier's predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier's predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to decompose an input image into a high local-sensitivity component and its orthogonal complement. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.

摘要：深度神經網絡在醫學影像分類中取得了強大的表現，但通常運作如同黑箱。常用的事後解釋方法往往提供啟發式的可視化，其與分類器預測分佈的關係是間接的。本研究引入了一個基於訓練後分類器的輸入依賴費雪信息矩陣（iFIM）的局部靈敏度分析框架。iFIM 描述了分類器的預測分佈在輸入影像的無窮小擾動下如何變化。通過使用 Gram 矩陣的形式，可以在不明確形成完整影像維度費雪矩陣的情況下恢復 iFIM 的非零特徵譜。然後，利用主導的 iFIM 特徵空間將輸入影像投影到高局部靈敏度組件及其正交組件中。這些組件提供了模型內在的局部預測靈敏度描述，而不是傳統的逐像素歸因熱圖或與任務相關的解剖結構的因果分割。該框架在使用多個分類器架構的控制和臨床醫學影像分類任務上進行了評估。基於擾動的實驗表明，高靈敏度的 iFIM 組件與預測信心和分類性能的變化之間的耦合比低靈敏度的補充組件更強。這些結果支持 iFIM 框架作為分析局部決策靈敏度的原則性工具，並補充現有的基於歸因的醫學影像可解釋性方法。
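
The Gram-matrix idea can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a linear-softmax classifier (weights `W`, bias `b`) so that input gradients have a closed form, and all function names are hypothetical. It writes the input-space Fisher matrix as `F = G @ G.T` with one column per class, then recovers the nonzero spectrum from the small `C x C` matrix `G.T @ G`.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def ifim_factor(W: np.ndarray, b: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return G (D x C) such that the input-space Fisher matrix is G @ G.T.

    Column c is sqrt(p_c) * grad_x log p_c, and for linear logits W @ x + b
    the gradient is W.T @ (e_c - p).
    """
    p = softmax(W @ x + b)
    num_classes = p.shape[0]
    grads = W.T @ (np.eye(num_classes) - p[:, None])  # column c = W.T @ (e_c - p)
    return grads * np.sqrt(p)[None, :]


def leading_eigenpairs(G: np.ndarray, k: int):
    """Top-k eigenpairs of G @ G.T computed from the C x C Gram matrix G.T @ G."""
    evals, evecs = np.linalg.eigh(G.T @ G)  # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]
    lam = evals[idx]
    # Map Gram eigenvectors back to input space; requires lam > 0, so k <= C - 1.
    U = G @ evecs[:, idx] / np.sqrt(lam)
    return lam, U


def split_input(x: np.ndarray, U: np.ndarray):
    """Split x into its projection onto span(U) and the orthogonal remainder."""
    high = U @ (U.T @ x)
    return high, x - high
```

For a softmax output with `C` classes the Fisher matrix has rank at most `C - 1`, so the `C x C` eigenproblem is tiny compared with the `D x D` one when `D` is the number of pixels.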

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

2606.16337v2 by Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

摘要：臨床表格數據的預測建模對於臨床決策支持至關重要，因此不僅需要強大的預測性能，還需要透明的決策邏輯。儘管深度學習和基於樹的集成方法可以實現高準確度，但它們的黑箱特性仍然是臨床應用的一大障礙。醫療數據的常見特徵使這一挑戰更加嚴峻，包括樣本量有限、類別嚴重不平衡，以及由於診斷標準和臨床文檔變更而產生的特徵演變。為了解決這些問題，我們提出了醫療啟發式學習（MHL），這是「超越梯度的學習」（learning-beyond-gradients）範式在臨床表格預測中的一種具體實現。MHL不依賴於神經網絡權重更新，而是使用一種由大型語言模型（LLM）驅動的工作流程，該流程整合了統計探測、醫療知識探測、規則綜合和代碼級迭代優化，以優化一個確定性且可執行的決策系統。最終模型不是以不透明的參數表達，而是以版本化的純Python決策規則表達，這些規則明確可解釋、完全可審計，並且具有臨床依據。MHL還支持持續學習：從先前驗證的規則出發，在數據漂移或特徵演變下使用更新的特徵信息迭代修訂這些規則。對醫療數據集的全面實驗顯示，MHL達到了與最先進方法相當的性能，同時在小樣本和高度不平衡的環境中保持了穩健的表現。結果進一步表明，這種明確的規則更新機制有助於減輕特徵演變下的災難性遺忘。總體而言，這些發現表明，非基於梯度的啟發式系統為高風險的臨床決策支持提供了一種透明且可適應的替代方案。

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

2606.16234v1 by Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes--the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

摘要：眼底螢光血管造影（FFA）對於評估視網膜血管異常至關重要，但其獲取方式具有侵入性且並不總是可行。相對而言，彩色眼底攝影（CFP）是非侵入性的且廣泛可及，這促使了對CFP到FFA合成的研究。然而，先前的工作僅依賴於CFP的表面紋理，根本限制了重建功能性血管信息和微妙病理變化的能力。為了解決這個問題，我們提出了一個新框架，通過光學相干斷層掃描（OCT）提供的結構指導，將FFA從CFP合成。我們構建了一個多模態視網膜影像數據集，包含3,676名患者眼睛的配對CFP、FFA和OCT——這是視網膜影像中首個三模態對齊數據集。為了彌合OCT和眼底模態之間的空間差距，我們提出了一個空間對齊跨模態融合（SACMF）模塊，該模塊將深度解析的OCT特徵投影到眼底平面，並通過自適應層歸一化將其注入CFP編碼器。除了特徵融合外，我們進一步引入了基於Token的跨模態對齊（TCMA），這是一種在對應空間位置上明確對齊CFP和FFA表示的token級對比學習策略。我們的方法在合成性能上超越了最先進的方法。此外，大量實驗表明，我們的方法合成的FFA影像在下游疾病診斷性能上帶來了比現有方法更大的改善，突顯了我們的方法作為日常工作流程中非侵入性決策支持工具的臨床潛力。代碼可在 https://github.com/while-plus/OCT-guide-FFA-Syn 獲得。

Embedded Arena: Iterative Optimization via Hardware Feedback

2606.16190v1 by Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu, Chun-Cheng Chang, Jiayi Shao, Yujia Liu, Emmanuel Azuh Mensah, Edward Wang, Kurtis Heimerl, Gregory D. Abowd, Shwetak Patel, Natasha Jaques, Vikram Iyer

Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware -- compiling, flashing, and measuring on real hardware -- to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with <3.3% accuracy loss and 400x for audio with <6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

摘要：從野生動物監測站到臨床可穿戴設備，嵌入式設備由於延遲、通信或隱私限制，需要在本地進行 AI 推斷。為異構微控制器 (MCUs) 優化模型需要同時滿足記憶體、功耗和溫度方面的嚴格物理限制，同時保持準確性，這是一個多維優化問題，目前由專家手動完成。我們探討 LLM 代理能否在實際硬體反饋的指導下自主完成這個複雜的多輪流程，並介紹一個硬體在迴路的代理競技場，代理在其中反覆精煉模型和韌體（在實際硬體上編譯、燒錄和測量），以實現閉環優化。前沿模型，包括 Claude Opus 4.7 和 Gemini 3.1 Pro，在沒有硬體反饋的情況下完全失敗（0% 部署成功），而我們的硬體在迴路方法在三次迭代內實現了第一次成功部署，並且在七次迭代內可以超越人類專家的結果。這種代理協同優化為視覺模型實現了 250 倍壓縮，準確性損失小於 3.3%，為音頻模型實現了 400 倍壓縮，特徵錯誤率 (FER) 損失小於 6%，從而能夠通過太陽能收集在商用 MCU 上實現無電池運行。我們在兩個現實世界系統中展示了實際影響：一個麋鹿檢測相機陷阱（96.7% 準確率）和一個用於兒童發展研究的音標轉錄可穿戴設備（8.44% FER）。

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

2606.16153v1 by Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

摘要：醫學影像分割在臨床診斷、治療計劃、疾病監測和神經疾病識別中扮演著關鍵角色。本文提供了其系統發展的綜合回顧，涵蓋了廣泛使用的公共數據集、基於U-Net、Transformer和SAM架構的代表性方法，以及關鍵評估指標及其差異，接著從多個角度分析主要挑戰。與專注於單一模型家族或特定臨床應用的調查不同，這篇回顧將基於U-Net、Transformer和SAM的方法組織在一個統一的分析框架內，特別關注它們在提高分割準確性和效率方面的有效性。這項工作旨在指導未來的研究並支持醫學影像分割的臨床轉化，所有相關資源均可在我們的GitHub儲存庫中公開獲得：https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main。

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

2606.16149v1 by Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare diseases (a prevalence below 1 in 1,000,000, with ultra-rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare-disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

摘要：大多數醫療人工智慧系統透過堆疊額外的機制來提高性能：更多的微調數據、更多的代理和/或更大的檢索數據庫。然而，在罕見疾病診斷中，這種擴展可能會產生難以部署、審核和維護的系統。我們探討是否可以改為通過擴展單個人工智慧代理的推理鏈來實現最先進的診斷性能：以人類與人工智慧合作開發的診斷策略來引導它，並使用免費的生物醫學工具進行增強。我們介紹了LiteOdyssey，一個輕量級的罕見疾病診斷框架，引導推理語言模型完成臨床遺傳學工作流程。這個框架是通過基於人類反饋的策略迭代（PIHF）開發的，並動態訪問公共生物醫學工具。在兩個僅提供患者臨床特徵的具有挑戰性的基準上，LiteOdyssey實現了最先進的性能，在LIRICAL（n = 370）與PhenoPacket Store（n = 873）合計1,243個案例上的整體疾病Recall@1為59.3%。這兩個基準中超罕見疾病的比例很高（盛行率低於1/1,000,000，超罕見的比例分別約為45%和52.8%）。在更具挑戰性的PhenoPacket子集上（即致病疾病未在我們的罕見度映射流程中映射到Orphanet的案例），LiteOdyssey實現了60.7%的Recall@1，而同一基線模型（GPT-5.4）在未使用工具的情況下僅為10.7%。這一性能是在沒有微調、多代理集成或大型案例檢索數據庫的情況下實現的。在以下情況中也觀察到了增益：開發期間從未見過的案例、一個真實世界罕見疾病患者的私有隊列，以及一個較小的開放權重模型。LiteOdyssey指出了一條通往準確、更易於部署且對醫生審查更透明的罕見疾病人工智慧系統的道路。

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

2606.16074v1 by Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

摘要：動機：病人生成的文本包含了病人生活經歷、社會背景和護理參與的重要信息，但仍然大多是非結構化的，限制了其在以病人為中心的結果研究中的使用。先前的工作引入了PV-Miner基準和PVMinerLLM模型以進行結構化提取。然而，僅僅依賴監督式微調（SFT）在稀有、細緻和不均勻分佈的錯誤方面面臨挑戰，特別是在對令牌關鍵的結構化輸出中。
結果：我們提出了PVminerLLM2，一組改進的LLM，用於結構化病人聲音提取，應用偏好優化來解決超出監督式微調範疇的令牌關鍵錯誤。我們的方法引入了(i) 一個帶有令牌級門控穩定項的偏好目標，防止在偏好優化過程中絕對令牌可能性的下降，以及(ii) 具混淆感知的偏好對構建，以更好地捕捉低分離區別。我們進一步結合了令牌重要性加權和逆頻率重加權，以解決令牌不平衡和類別偏斜。在多個模型大小中，PVMinerLLM2始終超越強基準，實現了高達4.43%（代碼）、3.50%（子代碼）和1.55%（跨度）的增益，並且超越了使用現有偏好優化方法訓練的基準LLM。
可用性和實施：PVminerLLM2的補充材料、代碼、評估腳本和訓練模型可在以下網址公開獲得：https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
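
The inverse-frequency reweighting mentioned in the abstract can be illustrated with one common formulation, `w_c = N / (K * n_c)`, where `N` is the number of samples, `K` the number of classes, and `n_c` the count of class `c`. The exact weighting used by PVminerLLM2 is not given in the abstract, so the formula and function name below are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-class weights proportional to inverse class frequency.

    Weights are scaled so that a perfectly balanced label set yields 1.0
    for every class.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}
```

Rare classes receive weights above 1 and frequent classes below 1, so their contributions to a weighted loss are evened out.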

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

2606.15931v1 by Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim 2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus $7$-$10\%$ for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

摘要：歷史醫療檔案和傳統醫藥在藥物發現方面具有巨大的潛力，並且仍然是當前藥物開發的主要來源。然而，前本體論的散文和特有的分類法阻礙了數據的標準化和醫療現代化，以便用於當前的生物醫學管道。此外，現有的 LLM 代理系統，無論是工具調用、檢索增強還是代理深度研究，都無法將這類文本轉換為可驗證的藥物發現線索。我們通過 DeepRoot 彌補了這一空白，這是一個多代理 LLM 系統，聯合構建和利用經過驗證的知識圖譜，顯示出基礎和推理——通常被混淆——是系統可以組合的可分離軸，用於治療推理。應用於《神農本草經》，DeepRoot 恢復了 $10$ 個 $21$ 個保留的化合物-疾病治療對，在 R@$20$ 下的表現為 $47.6\%$（相比之下，原始語料庫 LLM 為 $4.8\%$，隨機約 $2.4\%$），並在推理質量的 LLM 作為評審的審計中超越了基準 LLM 和直接工具調用訪問相同 API 的 LLM，這些 API 是 DeepRoot 自身查詢的。使用工具的 LLM 在 $87\%$ 的主張上出現幻覺，而 DeepRoot 的比例為 7-10%。僅圖譜推理的幻覺為 $0\%$，但在推理連貫性上排名最低；DeepRoot KG+LLM 是唯一在兩個軸上都獲勝的條件，指向系統性挖掘和重新利用歷史醫療知識的路徑。
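
The R@20 figures quoted above follow directly from the stated counts. The sketch below shows a generic Recall@k computation and checks the arithmetic: 10 of 21 held-out pairs is about 47.6%, and 1 of 21 is about 4.8%. The function name and data layout are hypothetical, and the reading of the 4.8% baseline as a single hit is an inference from the numbers, not something the abstract states.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold item appears in the top-k ranked candidates."""
    hits = sum(
        1 for query, target in gold.items()
        if target in rankings.get(query, [])[:k]
    )
    return hits / len(gold)


# Arithmetic check on the reported percentages.
deeproot_recall = round(100 * 10 / 21, 1)  # 47.6
baseline_recall = round(100 * 1 / 21, 1)   # 4.8
```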

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

2606.15810v1 by Yuyang Dai, Yushun Dong

Large language models deployed as commercial APIs are vulnerable to model extraction attacks, while existing defenses either act too late or degrade utility for legitimate users. We propose \textbf{Knowledge Trap}, a defense that redirects extraction attacks toward low-transferability knowledge through a \emph{Honeypot Knowledge Graph} (HKG) and breadcrumb-guided exploration. Instead of blocking queries or perturbing outputs, Knowledge Trap consumes the attacker's limited query budget on knowledge with negligible downstream utility while preserving benign-user performance. Experiments in medical and financial domains show that Knowledge Trap reduces surrogate Agreement by 6.2\% on average without degrading legitimate-user accuracy, outperforming existing defenses that impose measurable user impact. These results suggest that defending knowledge-space traversal is a practical direction for mitigating LLM extraction attacks.

摘要：大型語言模型作為商業API部署時，容易受到模型提取攻擊，而現有的防禦措施要麼反應太晚，要麼降低合法用戶的效用。我們提出了\textbf{知識陷阱}（Knowledge Trap），這是一種通過\emph{蜜罐知識圖}（HKG）和麵包屑引導探索將提取攻擊重定向到低可轉移性知識的防禦措施。知識陷阱並不阻止查詢或擾動輸出，而是使攻擊者有限的查詢預算消耗在下游效用可忽略的知識上，同時保留良性用戶的性能。在醫療和金融領域的實驗顯示，知識陷阱使替代模型的一致率（Agreement）平均降低了6.2\%，而不降低合法用戶的準確性，優於會對用戶造成可測量影響的現有防禦措施。這些結果表明，防禦知識空間遍歷是一個減輕LLM提取攻擊的實用方向。

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

2606.15735v2 by Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

摘要：出院摘要是關鍵的臨床文件，包含患者整體住院期間的背景，並且經常被醫療專家用於患者再入院、持續護理和診斷決策的審查。在審查這些摘要時，醫療專家通常必須在多個摘要之間反覆綜合信息，同時驗證支持每個答案的證據。儘管大型語言模型（LLMs）在臨床問題回答中越來越受到關注，但現有的基準並未充分反映這一情境：它們通常評估考試風格的醫學知識或專注於單輪問題回答，並且對證據基礎的評估有限。我們介紹了EHRNote-ChatQA，這是首個針對患者多份出院摘要的證據基礎多輪臨床問題回答的基準。EHRNote-ChatQA基於去識別化的MIMIC-IV出院摘要，包含967個患者級別的多輪樣本，涵蓋一到五份摘要，以及16,072對經醫療專家驗證的問答對（8,036個內容問題，每個問題都配有一個證據基礎問題），跨越八個臨床類別。該基準是通過一個專家知情的流程構建的，結合了出院摘要結構化方案、專家策劃的多輪問答模板和基於LLM的生成，隨後由11位醫療專家對每個問答樣本進行審查和修訂。對22個開源和閉源LLM的基準測試揭示了幾個挑戰，包括LLM在證據基礎方面的表現較內容回答更為困難，多輪錯誤在輪次之間累積，以及單輪臨床問答的表現並不可靠地轉移到這一情境中。這些發現確立了EHRNote-ChatQA作為評估臨床問答系統的一個嚴謹且實用的基準。該數據集將通過PhysioNet的認證訪問公開。

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

2606.15733v1 by Zhenyu Yu

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

摘要：指令調整的語言模型在其英語變數名稱被替換為類型保留的佔位符後，可以對相同的因果推理問題給出不同的答案，儘管結構因果模型和金標答案並未改變。我們詢問這種詞彙差距是否反映了佔位符視圖中的信息損失，或者是從仍然承載答案相關內容的表示中錯位的讀取。Vernier使用配對視圖權重更新作為工具，然後檢查在差距關閉後留下的機制。在工作狀態下，證據支持表示錯位。變數名稱探針在佔位符視圖上變得更準確，而在Qwen-7B、Qwen-14B和Llama-3.1-8B上的激活修補顯示決策令牌表示可以在視圖之間轉移答案身份。重新對齊視圖的更新是對原始和佔位符提示的反事實增強，而答案子空間KL主要加強了中間答案信念的一致性。成功受限於模型家族、規模和任務。CRASS轉移在Qwen規模和Llama之間是可靠的，而e-CARE仍然較弱，初步的非因果重命名任務顯示出類似的質量模式。

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

2606.15709v1 by Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

Jordan faces severe water scarcity, and 50% of the water it produces is lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst, which confirms alignment with DZ monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

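Summary: Jordan faces severe water scarcity, with 50% of the water it produces lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework that integrates EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, using retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT and offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection consistent with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes and flags a 15-junction cluster that localises the burst, confirming consistency with DZ monitoring practice. Through phased implementation, the framework accommodates Jordan's intermittent supply patterns and limited automation, offering water-scarce regions a scalable path to using intelligent automation for NRW reduction and operational efficiency.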

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

2606.15566v1 by Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.

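Summary: Qualitative coding is central to social science, but expert annotation is difficult to scale. Large language models (LLMs) offer a possible extension, yet they require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search that yields a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. On 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.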
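
The combined reliability scores quoted above can be checked by hand. The sketch below assumes the "combined" score is the plain two-term harmonic mean of ICC and $α$, as the abstract states; the function name `harmonic_mean` is illustrative and not taken from the paper.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive reliability coefficients."""
    return 2 * a * b / (a + b)


# Held-out set: ICC = 0.79, alpha = 0.74 (values quoted in the abstract).
held_out = harmonic_mean(0.79, 0.74)
# Deployed corpus: ICC = 0.80, alpha = 0.76.
deployed = harmonic_mean(0.80, 0.76)

print(round(held_out, 2), round(deployed, 2))
```

Both values round to the reported 0.76 and 0.78. Because the harmonic mean is pulled toward the smaller input, a high ICC cannot mask a low $α$.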

Hierarchical Modeling of ICD Codes in EHR Foundation Models

2606.15447v1 by Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.

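Summary: Electronic health record (EHR) foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchy that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study the ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, augmenting diagnosis sequences in a BERT-style transformer with tokens that correspond to different levels of the ICD hierarchy; second, injecting hierarchy into graph-based code representations by combining hierarchy-aware edges with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which hierarchy levels are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes the similarity structure of the embeddings. We run experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen-encoder probing. Our findings show that explicitly encoding the ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, and that the most useful hierarchy level depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy generalize across modeling settings and hierarchy levels.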
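
The first mechanism, adding hierarchy-level tokens to a diagnosis sequence, can be sketched as follows. This is a minimal illustration, not the authors' tokenizer: it assumes each level is simply a longer prefix of the ICD-10-CM code, starting at the three-character category, and the helper names `hierarchy_tokens` and `augment_sequence` are hypothetical. Coarser levels such as chapters or blocks would need a lookup table that is not shown here.

```python
from typing import List


def hierarchy_tokens(code: str) -> List[str]:
    """Return the prefixes of an ICD-10-CM code, coarsest first.

    Assumption: levels are code prefixes, beginning with the
    3-character category (e.g. "E11") and ending with the full code.
    """
    compact = code.replace(".", "")
    tokens = []
    for depth in range(3, len(compact) + 1):
        prefix = compact[:depth]
        # Re-insert the dot after the category for readability.
        tokens.append(prefix if depth == 3 else f"{prefix[:3]}.{prefix[3:]}")
    return tokens


def augment_sequence(codes: List[str]) -> List[str]:
    """Expand every code in a visit into its hierarchy tokens, in order."""
    augmented = []
    for code in codes:
        augmented.extend(hierarchy_tokens(code))
    return augmented


visit = ["E11.65", "I10"]
print(augment_sequence(visit))
```

With this expansion, two codes that share a category also share at least one token, which is one simple way a sequence model could see their relatedness.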

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

2606.15436v1 by Mayur Sanap, Prasanna Desikan, Edgar Lobaton

Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce the multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model x task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset size x head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric as large diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).

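Summary: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation where physical measurements are unavailable. We introduce a multi-model, multi-target cough regression benchmark that evaluates five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and beats linear probing in 23 of 30 model x task cases, while the full MLP overfits on small clinical data but recovers on larger sets, revealing a trade-off between dataset size and head capacity. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims because of possible HeAR-CIDRZ pretraining overlap. OPERA-GT outperforms OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples, whereas OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric: large, diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).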

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

2606.15419v1 by Zaifu Zhan, Shuang Zhou, Rui Zhang

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

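Summary: Objective: To improve the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA).
Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments used five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting.
Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority-voting ensembles (up to 0.789). The method also scaled effectively as more models participated, and peer assessments reliably distinguished high-quality from low-quality reasoning chains.
Conclusion: The proposed multi-agent peer-reviewed reasoning method lets LLMs act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, the approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.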
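
The selection step can be sketched in a few lines. The abstract says agents rate each other's reasoning and the highest-rated chain wins; it does not say how ratings are aggregated. The sketch assumes a mean over peer ratings with self-ratings excluded, and every name and number here (`select_by_peer_review`, the agents, the scores) is made up for illustration.

```python
from statistics import mean


def select_by_peer_review(answers, scores):
    """Pick the answer whose reasoning received the best mean peer rating.

    answers: author -> candidate answer
    scores:  reviewer -> {author: rating}; self-ratings are ignored.
    """
    peer_mean = {}
    for author in answers:
        ratings = [
            given[author]
            for reviewer, given in scores.items()
            if reviewer != author and author in given
        ]
        peer_mean[author] = mean(ratings)
    best = max(peer_mean, key=peer_mean.get)
    return best, answers[best], peer_mean


# Hypothetical ratings on a 1-5 scale.
answers = {"agent_a": "B", "agent_b": "C", "agent_c": "B"}
scores = {
    "agent_a": {"agent_b": 2, "agent_c": 4},
    "agent_b": {"agent_a": 3, "agent_c": 5},
    "agent_c": {"agent_a": 4, "agent_b": 2},
}
best, final_answer, peer_mean = select_by_peer_review(answers, scores)
print(best, final_answer)
```

Unlike majority voting, this rule can return a minority answer when its reasoning chain is rated highest, which matches the abstract's emphasis on reasoning quality over answer agreement.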

APEX: Adaptive Principle EXtraction, A Three-Layer Self-Evolution Framework for Production AI Agents

2606.15363v1 by Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

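Summary: Self-improvement in AI agents has become a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves a 14-21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension, the prompt harness, and leaves behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, which manages a 15-node compute fleet, using 114 real task traces collected over 18 days. In a single evolutionary run, APEX achieves an APEX Health Score of 0.570 (+90% over the 0.300 baseline), distils 6 novel reusable principles, and selects a research-first workflow topology scoring 0.900 (+20%). Our results show that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (about 270 s) on a local qwen2.5-coder:32b instance.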

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

2606.15284v1 by Chenyang He, Xinyi Shao, Shun Huang, Bosong Huang, Daoqiang Zhang, Ming Jing, Cheng Ding

Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .

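Summary: Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning focus mainly on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling the consistency of a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments show that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and an average relative gain of +26.7% across all tasks. We further improve the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .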
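
To make "cross-modal contrastive alignment" concrete, here is a minimal sketch of a symmetric InfoNCE loss over paired PPG and EHR embeddings. The abstract does not state which contrastive objective CAP uses, so the CLIP-style loss, the temperature of 0.1, and the function names below are all assumptions made for illustration; the repository linked above is the authoritative source.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(ppg_emb, ehr_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of each batch belongs to the same patient."""
    p = [_normalize(v) for v in ppg_emb]
    e = [_normalize(v) for v in ehr_emb]
    logits = [
        [sum(a * b for a, b in zip(pi, ej)) / temperature for ej in e]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy 2-patient batch: matched pairs versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each PPG embedding is closest to its own patient's EHR embedding and large when the pairing is swapped, which is the pressure that anchors the PPG encoder to patient-level context.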

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

2606.15278v1 by Jinhan Liu, Mahsa Shoaran

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

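Summary: Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, which makes robust representation learning from EEG/sEEG for clinical diagnosis challenging. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning, which evolves region structures from static anatomical definitions into adaptive functional regions. Self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state of the art in EEG emotion recognition and sEEG task-engagement classification. Importantly, its strong robustness to missing channels and its cross-montage generalization highlight its potential for large-scale pre-training on heterogeneous EEG/sEEG, while providing interpretable insights at both the region and channel levels.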

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

2606.15250v1 by Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

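Summary: Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture can be extended quickly to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outlines of the femur and tibia. We evaluated it on an internal test dataset of 50 patients and on a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and to manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.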

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

2606.15176v1 by Weihao Gao

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

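Summary: Ultrasound imaging is the most widely adopted medical imaging modality worldwide because of its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox in which the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation and now engineered for point-of-care ultrasound (POCUS), across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on a single-core CPU and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on the mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches the 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility and makes advanced diagnostics available in resource-limited settings.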

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

2606.15055v1 by Xinze Zhang

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

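Summary: Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains a Spearman correlation of 0.834 on the held-out city sequence, an absolute improvement of 6.1 points over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094, a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal moves mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical path toward geographically equitable streetscape inference at city scale.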
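
The two gap-reduction percentages follow directly from the three gap values quoted in the abstract. The short check below assumes "reduction relative to a baseline" means `(baseline - ours) / baseline`; the function name is illustrative.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


# Inter-city perception gaps quoted in the abstract.
vs_continual = relative_reduction(0.151, 0.094)
vs_regularisation = relative_reduction(0.218, 0.094)

print(f"{vs_continual:.0%} {vs_regularisation:.0%}")
```

Both results round to the reported 38% and 57%, so the figures are internally consistent.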

2606.15038v1 by Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.

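Summary: Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging because of modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. The CT and EHR modalities are encoded independently with domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks, pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). When the modalities contribute comparably, fusion consistently improves the concordance index by 1.5-5.4% over unimodal baselines. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.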
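
The concordance index used to compare the fusion strategies measures how often a model ranks patients' risks in the same order as their observed event times. The sketch below implements the standard Harrell's C-index on toy data; the abstract does not say which C-index variant or library the authors used, and the patient values are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    A pair (i, j) is comparable when subject i had an observed event
    and a strictly shorter time than subject j. The pair is concordant
    when the model assigned subject i the higher risk; ties count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical cohort: follow-up times, event flags (1 = event, 0 = censored),
# and model risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

Here five pairs are comparable and four are ordered correctly, giving 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.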

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

2606.15029v1 by Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.

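Summary: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters, a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We show empirically that Metric Match achieves a win rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and a 32.5% reduction in annotation needs. We provide a cost model and highlight a medical case study in which our method saves $1,041.67 in expert annotation compared with random selection. Further, we shift the task from reliability estimation to reliability classification, that is, deciding whether a given judge is above a deployment threshold, and Metric Match again outperforms random selection. All project code is publicly available, and we additionally provide an installable package for ease of use.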
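
The selection idea can be illustrated with a deliberately simple search. The abstract says the chosen subset should reproduce the population-level reliability metric computed against synthetic labels; it does not describe the optimizer. The sketch below assumes Pearson correlation as the metric and a best-of-N random search as the selector, and all names (`match_subset`, `n_candidates`) and the simulated data are hypothetical. The authors' released package is the reference for the actual method.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def match_subset(judge, synthetic, k, n_candidates=200, seed=0):
    """Among random size-k subsets, keep the one whose judge-vs-synthetic
    correlation is closest to the correlation on the full population."""
    rng = random.Random(seed)
    target = pearson(judge, synthetic)
    best_idx, best_gap = None, float("inf")
    for _ in range(n_candidates):
        idx = rng.sample(range(len(judge)), k)
        xs = [judge[i] for i in idx]
        ys = [synthetic[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined for constant vectors
        gap = abs(pearson(xs, ys) - target)
        if gap < best_gap:
            best_idx, best_gap = sorted(idx), gap
    return best_idx, best_gap, target


# Simulated judge scores and noisy synthetic labels for 100 samples.
data_rng = random.Random(1)
judge = [data_rng.random() for _ in range(100)]
synthetic = [j + data_rng.gauss(0, 0.3) for j in judge]

subset, gap, target = match_subset(judge, synthetic, k=20)
print(len(subset), gap < 0.05)
```

Only the samples in `subset` would then be sent to human annotators, on the premise that a subset that mirrors the population under synthetic labels is also representative under human labels.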

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

2606.14697v1 by Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

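Summary: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks focus mainly on data collection and often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect recall of medical knowledge, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.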

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

2606.14608v1 by Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

摘要：生存預測在醫療提供者和臨床研究人員中扮演著核心角色。準確的風險分層使得早期介入和改善病人管理成為可能。大多數現有的深度生存模型為所有病人學習一個共同的特徵表示，這可能隱藏了病人子群之間的重要差異。相對而言，混合專家（MoE）框架允許模型的不同部分專注於不同的病人模式，從而導致更個性化的表示。因此，在本研究中，我們提出了一種基於混合專家的增強自適應深度聚類生存框架（AdaCSM），用於建模這種異質的生存模式。我們引入了一種基於路由的專家機制，使得在參數生存建模框架內實現條件專業化。所提出的架構動態地將病人分配給專門的風險預測器，同時保留病人的生存和亞型聚類目標。我們將我們的方法與多個現實世界的長期臨床隊列中的最先進生存和深度聚類模型進行比較，這些隊列涵蓋了多種疾病領域。所提出的方法顯示出改進的預測性能，並在生存分析中導致可解釋的結果。

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

2606.14604v1 by Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates: PatchTST leads among trained models, while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes; and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.

摘要：可穿戴設備和智能手機生成豐富的行為時間序列，這些序列可以支持主動的健康干預，但對於這些數據的現代預測架構的系統比較仍然缺乏。特別是，尚不清楚模型如何在不同人群之間進行泛化，不同架構如何對參與者級別的微調做出反應，以及預測準確性如何在多天的預測範圍內下降。我們在三個公共數據集上基準測試了六種深度學習架構、兩種零樣本基礎模型（FM）和統計基準，這些數據集涵蓋了超過800名參與者，報告了步數、屏幕時間和睡眠持續時間在1-8天範圍內的每個特徵指標。我們還在所有六種架構上進行了每個特徵的個性化研究，並評估了FM在數據集大小和時間粒度上的可轉移性。我們的主要發現是：（i）沒有單一架構占主導地位，PatchTST在訓練模型中領先，而三個亞軍（TCN、MLP、Transformer）表現沒有顯著差異；（ii）FM TimesFM在零樣本情況下匹配或超過訓練模型，特別是在低數據環境下，以及（iii）參與者級別的微調使每個特徵的均方根誤差（RMSE）降低了16-60\%，其中睡眠受益最多，步數受益最少。這些結果為移動健康預測的架構選擇、FM適用性和個性化策略提供了實用指導。據我們所知，這是第一項共同評估現代深度學習、FM和個性化在可穿戴設備多時間範圍行為預測中的研究。

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

2606.14581v1 by Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

摘要：授予大型語言模型（LLMs）對昂貴且不可逆的科學實驗的直接控制會導致不安全的探索和不穩定的表現，但完全放棄LLM的創造力則犧牲了重要的優化潛力。我們提出了CARE（通過可審計的證據審查控制LLM生成的政策於科學實驗中），這是一個可審計的控制器，用於高通量實驗（HTE）優化，保持非LLM的現任優化器作為默認行動路徑，同時利用LLM來修訂挑戰者排名政策。在每個結果揭示之前，一個公共證據干預閘門會比較挑戰者與現任者。只有當選擇前可用的證據支持變更時，才授權挑戰者的選擇，並將決策記錄在審計日誌中。CARE在Minerva/Olympus和ChemLex基準測試中超越了所有其他評估方法，最終最佳成績在Minerva/Olympus上從80.0提升至88.5，在ChemLex上從83.9提升至92.1，相對於公共現任者。我們的實驗表明，當LLM在可審計的控制器下擴展提案空間時，自我演化更為可靠，而不是直接選擇實驗。
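The gating idea in the CARE abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class names `EvidenceGate` and `GateDecision`, the scalar `evidence` scores, and the `margin` rule are hypothetical and are not taken from the paper, which does not specify how the evidence comparison is computed. The sketch only shows the control flow described in the abstract: the incumbent's pick is the default, the challenger's pick is authorized only when pre-selection evidence favors it, and every decision is appended to an audit log.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    """One audit-log record (hypothetical schema)."""
    round_index: int
    incumbent_pick: str
    challenger_pick: str
    authorized: bool
    reason: str


@dataclass
class EvidenceGate:
    # Hypothetical rule: the challenger must beat the incumbent by `margin`.
    margin: float = 0.0
    audit_log: list = field(default_factory=list)

    def select(self, round_index, incumbent_pick, challenger_pick, evidence):
        """Return the experiment to run this round.

        `evidence` maps each candidate to a score computed only from data
        available before the outcome is revealed (an assumption of this sketch).
        """
        incumbent_score = evidence[incumbent_pick]
        challenger_score = evidence[challenger_pick]
        authorized = challenger_score > incumbent_score + self.margin
        reason = (
            f"challenger {challenger_score:.2f} vs incumbent "
            f"{incumbent_score:.2f} (margin {self.margin:.2f})"
        )
        self.audit_log.append(
            GateDecision(round_index, incumbent_pick, challenger_pick, authorized, reason)
        )
        # Default action path: fall back to the incumbent unless authorized.
        return challenger_pick if authorized else incumbent_pick


gate = EvidenceGate(margin=0.05)
first = gate.select(0, "A", "B", {"A": 0.60, "B": 0.62})   # gap too small
second = gate.select(1, "A", "C", {"A": 0.60, "C": 0.80})  # evidence supports change
```

Keeping the log append inside `select` means no selection can happen without a record, which is the property an auditable controller needs.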

Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

2606.14515v1 by Taym Alshoghri, Deemah H. Tashman, Mohammad Reza Gerami, Soumaya Cherkaoui

Internet of Medical Things (IoMT) devices operate under strict resource constraints while handling highly sensitive health data, making security and privacy critical concerns. Federated learning (FL) further complicates this landscape, as model updates exchanged during training may unintentionally expose private medical information. Emerging quantum computing capabilities threaten the long-term viability of conventional lightweight cryptographic mechanisms, motivating the integration of Post-Quantum Cryptography (PQC) into IoMT systems. This article discusses key enabling technologies for quantum-resilient IoMT, including post-quantum key establishment, lightweight encryption, and edge-native orchestration. We propose a scalable Kubernetes-based framework that integrates PQC into FL-enabled IoMT environments and validate it on a Raspberry Pi testbed. Results demonstrate that distributed cryptographic processing significantly reduces latency compared to sequential designs while maintaining feasible resource overhead. The primary contribution of this work lies in the design and validation of a secure orchestration and communication framework for FL-enabled IoMT systems. We conclude by outlining future directions toward energy-aware architectures, intelligent security optimization, and resilient next-generation Intelligent Internet of Medical Things (IIoMT) ecosystems.

摘要：物聯網醫療設備（IoMT）在處理高度敏感的健康數據時運行於嚴格的資源限制下，使得安全性和隱私成為關鍵問題。聯邦學習（FL）進一步複雜化了這一局面，因為在訓練過程中交換的模型更新可能無意中暴露私人的醫療信息。新興的量子計算能力威脅著傳統輕量級加密機制的長期可行性，促使將後量子加密（PQC）整合進IoMT系統。本文討論了量子抗性IoMT的關鍵支持技術，包括後量子密鑰建立、輕量級加密和邊緣原生編排。我們提出了一個可擴展的基於Kubernetes的框架，將PQC整合到支持FL的IoMT環境中，並在Raspberry Pi測試平台上進行驗證。結果顯示，與順序設計相比，分佈式加密處理顯著降低了延遲，同時保持了可行的資源開銷。這項工作的主要貢獻在於設計和驗證了一個安全的編排和通信框架，適用於支持FL的IoMT系統。我們最後概述了未來朝向能源感知架構、智能安全優化和韌性下一代智能醫療物聯網（IIoMT）生態系統的方向。

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

2606.14157v1 by Paula Joy B. Martinez

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.

摘要：城市透過混合的公私設施網絡提供基本服務，包括學校、診所、交通提供者和補貼服務點。在這些系統中，規劃者通常觀察家庭的去向，但並不瞭解它們在距離、價格和機構可及性等因素之間權衡的潛在成本函數。我們通過菲律賓的學校選擇研究這一城市問題，該國最大的國家教育補貼旨在將學習者從擁擠的公立學校引導到參與的私立學校。將學校之間的入學流視為一種熵最優運輸計劃，我們使用兩個互補的逆最優運輸模型恢復潛在的選擇成本：一個可解釋的距離帶模型，帶有補貼項，以及一個通過可微分的Sinkhorn前向傳遞訓練的神經成本模型。應用於最人口稠密地區的283,016次學習者出行，涵蓋23,820個觀察流，該框架估算出一個補貼等效距離$λ^{(k)}$，該距離被解釋為補貼抵消的感知旅行成本的公里數。這一案例展示了如何將行政來源-目的地數據轉化為可解釋的規劃指標，以便進行考慮可及性的補貼設計、設施選址和城市服務分配。
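For readers unfamiliar with the forward map that inverse optimal transport inverts, the following is a minimal pure-Python sketch of Sinkhorn iterations for an entropic transport plan. It is not the authors' implementation: the function name `sinkhorn_plan`, the toy 2x2 distance matrix, the uniform masses, and the regularization value `epsilon=1.0` are all illustrative assumptions. An inverse model would adjust the cost matrix until the plan produced here matches observed flows.

```python
import math


def sinkhorn_plan(cost, origin_mass, dest_mass, epsilon=1.0, n_iters=200):
    """Entropic optimal transport plan via Sinkhorn scaling (illustrative).

    cost[i][j] is the cost of moving one unit from origin i to destination j.
    The returned plan has row sums close to origin_mass and column sums
    close to dest_mass.
    """
    n, m = len(origin_mass), len(dest_mass)
    # Gibbs kernel K = exp(-C / epsilon).
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [origin_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [dest_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example (made-up numbers): two origins, two destinations, distances in km.
distance_km = [[1.0, 5.0], [5.0, 1.0]]
plan = sinkhorn_plan(distance_km, [0.5, 0.5], [0.5, 0.5])
```

In this toy case each origin sends most of its mass to the nearer destination, while the entropic term keeps a small flow to the farther one.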

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

2606.14031v1 by Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions under which such relations apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset of manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts, covering 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

摘要：識別某種藥物對目標疾病產生治療效果的條件對於臨床決策支持至關重要。然而，大多數現有的生物醫學信息提取方法僅專注於識別藥物與疾病之間的關係，而在很大程度上忽視了這些關係可以適用的具體條件。為了解決這個問題，我們引入了從生物醫學研究文獻中提取治療藥物-疾病關係的適用條件的任務。我們創建了第一個數據集，該數據集手動標註了包含1,119對藥物-疾病的三元組以及適用條件，並且基於生物醫學論文摘要。利用這個數據集，我們系統地評估了一系列現有方法的性能。此外，我們提出了一種新方法，增強了LoRA以考慮藥物與疾病之間的關係。我們的方法在不同的評估設置中始終超越強基準。本文的源代碼和數據集可以從以下網址獲取：https://github.com/guantingluo98/Drug-ACE

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

2606.13839v1 by Louis Chen, Torbjörn E. M. Nordling

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque, a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e., trustworthy rPPG XAI.

摘要：遠程光電容積描記法（rPPG）Transformer在基準測試中實現了低心率誤差，但它們的決策仍然不透明——隨著rPPG向臨床心率估計的發展，這成為一個日益關注的問題。現有的rPPG可解釋人工智慧（XAI）主要依賴於定性的熱圖檢查，缺乏定量的真實性指標或基於生理學的驗證，這使得視覺上的合理性與可審計的證據之間存在差距。我們針對這一差距進行了研究。首先，我們將四種歸因方法（原始注意力、展開、流動、超越直覺）適應於RhythmFormer的雙層路由注意力，並進行top-$k$選擇。其次，我們引入了一種皮膚覆蓋度指標，量化有多少歸因質量落在皮膚區域上。第三，我們將SaCo真實性係數從其原始分類設置調整為rPPG回歸，通過使用原始和擾動預測的rPPG波形之間的平均絕對誤差（MAE）作為擾動影響。應用這些工具，我們量化了在稀疏top-$k$路由下的多跳洩漏效應：注意力展開和流動幾乎完全恢復了個別精煉注意力層明確設置為零的連接。超越直覺通過其值投影加權的展開和梯度支持的掩碼來減輕這一問題，在評估的UBFC-rPPG方法中達到了最高的中位數精煉皮膚覆蓋度（$0.83$對比$0.57$的普通展開）和真實性（$F=0.92$）。需要在多樣化數據集和模型變體上進行驗證。一項關於低SaCo異常值的案例研究進一步顯示，一旦替換了人為產生的區域，所有四種方法都能一致恢復，這表明在這一示例案例中，歸因家族之間的SaCo行為是一致的。總體而言，這些指標使rPPG的XAI朝著可審計的數字證據邁進，關於空間對齊和擾動真實性，即值得信賴的rPPG XAI。
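The multi-hop leakage effect can be reproduced on a toy example. The sketch below uses the standard attention-rollout recipe (mix each attention matrix with the identity, then multiply across layers) on two hand-made sparse layers; the 3-token matrices are invented for illustration and are unrelated to RhythmFormer's actual weights. The `skin_coverage` helper is an assumed, simplified reading of the metric (share of attribution mass on skin), not the paper's exact definition.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def attention_rollout(layers):
    """Standard rollout: product over layers of 0.5 * A + 0.5 * I."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


def skin_coverage(attribution, skin_mask):
    """Fraction of total attribution mass on skin (assumed definition)."""
    total = sum(attribution)
    on_skin = sum(a for a, is_skin in zip(attribution, skin_mask) if is_skin)
    return on_skin / total


# Each layer sets the direct token 0 -> token 2 connection to exactly zero.
sparse_layer = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
rollout = attention_rollout([sparse_layer, sparse_layer])
leaked = rollout[0][2]  # non-zero: reached via the two-hop path 0 -> 1 -> 2

coverage = skin_coverage([0.2, 0.3, 0.5], [True, True, False])
```

Although no single layer connects token 0 to token 2, the product of layers does, which is the kind of restored connection the abstract describes under sparse top-$k$ routing.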

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

2606.13572v1 by Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

摘要：多模態大型語言模型（MLLMs）在一般領域中顯示出有希望的推理能力，但在專業環境中，如醫療保健，尤其是在多語言和低資源的情境下，其表現仍然有限。這一差距在像印度農村這樣的地區尤為重要，患者經常用母語印地語表達複雜的醫療問題，並依賴醫療影像等多模態輸入。現有以英語為中心的 MLLMs 難以支持這類用例，限制了公平獲得 AI 驅動的醫療協助。為了解決這一挑戰，我們介紹了 ArogyaBodha，這是一個大型多語言多模態醫療問答數據集，該數據集由八個異質來源構建，涵蓋 31 個身體系統、六種影像模態和 21 個臨床領域，涉及英語和七種主要的印度語言。我們進一步提出了 ArogyaSutra，一個基於演員-評論家的多代理框架，將工具基礎與雙重記憶機制整合，用於逐步的、具推理意識的決策，並利用存儲的演員-評論家模擬軌跡進行蒸餾。實驗表明，我們的數據集和框架提高了所有印地語言的多語言醫療推理準確性，消融實驗驗證了每個組件的貢獻。源代碼和數據集可在以下位置獲得：https://iitp-cse.github.io/ ArogyaSutra/

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

2606.13556v2 by Aruna Dey, Suraj Biswas

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)G-hat_genomic + [1-w(t)]P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.

摘要：個性化健康人工智慧系統面臨一個根本的冷啟動問題：生理解釋的機器學習模型需要數週的個體行為數據，才能區分憲法變異與環境驅動的偏差。我們提出了一個基於因果推斷和貝葉斯先驗設計的解決方案。個體的基因組特徵作為外生的遺傳錨點——一個基於領域知識的個性化先驗，固定於受孕時，不受逆因果影響，並且在收集到任何行為觀察之前就已可用。這個錨點初始化了一個貝葉斯信念狀態，該狀態關於個體的生理設置點 G-hat = mu + sum(beta_i * g_i)，其中 beta_i 是 GWAS 衍生的效應大小，g_i 是風險等位基因數量。每個進來的生理測量 P 產生一個非憲法偏差 delta = P - G-hat，這將環境和狀態所造成的信號與憲法固定的基線分開。隨著行為數據的累積，先驗根據 G-hat_t = w(t)G-hat_genomic + [1-w(t)]P-bar_t 衰減，從基因組主導的推斷過渡到經驗基線主導的推斷。同樣觀察到的 HRV 為 55 毫秒，對於一個其先驗預測 80 毫秒的人，產生了一個抑制假設，而對於一個其先驗預測 30 毫秒的人，則產生了一個增強假設——這種反轉在沒有個性化錨點的情況下是不可能的。我們在六個生理領域內發展這一架構，根據證據強度對基因組先驗進行分級，明確區分穩健重複的錨點（FTO、FADS1/2、FKBP5）與有爭議的候選基因（SLC6A4、MAOA、DRD2）。我們解決了關聯、孟德爾隨機化和個體標記因果之間的推斷邊界，並定義了四個部署約束：證據分級的先驗、動態衰減、祖先匹配的效應大小，以及歸因而非確定性輸出。
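The update rule in the abstract can be traced with a short Python sketch. The formulas for the set point, the deviation, and the blended estimate follow the abstract; the exponential form of `w(t)`, the 14-day half-life, and all numeric effect sizes are assumptions chosen purely for illustration, since the abstract only states that the prior decays over time.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i), as given in the abstract."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def prior_weight(t_days, half_life_days=14.0):
    """w(t): assumed exponential decay (the abstract does not fix its form)."""
    return 0.5 ** (t_days / half_life_days)


def blended_set_point(g_hat_genomic, p_bar_t, t_days, half_life_days=14.0):
    """G-hat_t = w(t) * G-hat_genomic + [1 - w(t)] * P-bar_t."""
    w = prior_weight(t_days, half_life_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


def deviation(measurement, g_hat):
    """delta = P - G-hat: the non-constitutional part of a measurement."""
    return measurement - g_hat


# Made-up effect sizes and allele counts, for illustration only.
anchor = genomic_set_point(50.0, [5.0, -2.0], [2, 1])

# The HRV example from the abstract: the same 55 ms reading, two different priors.
delta_high_prior = deviation(55.0, 80.0)  # negative: suppression hypothesis
delta_low_prior = deviation(55.0, 30.0)   # positive: enhancement hypothesis

# After one assumed half-life the estimate sits midway between prior and data.
midway = blended_set_point(80.0, 55.0, t_days=14.0)
```

The sign flip between the two deviations is the reversal the abstract highlights: the reading is identical, and only the personalized anchor changes its interpretation.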

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

2606.13258v2 by Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

摘要：步態基礎的帕金森病評估越來越依賴異質傳感器，但臨床系統很少同時收集所有模態。新傳感器可能通過設備升級、協議變更或多中心部署而出現，而歷史病人數據因隱私和存儲限制而常常無法獲得。這種模態增量設置面臨三個挑戰：不可靠的跨模態蒸餾、模態特定的統計變化，以及保存後的可塑性降低。我們提出了MOSAIC，一個緊湊的持續學習框架。首先，我們識別了有毒教師現象，並引入模態特定的熱身，以穩定新學習的模態表示，然後進行蒸餾。其次，我們提出了一種統計解耦的MSBN架構，該架構在保持共享語義骨幹的同時隔離傳感器統計。第三，我們設計了一個課程引導的排斥目標，用於可塑性恢復，在恢復模態特定能力的同時保留遺留知識。在三個多模態帕金森步態數據集上的實驗表明，MOSAIC改善了最終性能並減輕了遺忘。項目代碼可在以下鏈接獲得：https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

2606.13211v1 by Omar Alshahrani, Muzammil Behzad

AI systems are being deployed across medical imaging faster than their failure modes are understood. At present, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for, among other things, biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

摘要：AI 系統在醫學影像領域的部署速度超過了對其失效模式的理解。此時，最大的臨床關注點是幻覺：臨床上看似合理但事實上不正確的輸出，包括虛構的解剖結構、漏診、錯誤的側別以及生成報告中的虛構測量，這些都會直接影響，例如，活檢決策、分期和治療計劃。這篇結構化敘述綜合了同行評審的研究、基準數據集和 FDA 的監管指導，涵蓋五種影像模式，以產生對幻覺分類、病因、檢測和緩解的跨模式分析。具體而言，我們在這項研究中解決三個問題：(1) 如何統一現有的分類法？(2) 醫學專用的基礎模型如何比通用模型產生更少的幻覺？以及 (3) 哪些緩解策略是有效的並且與 FDA 生命週期監管相容？我們注意到，三個分類框架共同覆蓋了影像流程，而單一框架無法單獨做到這一點。我們還強調，通用基礎模型在針對幻覺的基準測試中表現優於醫學專用模型，這表明狹窄領域的微調可能會導致過擬合引起的虛構。同時，放射科醫生的監督仍然至關重要；例如，極高比例的 AI 生成標記在臨床使用前需要專家修正。基於物理的架構約束、思考鏈提示以及人機協作的安全措施各自針對不同的失效模式，並且在結合使用時效果良好。所有發現都映射到 FDA 的總產品生命週期和預定變更控制計劃框架，將幻覺管理視為生命週期的責任，而非部署前的檢查清單。

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

2606.13188v1 by Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.

摘要：建立病人特定的心臟模型是精準心臟醫學的核心，但將這些模型投入臨床使用時卻不斷遇到同樣的障礙：網格生成緩慢、混亂且令人沮喪。標準工作流程——對影像進行分割、運行 Marching Cubes，然後手動清理結果——耗時、在操作人員之間不一致，並且需要大多數臨床團隊所不具備的專業知識。我們採取了根本不同的方法。與其將分割和網格生成視為兩個獨立的問題，我們訓練了一個單一的端到端網絡，直接從原始的 3D 醫學影像生成平滑、可用於模擬的心臟表面網格。核心是一個 3D Swin Transformer 編碼器-解碼器，從 CT 或 MRI 體積中提取體積特徵，並配合一個圖形注意力網絡（GAT）頭，迭代地變形模板網格以符合病人的心臟邊界。我們在 MM-WHS 2017 基準上進行了測試，使用了 CT 和 MRI。分割得分具有競爭力（CT 的 Dice 為 0.84，MRI 為 0.83），但主要焦點是網格質量：平均 Chamfer 距離為 1.8 毫米，95 百分位表面距離低於 5 毫米。每個網格都是在單次前向傳遞中生成的——不需要 Marching Cubes，不需要平滑濾波器，也不需要手動清理。我們認為，對於心臟數位雙胞胎管道來說，幾何保真度和拓撲正確性比像素級的 Dice 得分更為重要。通過消除後處理瓶頸，這種方法使病人特定的心臟模擬在臨床使用上變得更加可及。
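Since the abstract reports mesh quality as a mean Chamfer distance, a small pure-Python sketch of that metric may help. Chamfer distance has several conventions in the literature; this version assumes unsquared Euclidean distances and averages the two directional means, which may differ from the convention used in the paper. The point sets are toy values, not cardiac data.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two 3D point sets.

    Assumed convention: for each direction, average the distance from every
    point to its nearest neighbour in the other set, then average the two
    directional means. This brute-force version is O(len(a) * len(b)).
    """
    def mean_nearest(source, target):
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Toy vertices in millimetres (illustrative only).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
cd = chamfer_distance(predicted, reference)
```

For real meshes with many thousands of vertices, a spatial index such as a k-d tree would replace the brute-force nearest-neighbour search.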

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

2606.13176v1 by Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

摘要：心理健康問題，如焦慮、抑鬱和自殺，仍然是迫切的全球挑戰，及時和準確的評估對於有效的干預至關重要。最近，大型語言模型已被探索用於心理健康評估。然而，現有的一般性後訓練方法與人類評估的認知過程不一致，這可能導致不可靠的推理結果。為了填補這一空白，我們提出了認知相對策略優化（CRPO），這是一個針對心理健康領域量身定制的強化學習框架。CRPO 通過將階段依賴的不確定性建模整合到策略優化過程中，擴展了群體相對策略優化。具體而言，我們引入了一種階段性熵正則化機制，鼓勵在早期推理階段進行廣泛探索，並在後期階段逐步強化自信的決策，模仿人類從不確定性轉向確定性的認知轉變。此外，受到認知評估理論的啟發，我們正式化了認知推理階段，從而指導理論基礎的可解釋推斷。在8個心理健康數據集上的實驗顯示，CRPO 在加權 F1 分數上比最佳強化學習基線平均提高了10.4個百分點。此外，CRPO 訓練的模型 Mental-R1 在推理密集型案例中與現有的大型語言模型相比顯示出明顯的優勢，這表明 CRPO 增強了心理健康評估的推理能力。

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

2606.13135v1 by Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.

摘要：目的。比較深度學習架構和皮膚腫瘤的皮膚鏡圖像分類方案，並評估其從開放國際數據集轉移到俄羅斯臨床實踐獨立數據集的泛化能力。方法。比較了四種架構（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）在三種方案中的表現：二元（惡性/良性）、單階段四類（良性、MEL、SCC、BCC），以及兩階段級聯（二元篩選，然後三類區分MEL/SCC/BCC）。所有模型均使用ImageNet預訓練權重和單一增強協議，基於聚合的開放ISIC檔案數據進行訓練，並在內部保留樣本和兩個臨床數據集（Melanoscope AI移動系統；塞琴諾夫大學）上進行評估。結果。在內部，二元階段達到ROC-AUC 0.952-0.966；在塞琴諾夫大學下降至0.797-0.893，靈敏度降至0.53-0.67，ECE從0.02上升至0.27-0.39，並低估了惡性腫瘤，量化了排名和校準中的泛化差距。配對測試確認了一個臨床數據上的架構間結果：ViT-B/16在二元階段的不足（p<0.05）；在區分階段，沒有架構具有明顯優勢。級聯方法在大多數架構中提高了宏觀F1分數，超過單階段四類分類，但對於ViT-B/16的提升顯著，因為它恢復了被分配到主導良性類別的惡性病變。在ISIC MILK10k上，直接的11類分類產生的平均類別靈敏度為0.525。結論。可調的篩選閾值提供了在標準單階段（argmax）分類中無法實現的靈敏度控制，並更好地重現臨床鑑別診斷邏輯。持續的泛化差距要求在部署之前進行外部臨床驗證和重新校準。
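The two-stage cascade and its tunable triage threshold can be sketched as follows. This is an illustration of the scheme as described in the abstract, not the authors' code: the function names, the dictionary of subtype probabilities, and the example scores are hypothetical. It shows why a threshold gives sensitivity control that a plain argmax over four classes does not.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Stage 1: binary triage. Stage 2: pick among malignant subtypes."""
    if p_malignant < triage_threshold:
        return "benign"
    # Only lesions that pass triage reach the three-class differentiation.
    return max(subtype_probs, key=subtype_probs.get)


def triage_sensitivity(p_malignant_list, is_malignant_list, triage_threshold):
    """Share of truly malignant lesions flagged by the binary stage."""
    flagged = [
        p >= triage_threshold
        for p, malignant in zip(p_malignant_list, is_malignant_list)
        if malignant
    ]
    return sum(flagged) / len(flagged)


# Hypothetical model outputs for four lesions (three malignant, one benign).
scores = [0.9, 0.4, 0.3, 0.1]
labels = [True, True, True, False]

default_sensitivity = triage_sensitivity(scores, labels, 0.5)
tuned_sensitivity = triage_sensitivity(scores, labels, 0.25)

label_default = cascade_predict(0.3, {"MEL": 0.6, "SCC": 0.1, "BCC": 0.3}, 0.5)
label_tuned = cascade_predict(0.3, {"MEL": 0.6, "SCC": 0.1, "BCC": 0.3}, 0.25)
```

Lowering the threshold recovers malignant lesions that would otherwise be assigned to the benign class, at the cost of more benign lesions entering stage 2.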

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

2606.13051v1 by Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. AAbAAC was used first to evaluate several methods on the task of named entity recognition (NER), and second to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

摘要：儘管深度學習和大型語言模型推動了信息提取的進步，但在高度專業化的生物醫學領域，性能差距仍然存在，這些領域特有的複雜性對通用模型構成挑戰。在這項工作中，我們專注於自體免疫領域，主要的關注實體包括自體免疫疾病、自體抗體（即可能標記或引起這些疾病的分子）、它們的分子靶點、它們在體內的位置以及相關的臨床徵兆。在此，我們呈現了 AAbAAC（自體抗體與自體免疫註釋語料庫），這是一個從 PubMed 中選取的 115 篇摘要的語料庫，我們手動註釋了實體及其關係。首先，AAbAAC 被用來評估幾種命名實體識別（NER）任務的方法，其次，用於微調 NER 模型。我們的研究展示了 AAbAAC 在自體免疫領域的信息提取中的實用性，顯示出在微調後 NER 性能的預期改善。這說明了小規模註釋工作對專業領域的價值，並為自體免疫的計算研究做出了貢獻。AAbAAC 語料庫可在 https://github.com/f-maury/AAbAAC 獲得。

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

2606.12988v1 by Manex Atxa, Bruno Simoes, Julen Balzategui

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

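Abstract: This paper introduces a new method for real-time prediction of human poses, both ergonomic and non-ergonomic, from three-dimensional volumetric video data. Although the method was designed for ergonomic assessment, it can be adapted to other applications that require real-time analysis of human posture. One aspect that sets the system apart is its ability to analyze 3D point clouds during the assessment, which allows computation from multiple angles. This overcomes a key limitation of cameras, which usually provide a fixed viewpoint and thus limit the data available for a thorough postural evaluation, particularly when occlusions occur. The system continuously and automatically performs pose inference on the real-time streaming data from the chosen viewpoint; however, only poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The method was refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on these data and, after the training phase, performs real-time inference on new streaming data. By combining state-of-the-art 3D data technologies with traditional 2D pose estimation algorithms, this research provides a scalable and pragmatic approach to real-time ergonomic evaluation. It addresses the growing need for safety and health monitoring in workplace environments and marks a notable contribution to the field.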
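The abstract describes choosing a perspective on a 3D point cloud before running a 2D pose estimator. The sketch below shows only that geometric step under simple assumptions: a rotation about the vertical axis followed by an orthographic projection. The function names and the sample points are hypothetical, and the paper's actual pipeline is not specified in the abstract.

```python
import math

def rotate_about_y(points, angle_deg):
    """Rotate (x, y, z) points about the vertical y axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(x * c + z * s, y, -x * s + z * c) for x, y, z in points]

def project_to_image_plane(points):
    """Orthographic projection: drop the depth coordinate z and keep (x, y)."""
    return [(x, y) for x, y, _ in points]

# Hypothetical fragment of a point cloud, in metres.
cloud = [(0.0, 1.7, 0.0), (0.2, 1.4, 0.1), (0.2, 1.0, 0.3)]

# Render a side view; a 2D pose estimator would then run on this projection.
side_view = project_to_image_plane(rotate_about_y(cloud, 90.0))
print(side_view)
```

Because the viewpoint is a parameter, the same cloud can be projected from whichever angle leaves the joints of interest unoccluded.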

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

2606.12953v1 by Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.

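Abstract: We present OpenMedQ, a medical vision-language model pretrained on the broadest fully open medical data mix to date: 14 datasets totaling about 3.35 million pretraining samples and covering pathology, radiology, microscopy, and text-only clinical question answering. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), surpassing Med-PaLM M variants of up to 562B parameters (about 80 times larger), and matches the best reported BLEU-1 on VQA-MED (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and provide an interactive demo as a reproducible baseline for the community.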
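The encoder comparison above is reported as average macro-F1 across benchmarks. As a reminder of what that metric computes, here is a minimal pure-Python sketch; the labels are toy values, not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy binary example: class 0 has F1 = 2/3, class 1 has F1 = 4/5.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
print(round(score, 4))  # 0.7333
```

Averaging per class rather than per sample keeps rare classes from being swamped by frequent ones, which matters on imbalanced medical benchmarks.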

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

2606.12824v1 by Daniel Soliman

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

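Abstract: AI governance for medical imaging is being formalized: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary and currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope on which a model was validated. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), the kernel alone shifted the AI-measured diameter and flipped the Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects separated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transferred across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, with frequency content mapping to measurement reliability and noise to detection sensitivity, and it cannot be recovered from metadata. Acquisition-aware, input-side validation is the missing layer in the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.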
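To make the "category flip" statistic concrete, the following sketch counts how often two diameter measurements of the same nodule land in different size bands. The cut-offs (<6 mm, 6-8 mm, >8 mm) are the bands commonly quoted from the Fleischner Society guideline for solid nodules and should be treated as an assumption here; the paired diameters are invented for illustration.

```python
def fleischner_size_category(diameter_mm: float) -> str:
    """Map a nodule diameter to an assumed Fleischner-style size band."""
    if diameter_mm < 6.0:
        return "<6 mm"
    if diameter_mm <= 8.0:
        return "6-8 mm"
    return ">8 mm"

def flip_rate(paired_diameters):
    """Fraction of nodules whose size band differs between two reconstructions."""
    flips = sum(
        1
        for soft, sharp in paired_diameters
        if fleischner_size_category(soft) != fleischner_size_category(sharp)
    )
    return flips / len(paired_diameters)

# Hypothetical (soft-kernel, sharp-kernel) diameters in mm for four nodules.
pairs = [(5.8, 6.1), (4.0, 4.3), (7.9, 8.2), (10.0, 10.4)]
print(flip_rate(pairs))          # 0.5

# The rate reported in the abstract: 8 flips out of 155 nodules.
print(round(100 * 8 / 155, 1))   # 5.2
```

Small diameter shifts only matter clinically when they cross a band boundary, which is why a modest measurement drift can still change a follow-up recommendation.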

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

2606.12702v1 by Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.

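Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, which makes it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets, which leads to major blind spots when evaluating clinical systems. In this study, we perform a deployment-centered evaluation of an LLM system embedded in the electronic health records of an academic medical center, where user feedback is sparse but closely reflects deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will lead the user to reject the LLM response, based on the query content and the deployment-specific context available before generation. We conduct a prospective analysis of the model over 4.5 months of user feedback and find that our prediction model achieves an AUROC of 0.719. We also estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that using deployment-specific context (i.e., the provider type, department name, and language model used for the response), rather than query content alone, improves the ability to predict whether the user will reject the system output. Overall, our empirical case study demonstrates the feasibility of predicting user rejection from deployment-specific context and paves the way for targeted guardrails.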
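The reported AUROC of 0.719 is the probability that a randomly chosen rejected interaction receives a higher risk score than a randomly chosen accepted one. A minimal sketch of that pairwise definition follows, together with a hypothetical threshold rule for guardrail triggering; the scores, labels, and threshold are illustrative and not taken from the paper.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

def should_trigger_guardrail(risk: float, threshold: float = 0.5) -> bool:
    """Hypothetical policy: intervene when predicted rejection risk is high."""
    return risk >= threshold

# Toy data: label 1 means the user rejected the response.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))           # 0.75
print(should_trigger_guardrail(0.8))   # True
```

The quadratic pairwise loop is fine for a sketch; a rank-based formulation would be preferable on large feedback logs.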

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

2606.12699v1 by Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

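Abstract: Type 2 diabetes (T2D) poses a growing threat to global health and calls for effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers provide many valuable insights for glycemic assessment. However, analyzing these data effectively requires integrating them with essential individual-level context. Existing methods are usually based on traditional machine learning (ML), rely mainly on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have shown that they can integrate diverse data modalities while modeling sequential dependencies, which motivates exploring their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-based framework that models CGM-based glycemic dynamics by integrating wearable sensor data with structured metadata. GlyLLM can draw on the extensive prior knowledge of pretrained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset show that our model outperforms traditional ML methods by an average of 13.66% in root mean squared error (RMSE) for glucose forecasting and 13.08% in area under the receiver operating characteristic curve (AUROC) for diabetes categorization. In addition, our ablation study shows that diabetes surveys and biometric tests are more critical for glycemic assessment than other health information. Our work is a promising step toward using LLMs to advance personalized glycemic assessment in T2D care.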

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

2606.12666v2 by Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results show that task-driven selective exposure can reduce incidental visual leakage before screenshots are released to a remote GUI agent.

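Abstract: Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive content unrelated to the user's request. We call this problem incidental visual privacy exposure. It is hard to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete its task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are sent to a remote multimodal agent, it extracts the task requirements, uses screen context as a privacy prior, parses the visible UI elements, and selectively exposes only the content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and use a controlled, seeded privacy evaluation of 28 tasks as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results show that task-driven selective exposure can reduce incidental visual leakage before screenshots are sent to a remote GUI agent.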
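The core idea of task-driven selective exposure can be illustrated with a deliberately simplified sketch: each parsed UI element is exposed if it is either non-private or needed for the current task, and masked otherwise. The `UIElement` type, the `private` flag, and the keyword-matching rule are all hypothetical stand-ins; the abstract does not describe how CAPED makes these decisions internally.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    text: str
    private: bool  # hypothetical output of an on-device privacy classifier

def select_exposure(elements, task_terms):
    """Mask private elements unless the task needs them (keyword match)."""
    exposed = []
    for el in elements:
        needed = any(term.lower() in el.text.lower() for term in task_terms)
        if el.private and not needed:
            exposed.append("[MASKED]")
        else:
            exposed.append(el.text)
    return exposed

# Hypothetical messaging screen; the task is "reply to Bob".
screen = [
    UIElement("Send", private=False),
    UIElement("Alice: lab results are in", private=True),
    UIElement("Bob: lunch at noon?", private=True),
]
print(select_exposure(screen, ["Bob"]))
# ['Send', '[MASKED]', 'Bob: lunch at noon?']
```

Controls such as the "Send" button stay visible, which is the property generic masking tends to break.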

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

2606.12590v1 by Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

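Abstract: Large vision-language models (LVLMs) have achieved strong performance on medical imaging tasks, but they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment methods, including Direct Preference Optimization (DPO) and its variants, face three key limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens the same as generic filler text; (2) relying on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift that steers optimization toward stylistic artifacts rather than clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle but diagnostically decisive pathological features. Our method uses a bidirectional token-wise KL regularizer together with a visual-contrastive grounding objective that pairs clean images with lesion-corrupted images to penalize responses generated without sufficient visual evidence. Together, these components form a fine-grained, on-policy alignment framework that builds preference pairs by minimally editing model-generated outputs, correcting only the clinically erroneous spans while preserving the original linguistic style. Extensive experiments on medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.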
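The abstract does not define the "bidirectional token-wise KL regularizer". One plausible reading, sketched below purely as an assumption, is a per-token sum of forward and reverse KL divergence between the policy and reference next-token distributions, averaged with optional per-token weights so that clinically critical tokens can count more. All names and numbers are illustrative.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_token_kl(policy, reference, weights=None):
    """Weighted mean over tokens of KL(policy||ref) + KL(ref||policy)."""
    if weights is None:
        weights = [1.0] * len(policy)
    total = sum(
        w * (kl(p, q) + kl(q, p))
        for p, q, w in zip(policy, reference, weights)
    )
    return total / sum(weights)

# Two token positions over a toy 2-word vocabulary.
policy = [[0.5, 0.5], [0.9, 0.1]]
reference = [[0.9, 0.1], [0.9, 0.1]]

# Hypothetical weights: the first token is treated as clinically critical.
penalty = bidirectional_token_kl(policy, reference, weights=[2.0, 1.0])
print(penalty > 0.0)  # True
```

Operating per token rather than per sequence is what lets such a penalty be concentrated on the spans that were actually edited.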

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

2606.12569v1 by Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may take numerical (e.g., for blood saturation), categorical (e.g., for level of consciousness), binary (e.g., for presence of traumas), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in the data available to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

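Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in the emergency departments of Italian hospitals. In its current version the corpus contains about 4 million fully anonymized clinical notes covering the different phases of patient care during a stay in the emergency department. In addition, about six thousand notes have been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant to two patient situations in emergency departments: dyspnea and loss of consciousness. Items may take numerical (e.g., blood oxygen saturation), categorical (e.g., level of consciousness), binary (e.g., presence of trauma), or mixed value types. The annotation process involved several clinicians and went through multiple revisions to resolve ambiguities in item wording, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant data gap in order to support the development and use of large language models in concrete medical applications. We describe the data collection protocol, the on-site anonymization pipeline, the corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark and provide zero-shot baselines from Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes in Italian.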
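CRF-filling asks a model to populate typed form items from a free-text note. The sketch below shows how the three basic value types named in the abstract (numerical, categorical, binary) might be represented and validated. The item names and the category list are hypothetical, since the actual 132-item form is not reproduced in the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class CRFItem:
    name: str
    kind: str  # "numerical", "categorical", or "binary"
    choices: Optional[Sequence[str]] = None  # only used for categorical items

def is_valid(item: CRFItem, value) -> bool:
    """Check that an extracted value matches the item's declared type."""
    if value is None:
        return True  # the note does not mention this item
    if item.kind == "numerical":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if item.kind == "binary":
        return isinstance(value, bool)
    if item.kind == "categorical":
        return value in (item.choices or ())
    return False

# Hypothetical items for illustration only.
saturation = CRFItem("oxygen_saturation", "numerical")
consciousness = CRFItem("consciousness_level", "categorical",
                        choices=("alert", "verbal", "pain", "unresponsive"))
trauma = CRFItem("trauma_present", "binary")

print(is_valid(saturation, 94))          # True
print(is_valid(consciousness, "sleepy")) # False
print(is_valid(trauma, True))            # True
```

Allowing `None` for unmentioned items matters in practice, because most notes fill only a small part of the form, which is one source of the imbalance the abstract mentions.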

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

2606.12346v1 by Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

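Abstract: Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, but scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a major challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, providing more than 4,500 quantitative readouts per slide at cell-level resolution. A major challenge in validating such systems is overcoming the morphological ambiguity inherent in H&E-only ground truth, together with the limited scalability of more informed references that draw on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework that combines biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement compared with conventional H&E-only annotation. This provides a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on more than 200,000 high-confidence H&E-only pathologist annotations from over 1,500 cases spanning eight cancer types and their most common metastatic sites, with subtypes covering more than 90% of clinical cases per cancer type, drawn from over 25 sources and more than 8 scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologists' H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide, the most ubiquitous data in pathology, into a scalable, quantitative window onto the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.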

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

2606.12252v1 by Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

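Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources needed for repeated model development and deployment. This challenge is especially evident in electrocardiogram (ECG) classification, where large datasets and long training schedules make efficiency very important. Progressive Data Dropout lowers training cost by excluding samples from gradient updates once they have been learned, but it relies on model confidence and may retain samples that are hard because of noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish informative uncertainty from unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localized patterns. Samples with low focus are filtered out, while samples with meaningful attention are prioritized for gradient updates. We evaluate ERTS on three ECG datasets and several backbone architectures and show consistent improvements in macro-F1 together with a lower effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.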
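The abstract does not say how the focus score is computed from a Grad-CAM map. As one illustrative possibility (an assumption, not the authors' definition), the sketch below scores an attention map by one minus its normalized entropy, so that a sharply localized map scores near 1 and a diffuse map near 0, and then keeps only samples above a hypothetical threshold.

```python
import math

def focus_score(attention):
    """1 - normalized entropy of a non-negative 1D attention map (len > 1)."""
    total = sum(attention)
    if total <= 0:
        return 0.0  # no attention mass at all: treat as unfocused
    probs = [a / total for a in attention]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

def select_for_update(samples, threshold=0.3):
    """Keep the ids of samples whose attention map is sufficiently focused."""
    return [sid for sid, attn in samples if focus_score(attn) >= threshold]

# Hypothetical attention maps over four time segments of an ECG.
samples = [
    ("localized", [0.0, 0.1, 5.0, 0.1]),  # attention concentrated on one segment
    ("diffuse", [1.0, 1.0, 1.0, 1.0]),    # attention spread evenly (likely noise)
]
print(select_for_update(samples))  # ['localized']
```

The threshold of 0.3 is arbitrary; in a real training loop it would need tuning, and the score would be computed on the map produced for the predicted class.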

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

2606.12169v1 by Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.

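Abstract: High-stakes clinical use of large vision-language models (LVLMs) requires reasoning grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus containing about 450K image-question-answer instances whose reasoning traces are derived mainly from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision that goes beyond synthetic chains of thought and covers diverse medical visual modalities such as radiological scans, microscopic images, visible-light photographs, and charts. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary capability axes (perception, medical knowledge, and rationale), taking diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that proves effective in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason improves VQA accuracy by 20% on average over the base model and reaches performance within 4.2% of the strongest medical LVLMs of comparable scale. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.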

Towards Responsibly Non-Compliant Machines

2606.12147v1 by Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly declining to comply with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to achieving responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, as well as careful tracking of security risks and liability transfers.

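Abstract: We consider the problem of engineering autonomous intelligent agents that can responsibly decline to comply with user requests. We argue that machine non-compliance takes many different forms, and we outline the issues that should be pursued on the way to responsibly non-compliant intelligent machines. We ground responsible non-compliance in justifications for refusing a task, pathways for overriding the non-compliance, and careful tracking of security risks and transfers of liability.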

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

2606.12006v1 by Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.

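Abstract: Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making and is usually addressed through survival analysis. Although classical statistical methods and deep learning methods have been studied extensively, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations of structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, because typical applications are limited to discrete classification rather than survival analysis tasks. In this study, we propose a lightweight adaptation approach that applies tabular foundation models to clinical survival analysis by training a survival-aware head directly on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them with a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate the approach on a diverse set of public survival benchmarks and on two large ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared with strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, with gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and indicate that tabular foundation models offer a practical and effective alternative for clinical survival prediction.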
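For readers unfamiliar with the metric, the sketch below implements Harrell's concordance index for right-censored data and recomputes the relative gains quoted in the abstract from the reported C-index values. The toy survival data are invented; only the four C-index pairs come from the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event and
    its time is strictly earlier than subject j's time.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def relative_gain_pct(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours / baseline - 1.0)

# Toy cohort: the third subject is censored (event flag 0).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.2, 0.4]
print(concordance_index(times, events, risks))  # 0.8

# Gains reported in the abstract, recomputed from the C-index values.
print(round(relative_gain_pct(0.856, 0.844), 1))  # 1.4
print(round(relative_gain_pct(0.856, 0.802), 1))  # 6.7
print(round(relative_gain_pct(0.797, 0.784), 1))  # 1.7
print(round(relative_gain_pct(0.797, 0.749), 1))  # 6.4
```

The recomputation confirms that the percentages in the abstract are relative, not absolute, differences in C-index.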

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

2606.11930v2 by Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.

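Abstract: Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment, because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution to the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Rather than fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for text representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation mean squared error (MSE) of 0.2696, better than the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches an accuracy of 0.5781, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.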
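Late fusion combines the predictions of separately trained per-modality models rather than their features. The sketch below shows a weighted-average version for a single trait and recomputes the 19.1% relative MSE reduction from the two values in the abstract. The per-modality predictions and the equal weighting are hypothetical; the abstract does not state how the authors weight modalities.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def late_fusion(per_modality_preds, weights=None):
    """Weighted average of per-modality predictions for one trait."""
    n_modalities = len(per_modality_preds)
    if weights is None:
        weights = [1.0 / n_modalities] * n_modalities
    n_samples = len(per_modality_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, per_modality_preds))
        for i in range(n_samples)
    ]

def relative_reduction_pct(baseline, ours):
    """Relative error reduction of `ours` versus `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Hypothetical predictions for one trait on two interviewees.
visual = [3.0, 4.0]
acoustic = [3.4, 3.6]
text = [3.2, 4.4]
fused = late_fusion([visual, acoustic, text])
print(mse([3.2, 4.0], fused) < mse([3.2, 4.0], visual))  # True

# Reported numbers: official baseline 0.3334 vs. per-trait late fusion 0.2696.
print(round(relative_reduction_pct(0.3334, 0.2696), 1))  # 19.1
```

Fitting one fusion per trait, as the ablation describes, lets each trait lean on whichever modality carries its signal.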

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

2606.11830v1 by Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

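Abstract: Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs were directionally higher than native-AI outputs in expert overall quality (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Agreement between experts was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. In this exploratory sample, autonomous skill access showed a directional quality signal, but the signal was smaller than the noise in expert ratings and should not be interpreted as confirmatory evidence. The findings mainly motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and assessment of biological validity.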
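The results are summarized with a Welch test and a bootstrap confidence interval for the difference in means. The sketch below shows how both quantities are typically computed with the standard library alone. The ratings are invented, the resampling unit is assumed to be the individual output, and the p-value is omitted because it needs a Student-t CDF that the standard library does not provide.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical quality ratings (not the study's data).
skill_augmented = [5.0, 6.0, 5.5, 6.5, 5.0, 5.5]
native = [5.0, 4.5, 5.5, 5.0, 6.0]

t_stat, dof = welch_t(skill_augmented, native)
ci_low, ci_high = bootstrap_ci(skill_augmented, native)
print(t_stat, dof)
print(ci_low, ci_high)
```

An interval that straddles zero, as both reported intervals do, is consistent with the authors' caution that the signal is directional rather than confirmatory.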

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

2606.11794v1 by Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.

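Summary: Neurodegenerative diseases such as Alzheimer's disease (AD) need accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-consuming and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated on cohort-stratified splits drawn from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was built from subjects excluded from all training, validation, preprocessing, and hyperparameter tuning, and subject-level splitting was used throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model reached slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and the strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses with Grad-CAM++ and SHAP showed anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression is a robust, interpretable, and scalable approach to automated AD severity staging and AI-assisted clinical decision support.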
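The abstract above compares models by quadratic weighted kappa (QWK), adjacent-stage accuracy, and MAE. The sketch below shows how these metrics are commonly computed. It assumes stages are encoded as integers 0..K-1 and that "adjacent-stage accuracy" counts a prediction as correct when it is within one stage of the label. Both are assumptions for illustration, not the paper's evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]  # label marginals
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n  # expected count under independence
    # den is zero only if all labels and predictions fall in a single class.
    return 1.0 - num / den


def adjacent_accuracy(y_true, y_pred):
    """Fraction of predictions within one stage of the true stage (assumed definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean absolute distance between predicted and true stage indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because QWK penalizes errors by squared stage distance while MAE penalizes them linearly, a model can have the lowest MAE without having the highest QWK, which is the pattern the abstract reports.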

LLM

Publish Date	Title	Authors	Homepage	Code
2026-06-17	Native Active Perception as Reasoning for Omni-Modal Understanding	Zhenghao Xing et.al.	2606.19341v1	null
2026-06-17	Learning User Simulators with Turing Rewards	Yingshan Susan Wang et.al.	2606.19336v1	null
2026-06-17	Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States	Denis Peskoff et.al.	2606.19334v1	null
2026-06-17	UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning	Mohamed Nabail et.al.	2606.19328v1	null
2026-06-17	Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation	Siyi Gu et.al.	2606.19327v1	null
2026-06-17	Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors	Michael Finkelson et.al.	2606.19325v1	null
2026-06-17	Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents	Anoushka Vyas et.al.	2606.19319v1	null
2026-06-17	Explaining Attention with Program Synthesis	Amiri Hayes et.al.	2606.19317v1	null
2026-06-17	Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play	Leyang Shen et.al.	2606.19308v1	null
2026-06-17	Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA	Ikram Belmadani et.al.	2606.19266v1	null
2026-06-17	Structured Inference with Large Language Gibbs	Sanghyeok Choi et.al.	2606.19264v1	null
2026-06-17	A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2	Yijin Wang et.al.	2606.19259v1	null
2026-06-17	DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models	Zirui Wu et.al.	2606.19257v1	null
2026-06-17	X+Slides: Benchmarking Audience-Conditioned Slide Generation	Haodong Chen et.al.	2606.19256v1	null
2026-06-17	OneCanvas: 3D Scene Understanding via Panoramic Reprojection	Bartłomiej Baranowski et.al.	2606.19253v1	null
2026-06-17	TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology	Hannah Le et.al.	2606.19245v1	null
2026-06-17	STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability	Haipeng Luo et.al.	2606.19236v1	null
2026-06-17	Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning	Chenyu Zhou et.al.	2606.19222v1	null
2026-06-17	Machine Unlearning for the XGBoost Model with Network Intrusion Datasets	Diana Magalhães et.al.	2606.19220v1	null
2026-06-17	RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering	Pushwitha Krishnappa et.al.	2606.19218v1	null
2026-06-17	Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times	Giuseppe Gabriele et.al.	2606.19199v1	null
2026-06-17	Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis	Soheyl Bateni et.al.	2606.19183v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et.al.	2606.19174v1	null
2026-06-17	User as Engram: Internalizing Per-User Memory as Local Parametric Edits	Bojie Li et.al.	2606.19172v1	null
2026-06-17	Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition	Shiho Matta et.al.	2606.19170v1	null
2026-06-17	Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection	Jinhan Li et.al.	2606.19168v1	null
2026-06-17	Essential Subspace Merging for Multi-Task Learning	Longhua Li et.al.	2606.19164v1	null
2026-06-17	IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages	Sakshi Joshi et.al.	2606.19157v1	null
2026-06-17	AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces	Zongmin Zhang et.al.	2606.19152v1	null
2026-06-17	OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems	Till Richter et.al.	2606.19145v1	null
2026-06-17	Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction	Jingyi Zhou et.al.	2606.19144v1	null
2026-06-17	Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation	Ramza Basharat et.al.	2606.19139v1	null
2026-06-17	A Technical Taxonomy of LLM Agent Communication Protocols	Linus Sander et.al.	2606.19135v1	null
2026-06-17	Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening	Kasper Helverskov Petersen et.al.	2606.19133v1	null
2026-06-17	Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions	Hui Zhang et.al.	2606.19121v1	null
2026-06-17	Analysing drivers and interdependencies in European electricity markets using XAI	Antoine Pesenti et.al.	2606.19118v1	null
2026-06-17	Towards an Agent-First Web: Redesigning the Web for AI Agents	Eranga Bandara et.al.	2606.19116v1	null
2026-06-17	Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams	Haewoon Kwak et.al.	2606.19111v1	null
2026-06-17	ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL	Mukund Khanna et.al.	2606.19103v1	null
2026-06-17	ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection	Enrico Cassano et.al.	2606.19079v1	null
2026-06-17	Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science	Qiuyu Fang et.al.	2606.19051v1	null
2026-06-17	Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration	Xhevahire Tërnava et.al.	2606.19042v1	null
2026-06-17	A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors	David Aaron Evans et.al.	2606.19026v1	null
2026-06-17	FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs	Lorenzo Sani et.al.	2606.19025v1	null
2026-06-17	Sumi: Open Uniform Diffusion Language Model from Scratch	Mengyu Ye et.al.	2606.19005v1	null
2026-06-17	Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training	Ruiqi Lai et.al.	2606.19004v1	null
2026-06-17	Enhancing Multilingual Reasoning via Steerable Model Merging	Zhuoran Li et.al.	2606.19002v1	null
2026-06-17	TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction	Moon Ye-Bin et.al.	2606.18996v1	null
2026-06-17	G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment	Fengying Ye et.al.	2606.18989v1	null
2026-06-17	ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection	Jinhao Song et.al.	2606.18988v1	null
2026-06-17	Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering	Yafeng Wu et.al.	2606.18986v1	null
2026-06-17	Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment	Franziska Braun et.al.	2606.18979v1	null
2026-06-17	CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System	Marco Becattini et.al.	2606.18976v1	null
2026-06-17	A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI	Syed Mujtaba Haider et.al.	2606.18970v1	null
2026-06-17	GraphPO: Graph-based Policy Optimization for Reasoning Models	Yuliang Zhan et.al.	2606.18954v1	null
2026-06-17	RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models	San Kim et.al.	2606.18950v1	null
2026-06-17	Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents	Emmanuel Aboah Boateng et.al.	2606.18947v1	null
2026-06-17	SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents	Jingkun Luo et.al.	2606.18946v1	null
2026-06-17	Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking	Pierre Dantas et.al.	2606.18941v1	null
2026-06-17	SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety	Linghao Feng et.al.	2606.18936v1	null
2026-06-17	TransitNet: A Compact Attention-Augmented Deep Learning Framework for Low-SNR Transit Blind Searches	Xingchen Yan et.al.	2606.18932v1	null
2026-06-17	As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language	Jasmine Owers et.al.	2606.18922v1	null
2026-06-17	REVES: REvision and VErification--Augmented Training for Test-Time Scaling	Yuanxin Liu et.al.	2606.18910v1	null
2026-06-17	SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation	Jiangnan Xia et.al.	2606.18897v1	null
2026-06-17	Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction	Zhuangzhuang Pan et.al.	2606.18893v1	null
2026-06-17	Skill-Guided Continuation Distillation for GUI Agents	Zhimin Fan et.al.	2606.18890v1	null
2026-06-17	Improving Medical Communication using Rubric-Guided Counterfactual Recommendations	Adrian Cosma et.al.	2606.18889v1	null
2026-06-17	Generative-Model Predictive Planning for Navigation in Partially Observable Environments	Thomas Quilter et.al.	2606.18888v1	null
2026-06-17	Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems	Bernardo Feijó Junqueira et.al.	2606.18882v1	null
2026-06-17	Efficient Financial Language Understanding via Distillation with Synthetic Data	Wen-Fong et.al.	2606.18875v1	null
2026-06-17	Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness	Zijian Wang et.al.	2606.18874v1	null
2026-06-17	Scaling Learning-based AEB with Massive Unlabeled Data	Xiangyu Wang et.al.	2606.18864v1	null
2026-06-17	URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification	Xinze Zhang et.al.	2606.18861v1	null
2026-06-17	Approximate Structured Diffusion for Sequence Labelling	Nicolas Floquet et.al.	2606.18856v1	null
2026-06-17	ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement	Bohou Zhang et.al.	2606.18850v1	null
2026-06-17	WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents	Yehang Zhang et.al.	2606.18847v1	null
2026-06-17	Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems	Hehai Lin et.al.	2606.18837v1	null
2026-06-17	Target-confidence Recourse Using tSeTlin machines: TRUST	K. Darshana Abeyrathna et.al.	2606.18832v1	null
2026-06-17	Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning	Xiaoyue Xu et.al.	2606.18831v1	null
2026-06-17	GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents	Zhe Ren et.al.	2606.18829v1	null
2026-06-17	Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation	Chenghao Xu et.al.	2606.18828v1	null
2026-06-17	Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets	Jiaxi Liu et.al.	2606.18820v1	null
2026-06-17	SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface	Gourav Siddhad et.al.	2606.18816v1	null
2026-06-17	Reinforcement Learning Foundation Models Should Already Be A Thing	Abdelrahman Zighem et.al.	2606.18812v1	null
2026-06-17	Rescaling MLM-Head for Neural Sparse Retrieval	Youngjoon Jang et.al.	2606.18811v1	null
2026-06-17	Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards	Yingyu Shan et.al.	2606.18810v1	null
2026-06-17	ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch	Tengfei Lyu et.al.	2606.18803v1	null
2026-06-17	SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval	Youngjoon Jang et.al.	2606.18801v1	null
2026-06-17	Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports	Qingyu Lu et.al.	2606.18797v1	null
2026-06-17	HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space	Jaward Sesay et.al.	2606.18788v1	null
2026-06-17	RedactionBench	Sean Brynjólfsson et.al.	2606.18782v1	null
2026-06-17	Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation	Shanshan Lyu et.al.	2606.18781v1	null
2026-06-17	SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction	Quanjiang Guo et.al.	2606.18780v1	null
2026-06-17	Private Learning with Public Feature Conditioning	Shuli Jiang et.al.	2606.18773v1	null
2026-06-17	Output Vector Editing for Memorization Mitigation in Large Language Models	Ahmad Dawar Hakimi et.al.	2606.18767v1	null
2026-06-17	Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs	Chris Lee et.al.	2606.18747v1	null
2026-06-17	What Must Generalist Agents Remember?	Khurram Yamin et.al.	2606.18746v1	null
2026-06-17	SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents	Qiao Zhao et.al.	2606.18733v1	null
2026-06-17	LegalWorld: A Life-Cycle Interactive Environment for Legal Agents	Songhan Zuo et.al.	2606.18728v1	null
2026-06-17	Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring	Fang Wang et.al.	2606.18726v1	null

Abstracts

Native Active Perception as Reasoning for Omni-Modal Understanding

2606.19341v1 by Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

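Summary: Passive models for long-video understanding typically follow a "watch-it-all" paradigm that processes frames uniformly regardless of query difficulty, so computational cost grows with video duration. Interactive frameworks have emerged, but they often depend on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent, which formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To make this concrete, we introduce (1) Agentic Supervised Fine-Tuning, which bootstraps native active perception through best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which uses turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling: performance improves as the number of reasoning turns increases, which validates the effectiveness of active perception. Empirical results on ten benchmarks (e.g., VideoMME, LVBench) show that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).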

Learning User Simulators with Turing Rewards

2606.19336v1 by Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains--conversational chat and Reddit forum discussion--we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

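Summary: Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches usually train a large language model (LLM) to match a single ground-truth response, either by maximizing log probability or by using a similarity reward. We instead propose Turing-RL, a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward, with an LLM judge scoring how hard a generated response is to distinguish from the real user's response given the user's history, and the user simulator LLM learns from these rewards to produce responses indistinguishable from what the user could have said. Across two different domains, conversational chat and Reddit forum discussion, we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.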

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

2606.19334v1 by Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes and contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

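Summary: Progress in legal AI increasingly depends on large-scale access to authoritative legal text. Yet one of the most consequential layers of American law is still largely missing from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other areas of everyday regulation, but they are scattered across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, which can be released to researchers, represents nearly all publicly available municipal and county ordinance codes and contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer covers the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use optical character recognition (OCR) to handle the many document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a set of ModernBERT-based classifiers and scorers to support analysis of U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1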

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

2606.19328v1 by Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.

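Summary: Preference-based reinforcement learning offers a way to learn reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially in the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories with a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit trade-off between exploitation and information acquisition without ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show that UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.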
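The UBP2 abstract describes scoring candidate trajectories with a single value that combines expected reward, terminal value, and epistemic uncertainty estimated from ensembles. The sketch below illustrates that general idea with an optimism-style bonus. The additive form, the use of ensemble standard deviation as the uncertainty estimate, the weight `beta`, and all names are assumptions made for illustration; the abstract does not specify the actual objective.

```python
from statistics import mean, pstdev


def unified_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory.

    reward_preds: one predicted trajectory return per ensemble member
    value_preds:  one predicted terminal value per ensemble member
    Ensemble disagreement (population std) stands in for epistemic
    uncertainty; this choice is hypothetical, not taken from the paper.
    """
    exploit = mean(reward_preds) + mean(value_preds)
    explore = pstdev(reward_preds) + pstdev(value_preds)
    return exploit + beta * explore


def plan(candidates, beta=1.0):
    """Return the name of the candidate trajectory with the highest score."""
    return max(candidates, key=lambda name: unified_score(*candidates[name], beta=beta))


# Hypothetical ensemble predictions for two candidate trajectories.
candidates = {
    "well_understood": ([1.2, 1.2, 1.2], [0.5, 0.5, 0.5]),
    "uncertain": ([0.0, 1.0, 2.0], [0.0, 0.5, 1.0]),
}
```

With `beta=0` the planner is purely exploitative and picks the well-understood candidate; a larger `beta` shifts the choice toward the candidate on which the ensemble disagrees, which is the exploitation versus information-acquisition trade-off the abstract refers to.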

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

2606.19327v1 by Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

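Summary: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, by contrast, typically compresses evaluative feedback into a scalar signal, which hides which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that uses rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the only supervision target. Instead, rubrics specify what a strong response should satisfy, enabling finer-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate the framework as a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.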

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

2606.19325v1 by Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

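Summary: Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems run inside speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt describing an entire multi-speaker audio scene. Building on such a foundation model lets us inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference through acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark and show that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of a general-purpose audio model conditioned on a free-form scene description over passing structured dialogue scripts through a speech-only pipeline.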

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

2606.19319v1 by Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.

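Summary: Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts, who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory to reuse experience, and surface each artifact for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode on seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an execution-grounded architecture built on ACAs and a shared memory generalizes across the data intelligence workload, with adaptation confined to natural-language instructions.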

Explaining Attention with Program Synthesis

2606.19317v1 by Amiri Hayes, Belinda Li, Jacob Andreas

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.

摘要：一項長期以來的研究目標是用人類可理解的符號描述來取代不透明的神經計算。在本文中，我們提出了一種方法，用可執行程序來近似深度網絡組件的行為。我們專注於Transformer語言模型中的注意力頭。對於給定的頭，我們首先在一組隨機選擇的訓練示例上計算其相關的注意力矩陣。接下來，我們用這些矩陣的摘要來提示一個預訓練的語言模型，並指示它生成一組 Python 程序，這些程序可以僅根據輸入句子的文本重現相關的注意力模式。最後，我們根據我們的最終程序集在保留輸入上的行為預測的準確性來重新排序這些程序。我們證明，少於 1,000 個這樣生成的程序可以重現 GPT-2、TinyLlama-1.1B 和 Llama-3B 中頭部的注意力模式，在 TinyStories 上達到超過 75% 的平均交集-聯合相似度。此外，最佳擬合的程序可以在不顯著影響模型行為的情況下替代神經注意力頭：在三個模型中用程序替代品替換 25% 的注意力頭僅會產生 16% 的平均困惑度增加，同時在各種下游問題回答基準上保持性能。這項工作貢獻了一個可擴展的管道，用於使用人類可讀的可執行代碼逆向工程Transformer模型中的注意力頭，推進了神經模型中符號透明度的道路。
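
As a rough illustration of the evaluation idea in the abstract above, the sketch below pairs a hand-written "program" that predicts a previous-token attention pattern with an Intersection-over-Union score against an observed matrix. The function names, the 0.5 binarization threshold, and the toy matrix are all assumptions made for this example; the paper's actual program format and IoU definition may differ.

```python
from typing import List

Matrix = List[List[float]]

def previous_token_program(tokens: List[str]) -> Matrix:
    """Hypothetical surrogate: each token attends to the token before it."""
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        j = max(i - 1, 0)  # the first token attends to itself
        pattern[i][j] = 1.0
    return pattern

def iou(predicted: Matrix, observed: Matrix, threshold: float = 0.5) -> float:
    """IoU over cells whose weight exceeds the threshold (an assumed binarization)."""
    intersection = union = 0
    for p_row, o_row in zip(predicted, observed):
        for p, o in zip(p_row, o_row):
            p_on, o_on = p > threshold, o > threshold
            intersection += p_on and o_on
            union += p_on or o_on
    return intersection / union if union else 1.0

# Toy attention matrix (invented values), rows = query tokens, columns = key tokens.
tokens = ["the", "cat", "sat"]
observed = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
]
score = iou(previous_token_program(tokens), observed)
```

Scoring on binarized cells keeps the comparison independent of exact attention magnitudes; a soft overlap over the raw weights would be an equally plausible choice.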

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

2606.19308v1 by Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.

摘要：大型語言模型（LLM）基礎的多代理系統（MAS）在解決執行複雜度的任務中展現了巨大的潛力，通過將子任務分配給合作的代理。然而，這種分而治之的範式在決策任務上表現不佳，而這些任務在現實世界中也很普遍。這些任務需要所有相關利益相關者的立場進行同時推理，這些決策是相互依賴的，因此無法孤立地解決。我們將這一挑戰稱為立場交纏，這是一種不同於執行複雜度的決策複雜度形式。為了解決這個問題，我們提出了多代理虛擬遊戲（MAFP），這是一種新穎的MAS範式，將利益相關者的立場表示為代理，並將決策制定形式化為尋求均衡的過程。基於虛擬遊戲的博弈論原則，MAFP通過最佳響應其他代理過去決策的經驗混合，迭代更新每個代理的決策。這使得代理能夠暴露並解決彼此的弱點，逐步提高決策質量和穩健性。我們在挑戰性的決策任務上評估MAFP，這些任務測試在行動之前決定競爭場景策略的能力。MAFP在兩個互補指標——錦標賽強度和穩健性上，超越了單輪和多輪基準，展示了其在解決立場交纏方面的有效性。
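
The game-theoretic principle the abstract builds on can be sketched with classic fictitious play on rock-paper-scissors. This is not MAFP itself: the LLM agents are replaced here by an explicit payoff matrix, and the starting counts and tie-breaking rule are assumptions chosen for the example.

```python
from typing import List

# Row player's payoff in rock-paper-scissors; the column player receives the negative.
PAYOFF = [
    [0, -1, 1],   # rock vs. rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def best_response(opponent_counts: List[int], as_row: bool) -> int:
    """Pick the action with the highest payoff against the opponent's empirical mixture."""
    def value(action: int) -> float:
        if as_row:
            return sum(PAYOFF[action][b] * c for b, c in enumerate(opponent_counts))
        return sum(-PAYOFF[a][action] * c for a, c in enumerate(opponent_counts))
    return max(range(3), key=value)  # ties go to the lowest index

def fictitious_play(rounds: int) -> List[List[float]]:
    """Run simultaneous-update fictitious play and return both empirical mixtures."""
    counts = [[1, 0, 0], [1, 0, 0]]  # assumed prior: each player has seen one "rock"
    for _ in range(rounds):
        row_action = best_response(counts[1], as_row=True)
        col_action = best_response(counts[0], as_row=False)
        counts[0][row_action] += 1
        counts[1][col_action] += 1
    return [[c / sum(player) for c in player] for player in counts]

row_mix, col_mix = fictitious_play(rounds=20000)
```

In this zero-sum game the empirical action frequencies drift toward the uniform mixture, even though the actions played in any single round keep cycling.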

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2606.19266v1 by Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

摘要：大型語言模型（LLMs）的發展使得對其在專業領域和語言中的適應性更加關注，但領域適應策略的有效性仍然不明確。我們以法語醫療問答（QA）為案例，展示了一項醫療領域適應的研究。我們比較了持續預訓練（CPT）、監督微調（SFT）及其組合，在三個模型家族、多個尺寸和三種初始化類型之間，明確區分適應效果與基礎模型選擇的影響。我們在貪婪和受限解碼下，使用自動指標和LLM作為評估者的評價，評估了多選擇題（MCQA）和開放式問答（OEQA）。對於MCQA，CPT+SFT最常達到最佳分數，但相較於SFT的增益較小且經常不具統計顯著性，使得SFT成為一個強大且具成本效益的預設選擇。對於OEQA，CPT始終改善基於重疊的指標，而SFT則常常降低生成質量；指令調整和CPT+SFT在LLM基礎的評估中更受青睞。跨語言實驗進一步顯示了從法語適應到英語基準的有效轉移。總體而言，我們提供了在計算限制下選擇適應策略的實用指導方針。

Structured Inference with Large Language Gibbs

2606.19264v1 by Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

摘要：大型語言模型（LLMs）中編碼的知識可以作為對描述複雜世界的變量進行結構化推理的基礎，但以概率一致的方式訪問這些知識則構成了一個困難的推理問題。我們提出了大型語言吉布斯（Large Language Gibbs），這是一種結構化概率推理的方案，利用LLM的條件分佈作為轉移運算符。我們不是通過單次自回歸生成來抽樣結構化對象，而是使用LLM的下一個標記條件，迭代地重新抽樣基於其他變量的單個變量。這種方法避免了依賴順序的偏見，並產生了一個穩定的分佈，反映了所有局部條件的妥協。我們將這種方法應用於從合成分佈中抽樣、一致性推理任務和貝葉斯結構學習。結果表明，在通過噪聲LLM條件可訪問的世界先驗下，使用LLM條件在MCMC中是一種結構化概率推理的實用替代方案，而非單次生成。
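
The resampling loop described above can be sketched as a plain Gibbs sampler over two binary variables. The function `toy_conditional` is a hypothetical stand-in for an LLM's next-token conditional, and its probabilities are invented; a real system would read them off model logits.

```python
import random
from collections import Counter
from typing import Dict

def toy_conditional(variable: str, others: Dict[str, int]) -> float:
    """Stand-in for an LLM conditional: P(variable = 1 | the other variable).

    The numbers below are invented for the example.
    """
    if variable == "rain":
        return 0.8 if others["wet_grass"] else 0.1
    return 0.9 if others["rain"] else 0.2  # variable == "wet_grass"

def gibbs(steps: int, seed: int = 0) -> Counter:
    """Sweep over the variables, resampling each one given the current value of the rest."""
    rng = random.Random(seed)
    state = {"rain": 0, "wet_grass": 0}
    visits: Counter = Counter()
    for _ in range(steps):
        for variable in state:
            others = {k: v for k, v in state.items() if k != variable}
            state[variable] = int(rng.random() < toy_conditional(variable, others))
        visits[(state["rain"], state["wet_grass"])] += 1
    return visits

visits = gibbs(steps=20000)
rain_frequency = sum(n for (rain, _), n in visits.items() if rain) / 20000
```

The visit counts approximate the chain's stationary distribution over joint states, which is the object the abstract contrasts with a single left-to-right generation pass.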

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

2606.19259v1 by Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

摘要：文本豐富的圖像通常包含與隱私相關、交易性或決策相關的信息。隨著最近的多模態圖像生成模型越來越能合成現實的文本內容和結構化的視覺設計，檢測AI生成的文本豐富圖像已成為數位信任和內容真實性的重要挑戰。然而，現有的基準主要集中在物體中心的圖像上，對於文本語義和佈局組織為中心的場景覆蓋有限。在本文中，我們介紹了一個多領域基準，用於檢測由OpenAI的GPT Image 2生成的文本豐富圖像。該基準包含8,602幅圖像，涵蓋六個代表性類別：商業海報、信息圖表、學術海報、收據、表格和用戶界面截圖。利用這個基準，我們在零樣本設置中評估了五個代表性的AI生成圖像檢測器，並分析了它們的整體、類別和後處理穩健性。我們的結果顯示檢測器性能高度依賴於領域：在某些類別中表現良好的方法往往在其他類別中失敗，即使是最強的傳統檢測器對JPEG壓縮也表現出嚴重的敏感性。我們進一步進行了一項探索性評估，使用多模態視覺-語言模型，揭示了其在結構化格式上的潛力和局限性。這些發現突顯了現代AI生成圖像需要文本和佈局感知檢測方法。我們的數據集已在XXX發布。

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

2606.19257v1 by Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.

摘要：區塊擴散語言模型通過平行區塊去噪加速解碼，但它們是否能可靠地擴展以進行長鏈思考（CoT）推理仍未解決。為此，我們開發了DreamReasoner-8B，一個開源的區塊擴散推理模型，並系統性地研究訓練和推理區塊大小如何影響長CoT推理。我們的分析揭示了顯著的性能差異：使用大區塊大小進行訓練會產生極差的推理，而小區塊大小則能保持有效的推理。為了彌補這一粒度差距，我們提出了區塊大小課程學習，該方法逐步將訓練從細粒度轉變為粗粒度的區塊大小，從而克服這一限制並實現強大的推理性能，能夠在不同的推理區塊大小中進行泛化。在數學和代碼推理基準上，DreamReasoner-8B的結果與領先的開放自回歸模型如Qwen3-8B相媲美。這項工作為高效、具推理能力的擴散語言模型建立了實用的基礎。我們在https://github.com/DreamLM/DreamReasoner上發布了我們的模型。

X+Slides: Benchmarking Audience-Conditioned Slide Generation

2606.19256v1 by Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $τ_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

摘要：自動從來源文件生成簡報是大型語言模型（LLMs）的一個重要應用。現有的基準主要評估簡報的完整性和技術深度，卻忽略了目標受眾作為一個關鍵的現實因素。例如，專家要求嚴謹的證明，而決策者則優先考慮可行的結論。為了填補這一空白，我們推出了 X+Slides，一個專門為受眾條件化簡報生成而設計的基準。X+Slides 建立在涵蓋 113 個主題和七個演示場景的多樣化語料庫上，採用由 8,133 個去重的、基於來源的探測器構建的動態評估框架。通過為相同的基於來源的探測器分配特定於受眾的效用權重，X+Slides 報告了四個互補指標：受眾覆蓋率衡量傳達了多少受眾必需的信息，領域覆蓋率顯示了哪些信息類型被涵蓋，效率衡量每單位注意成本所提供的效用，而正確性驗證簡報中的主張是否得到來源的支持。在 DeepPresenter、SlideTailor 和 NotebookLM 上的實驗顯示，當前系統可以恢復大量但仍不完整的受眾必需信息：在 $τ_A=0.7$ 時，DeepPresenter 的最佳受眾覆蓋率達到 0.714，SlideTailor 達到 0.594，而 NotebookLM 的消融實驗達到 0.853，同時顯示出明顯的基礎差異。這些結果表明，視覺質量和廣泛的主題覆蓋不應被視為沒有基於來源的評估支持的證據。

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

2606.19253v1 by Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.

摘要：現有的視覺-語言模型 (VLMs) 在 3D 場景理解方面要麼依賴於複雜的模型特定幾何編碼器，要麼需要大量的訓練預算以追求空間推理。相反，OneCanvas 將所有視圖的補丁特徵聚合到一個單一的等距全景畫布上。具體而言，每個補丁使用其深度和相機姿勢被反投影到 3D 世界坐標系中，然後根據從畫布原點看到的該點的連續經度和緯度放置在畫布上，並且不會在重疊視圖之間進行光柵化或聚合。因此，來自所有幀的補丁共享一個空間坐標系，沒有融合或對主幹的重大架構修改。預訓練的 VLM 像處理普通圖像一樣消耗這種表示。由於畫布可以以任何感興趣的姿勢為中心，這種表示直接支持從特定視點進行的情境推理，這在機器人技術和具身 AI 中是一個常見需求。得益於這種表示，我們還可以引入空間預訓練課程：通過程序性地將來自真實圖像的物體補丁特徵放置在選定的 3D 世界位置上，並在其他空白畫布上生成即時監督，涵蓋廣泛的空間推理任務，並控制答案分佈以減少空間推理捷徑。OneCanvas 在 SQA3D 和 VSI-Bench 上達到了最先進的準確性，並在 SPBench 上對分佈外數據進行了泛化，使用的訓練計算量比最強競爭方法少一個量級。
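
The placement step described above (unproject a patch using depth and camera pose, then read off its longitude and latitude from the canvas origin) can be sketched as follows. The pinhole camera model, the axis convention (+z forward, longitude measured from +z toward +x), and all function names are assumptions for illustration, not details confirmed by the abstract.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float,
              rotation: Sequence[Sequence[float]],
              translation: Sequence[float]) -> Vec3:
    """Lift pixel (u, v) with metric depth to world coordinates (pinhole model)."""
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # world = R @ cam + t, with R and t taken as the camera-to-world pose
    world = [sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
             for i in range(3)]
    return (world[0], world[1], world[2])

def to_canvas(point: Vec3, origin: Vec3) -> Tuple[float, float, float]:
    """Return (longitude, latitude, radius) of a world point seen from the origin."""
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    if radius == 0.0:
        raise ValueError("point coincides with the canvas origin")
    longitude = math.atan2(dx, dz)     # 0 along +z, in (-pi, pi]
    latitude = math.asin(dy / radius)  # in [-pi/2, pi/2]
    return longitude, latitude, radius

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = unproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                  rotation=identity, translation=[0.0, 0.0, 0.0])
longitude, latitude, radius = to_canvas(point, origin=(0.0, 0.0, 0.0))
```

The angular coordinates alone discard distance, which is why the sketch also returns `radius`: it is the quantity that the abstract's 3D position embedding is said to restore.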

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

2606.19245v1 by Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

摘要：人工智慧 (AI) 代理承諾通過壓縮解釋和決策循環來加速藥物發現，但實際部署需要對現實程序決策進行可信評估。我們介紹了治療基準前臨床藥理學 (TxBench-PP)，這是一個可驗證的小分子前臨床藥理學基準，也是更廣泛的治療基準在藥物發現階段和治療模式中的首個專注切片。TxBench-PP 測試代理是否能從現實世界的測定數據中恢復準確的結論，而不是從文獻中記憶的事實。該基準包含 100 個評估，按程序階段、測定類型和任務結構進行索引，涵蓋作用機制 (MoA) 和藥效學 (PD) 推理、化合物-靶標相互作用、因果靶標驗證、可開發性和安全性以及轉化療效。代理接收現實工作流程快照，在編碼環境中檢查文件，並返回結構化的答案，這些答案是確定性評分的。在 16 種模型架構配置中，包括 11 種模型和 4,800 條軌跡，沒有系統可靠地恢復前臨床藥理學決策。最強的配置，Claude Opus 4.8 / Pi，通過了 59.3% 的終點嘗試 (178/300; 95% CI, 51.1-67.6)，其後是 GPT-5.5 / Pi，通過率為 55.3% (166/300; 47.0-63.6)。

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

2606.19236v1 by Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.

摘要：強化學習與可驗證獎勵算法如 GRPO 已成為 LLMs 中複雜推理的主導後訓練範式，但在訓練過程中常常遭遇政策熵崩潰。我們對 GRPO 下的標記級熵動態進行了一階梯度分析，並識別出標記級信用分配不匹配：每個標記的熵變化分解為軌跡級優勢與下一標記分佈的熵敏感度函數的乘積，產生優勢-驚訝四象限結構和近臨界性特性。受到此啟發，我們提出了 STARE（驚訝引導的標記級優勢重加權以穩定政策熵），它通過批內驚訝分位數識別熵關鍵標記子集，選擇性地重新加權其有效優勢，並納入目標熵閉環門以穩定熵調節。在從 1.5B 到 32B 的模型規模及三個任務系列（短期 CoT、長期 CoT 和多輪工具使用）中，STARE 在數千步的穩定 RL 訓練中持續保持政策熵在目標範圍內。在 AIME24 和 AIME25 上，STARE 在平均準確率上超越 DAPO 和其他競爭基準 4%-8%，反映標記和回應長度同步增長，顯示出持續的探索-利用平衡，進一步釋放 RL 訓練潛力。代碼可在 https://github.com/hp-luo/STARE 獲得。

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

2606.19222v1 by Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.

摘要：我們提出了MAST（機制對齊選擇性目標），這是一種機制引導的方法，用於消除RLVR引起的推理，其附帶損害遠低於標準的全參數更新。在Qwen2.5-Math-1.5B和Qwen3-1.7B-Base的匹配SFT/RLVR檢查點中，SFT到RLVR的增量在標記級別的delta-log-probability上與SFT更新有明顯差異，而全參數梯度上升僅通過損害保留MATH和GSM8K來遺忘。 MAST根據非主能量、更新幅度和遺忘梯度耦合幅度對注意力投影張量進行排序，然後僅更新排名最高的子集。在主要模型上，MAST引發了統計上顯著的目標遺忘（MATH從45/150遺忘到37/150；McNemar p=0.0078），同時保留GSM8K（+0.8 pp）和MATH（-0.5 pp）。這一優勢在不同的隨機種子、NPO/SimNPO目標和Qwen3中重現，其中MAST保留了GSM8K，而全參數的消除則使其崩潰。
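
As a sanity check on the reported significance level, the sketch below computes a two-sided exact McNemar test. The discordant counts are an assumption: the abstract only gives 45/150 to 37/150, and the reported p=0.0078 is consistent with 8 items flipping from correct to incorrect and none flipping back, but the paper's actual contingency table is not stated here.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: items correct before and incorrect after; c: the reverse.
    Under the null hypothesis each discordant pair falls either way with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed discordant counts: 8 forgotten items, 0 newly solved items.
p_value = mcnemar_exact(b=8, c=0)
```

With these assumed counts the p-value is 2 × 0.5^8 = 0.0078125, which rounds to the figure quoted in the abstract.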

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

2606.19220v1 by Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

摘要：機器遺忘（MU）已成為一種重要技術，能在不需要全面重訓的情況下，從訓練模型中移除特定數據點。然而，大多數現有的MU研究集中在深度學習和圖像數據上，這在依賴表格數據的網絡入侵檢測領域留下了空白。本研究介紹了XGBoost-Forget，這是一種針對XGBoost模型的遺忘方法，以填補這一空白。該方法在兩個表格網絡入侵（NI）數據集IoT-23和GeNIS上進行評估，使用多種指標來評估模型性能、遺忘效率和遺忘質量。結果顯示，XGBoost-Forget在保持接近原始模型的預測性能的同時，提供了顯著更快的遺忘速度，展示了其在表格NI環境中進行MU的潛力。

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

2606.19218v1 by Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0

摘要：自動化指標是評估LLM生成文本的默認選擇，然而這些指標被默默要求完成兩項任務：區分真實內容的一致性與表面巧合（有效性），以及區分一個更好的系統與一個較差的系統（區分能力）。在開放式、以意見為驅動的問題回答中，這兩者之間存在緊張關係。我們介紹RECOM（Reddit模型對應評估），這是一個無污染的評估數據集，包含15,000條r/AskReddit問題（2025年9月），每個問題都配有其真實的社區回覆，這些回覆的發布時間均在每個被評估模型的訓練截止日期之後。對五個開源LLM（7--10B）進行評分，針對每個回覆，每個指標都配有隨機擾動的噪聲基準，我們發現沒有任何指標能很好地完成這兩項任務。餘弦相似度能夠區分真實與隨機答案（Cohen's $d \approx 2$），但無法對五個模型進行排名（$|d| < 0.1$）；BERTScore的精確度似乎能對模型進行排名（原始$|d|$高達0.63），但一旦控制了回應長度，這一數據就崩潰至$|d| = 0.09$，其有效性也很弱（$d \approx 0.8$，相比之下，餘弦的$\approx 2$）。因為每個指標對相同的輸出進行評分，這種有效性與區分能力的權衡是指標的特性，而非模型的特性，我們認為這源於表示設計。三位獨立的LLM評審重現了有效性差距，同樣僅弱地區分這五個模型。我們建議在兩個軸上報告指標，並設置明確的隨機基準底線。RECOM可在https://anonymous.4open.science/r/recom-D4B0公開獲得。
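
The two ingredients named in the abstract, a random-derangement noise floor and Cohen's d, can be sketched in a few lines. The rejection-sampling derangement, the pooled-variance form of d, and the score values are assumptions for illustration; the paper may construct its floor and effect sizes differently.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence

def random_derangement(n: int, rng: random.Random) -> List[int]:
    """Rejection-sample a permutation with no fixed points (requires n >= 2)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm

def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

rng = random.Random(0)
replies = ["reply 0", "reply 1", "reply 2", "reply 3"]
shuffled = random_derangement(len(replies), rng)
# Noise floor: score answer i against a reply written for a different question.
floor_pairs = [(i, replies[j]) for i, j in enumerate(shuffled)]

# Invented metric scores against the true reply and against the deranged reply.
real_scores = [2.0, 4.0, 6.0]
floor_scores = [1.0, 3.0, 5.0]
validity_d = cohens_d(real_scores, floor_scores)
```

A derangement is used rather than a plain shuffle so that no answer is accidentally scored against its own reply, which would leak real signal into the floor.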

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

2606.19199v1 by Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder

The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging -- e.g., based on reinforcement learning (RL) -- can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.

摘要：最近電動車 (EV) 採用的增長對電力系統帶來了挑戰，包括峰值需求的增加和潛在的電網不穩定性。智能控制 EV 充電——例如，基於強化學習 (RL)——可以通過從歷史數據中學習時間和上下文模式來緩解這些問題。然而，在現實場景中，關鍵特徵，如出發時間，通常是不可用的。這反過來使得 RL 代理學習和執行有效的充電策略變得更加困難。為了減輕這種不確定性，經過訓練的預測器可以從可用數據中近似未知特徵。然而，由於這些預測模型通常是為了準確性而訓練的（而不是它們對下游代理決策質量的影響），因此它們的錯誤可能會傳播並阻礙使用這些預測的控制器的整體性能。為了避免這種情況，我們提出了一種以決策為重點的強化學習 (DF-RL) 框架，其中預測器進行端到端的訓練，即根據 RL 代理所採取的充電政策行動進行反饋。這種預測器和控制器的聯合訓練最終導致更高質量的行動：我們提出的 DF-RL 方法相比其他基準，產生了更優的充電決策；相對於不進行出發時間預測的 RL 方法，總獎勵最多提高 14%，未供應能量（即因為 EV 已經離開而未能充電的情況）減少 55%。

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

2606.19183v1 by Soheyl Bateni, Maryam Abdolali

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

摘要：大型語言模型（LLMs）可以通過解釋自由文本文檔，使臨床決策支持變得更加可及，但它們作為診斷引擎的直接使用受到提示、信息順序和合理但不正確的輸出敏感性的限制。結構化機器學習模型提供了更穩定的風險預測，但它們需要難以與敘述性臨床工作流程集成的表格輸入。我們提出了ClaMPAPP（臨床語言輔助機器學習管道，用於闌尾炎），這是一個混合系統，將LLM用作接口，而不是最終決策者。ClaMPAPP從類似筆記的敘述中提取受架構約束的臨床特徵，應用確定性的合理性檢查，並將經過驗證的特徵傳遞給基於臨床、實驗室和超聲變量訓練的XGBoost分類器。我們在來自德國醫院的兩個獨立兒科闌尾炎隊列上評估了ClaMPAPP，並將其與端到端LLM基準進行比較，包括開源和專有模型。為了在測試自由文本輸入時保留真實情況，敘述是通過模板渲染和受限的LLM重寫從結構化電子健康記錄生成的，並進行了額外的句子順序置換以評估位置穩健性。ClaMPAPP在內部和外部驗證中都實現了最強的整體診斷性能，同時最小化漏診的闌尾炎病例，這是急性分診中的關鍵安全問題。端到端LLM顯示出不穩定的敏感性-特異性權衡，並在敘述重新排序下出現更大的降級。這些結果支持LLM作為接口、機器學習作為預測者的設計，將自然語言的可用性與預測推斷分開，並提供了一個更可審計的臨床決策支持途徑。

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $τ$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub at https://github.com/13204942/SonoRate.

摘要：臨床醫師中心的評估對於驗證醫療人工智慧系統至關重要，尤其是在超聲影像中，定量指標並不總是能夠捕捉臨床可用性。現有的醫療影像平台主要集中於數據集標註。它們缺乏對盲測模型比較和可重複評估工作流程的綜合支持。我們提出了一個臨床醫師中心的管道，用於遠程標註和超聲人工智慧研究中的評估。所提議的管道使用集中式伺服器和輕量級瀏覽器介面，使臨床醫師能夠在不下載本地數據集的情況下進行標註、盲測排名和審查。該管道還支持多評審者參與、集中結果聚合和自動統計分析。我們在一項涉及六位評審者的胎兒超聲分割研究中驗證了該管道，這些評審者的經驗水平涵蓋了專家、通才和非專家。系統自動生成了斯皮爾曼相關係數、肯德爾的 $τ$ 和前一選擇統計數據。結果顯示專家和其他組別之間的協議程度從中等到強。盲測評估結果顯示後期主動學習模型更受青睞。這些結果表明該管道可以支持超聲影像中臨床醫師中心的標註和可重複的人類-AI 評估研究。所提議的管道可在 GitHub 上獲得：https://github.com/13204942/SonoRate。
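
The agreement statistics mentioned above can be sketched for two raters who each rank the same set of models. The rankings are invented, ties are assumed absent, and the function names are illustrative; the SonoRate pipeline's own implementation is not shown in the abstract.

```python
from itertools import combinations
from typing import Sequence

def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

def kendall_tau(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs

def same_top1(rank_a: Sequence[int], rank_b: Sequence[int]) -> bool:
    """True when both raters place the same item at rank 1."""
    return rank_a.index(1) == rank_b.index(1)

# Invented blinded rankings of four models (1 = preferred).
rater_a = [1, 2, 3, 4]
rater_b = [1, 3, 2, 4]
rho = spearman_rho(rater_a, rater_b)
tau = kendall_tau(rater_a, rater_b)
top1_agree = same_top1(rater_a, rater_b)
```

Spearman weights large rank displacements more heavily, while Kendall's tau counts pairwise order flips, so reporting both gives a fuller picture of rater agreement.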

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

2606.19172v1 by Bojie Li

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

摘要：個人記憶在語言模型中面臨兩個問題：內容和推理能力。大腦將這兩者分開（每個事件在海馬體中的稀疏、本地記憶痕跡，以及解釋這些事件的緩慢新皮層），因此一個新事實不需要覆蓋其他所有內容。當前大多數個性化方法將用戶的事實保留在權重之外，存儲在自然語言記憶檔案或檢索索引中。當事實被寫入模型時，標準做法是每用戶的LoRA適配器，這與大腦的運作相反，將內容和技能折疊成一個全局權重增量。將用戶的事實寫為LoRA會污染與其無關的文本；將相同的事實寫為本地記憶痕跡行則在數學上保持不變，從而導致大約33,000倍更小的記憶佔用。

因此，我們提出用戶作為記憶痕跡：將用戶的內容存儲為對記憶模型的哈希鍵記憶表的手術性編輯，並在一個共享適配器中攜帶推理技能。這種分層設計與每用戶的LoRA的直接回憶相匹配，同時在平均上提供5.6倍更高的間接推理準確性，並且從未使單一用戶的推理能力低於未觸及的基礎。編輯是一個玻璃盒：寫入一個事實會在恰好觸發時啟用其查找，添加答案所需的值，並保持其他每個位置不變到最後一位，若寫入錯誤的層則會失敗。由於不同用戶的事實落在不相交的哈希槽中，他們的編輯可以組合：許多用戶同時存在於一個共享表中，進行加法堆疊且無損，而每用戶的LoRA，單一全局權重增量，只允許一個。檢索時，每用戶的記憶表不會隨著檢索者必須搜索的人口增長，因此在超過約100個事實後，它在一個2.5倍更大的模型上超越了檢索管道。

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

2606.19170v1 by Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

摘要：我們介紹 Dango，一個擁有 1.8B 參數的大型語言模型，旨在控制 L1 到 L2（從日語到英語）轉移的第二語言習得（SLA）研究。雖然之前的研究已經探索了語言模型中的 SLA，但它們主要依賴於較小或非解碼器模型，這限制了它們生成開放式文本的能力，並降低了它們作為實用 L2 模擬器的適用性。我們確定了一個關鍵挑戰，即在將模型擴展到這個規模時，L2 污染存在於用於 L1 習得的「單語」預訓練語料庫中。為了解決這個問題，我們提出了一種過濾方法，以減少對英語的過早接觸，同時保持現實的、最小的接觸。然後，我們在 LLM 生成的 L2 學習課程上對模型進行微調，以模擬 L2 習得過程。我們的評估確認 Dango 發展出類似人類的 L2 產出模式，超越了未過濾和標準多語言基準。我們釋放模型、數據和代碼，以促進可重複的計算 SLA 研究和面向學習者的應用。

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

2606.19168v1 by Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

摘要：為了實現大型語言模型（LLMs）的更深層安全對齊，最近的研究努力探討如何將安全干預措施提前到預訓練階段，主要是通過過濾不安全數據或將其重寫為更安全的形式。我們認為，預訓練階段的對齊應該超越僅僅使數據安全：LLMs 可能會將看似無害的知識和能力組合成不安全的行為。為此，我們提出了安全反思預訓練，這是一種預訓練階段的對齊方法，定期將短暫的安全反思插入預訓練語料庫中，以將自我監控直接整合到語言建模中，建立一種基礎能力，隨後通過兼容的後訓練進行強化。我們對在 FineWeb-Edu 上預訓練的 1.7B 模型的實驗顯示，安全反思預訓練提高了安全分類準確性，並大幅降低了推理階段和微調攻擊的成功率。除了我們的實際世界實驗外，我們還介紹了一個完全受控的合成環境 MedSafetyWorld，該環境對安全有明確的定義，並具有一個推理結構，使模型能夠輕鬆地從安全數據中概括不安全的行為。在 MedSafetyWorld 中的消融實驗進一步證明了安全反思預訓練在防止模型基於安全數據概括不安全行為方面的明顯優勢，相較於數據過濾和重寫。綜合來看，我們的研究結果表明，預訓練對齊不僅應使訓練數據安全，還應塑造模型可能從安全數據中獲得的行為。

Essential Subspace Merging for Multi-Task Learning

2606.19164v1 by Longhua Li, Lei Qi, Xin Geng, Qi Tian

Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

摘要：模型合併旨在通過將從相同預訓練檢查點微調的多個模型的能力整合到一個模型中，以實現多任務學習。其核心挑戰是任務特定參數更新之間的相互干擾。在本文中，我們分析了由任務更新引起的輸出變化，並觀察到它們的能量集中在少數主要方向上。我們稱這些方向所跨越的子空間為基本子空間。相對而言，大多數剩餘方向攜帶的任務相關能量很少，但它們在多次任務更新中的累積可能會在合併過程中造成嚴重的干擾。受到這一觀察的啟發，我們提出了基本子空間分解（ESD），該方法根據其激活變化的主成分分解每個任務更新。基於ESD，我們引入了基本子空間合併（ESM），這是一種無需訓練的靜態合併方法，能夠將基本組件正交化並融合成一個緊湊的多任務模型。我們進一步將ESM擴展為ESM++，這是一種無需訓練的動態合併方法，能夠將任務特定的殘差分解為低秩專家，並通過基於原型的路由在前向推理過程中選擇最相關的專家。在多個任務集和模型規模上的大量實驗表明，ESM和ESM++有效地保留了任務知識，同時減少了任務之間的干擾。

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

2606.19157v1 by Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

摘要：AudioLLMs 使得語音識別能夠基於文本提示進行，例如領域描述或實體列表。然而，目前尚不清楚這些模型是否真正利用了這些上下文，還是依賴於在預訓練期間學到的參數知識。現有的基準無法回答這個問題，因為它們在固定的提示條件下評估轉錄，並且很少包含明確的上下文輸入。我們介紹了 IndicContextEval，一個涵蓋 555 位講者的 56 小時多語言自然語音基準，涉及 8 種印度語言和 23 個專業領域。我們設計了一個 7 級提示框架，逐步引入上下文信號，包括元數據、自然語言描述、英語和母語的實體列表，以及包含不正確實體的對抗性提示。對五個模型的評估顯示了上下文利用行為的顯著差異，突顯了對 AudioLLMs 中上下文基礎的明確評估的需求。

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

2606.19152v1 by Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller

Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively -- an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.

摘要：識別最低能量的表面-吸附劑配置對於建模異質催化至關重要，但使用從頭計算的全面探索在計算上是不可行的。機器學習力場（MLFFs）加速結構放鬆，但在廣闊的配置空間中搜索仍然是一個主要瓶頸，而開環大型語言模型（LLM）代理缺乏基於物理的反饋機制來修正錯誤的初始猜測。我們提出了AdsMind（基於機器智能和放鬆反饋的吸附配置發現），這是一個閉環多代理框架，通過MLFF放鬆反饋實現自主錯誤修正。在四個LLM後端中，AdsMind實現了一致的高搜索可靠性，在基準AA20和OCD-GMAE62上的成功率分別為100%和98.8%。相較於其單次通過（1-Shot）消融，它減少了跨後端的能量分散，並且每個案例僅使用4.11和4.67次MLFF放鬆，分別比啟發式枚舉基線減少了約14倍。使用VASP/PBE進行的密度泛函理論（DFT）驗證在六個代表性的AA20系統上顯示，報告的開環Adsorb-Agent輸出對於分子吸附劑顯示出定性吸附能量符號錯誤，而AdsMind在所有測試案例中保持正確的符號，並且在定量上更接近。AdsMind因此同時提供可靠性、自我反思和可解釋性，支持更多基於DFT的自主化學工作流程。

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

2606.19145v1 by Till Richter, Niki Kilbertus

Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified; in contrast, flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks when the symbolic component is learned through sparse discovery, allowing the neural augmentation to overlap with symbolic structure. We introduce OrthoReg (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components, preventing symbolic structure from being absorbed by the neural residual. This yields a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.

摘要：動態系統是建模自然世界的基礎，但建模過程涉及持續的權衡：手動指定的機械模型在設計上是可解釋的，但往往過於簡化且規範不當；相對而言，靈活的數據驅動神經方法則缺乏物理洞察。混合建模旨在將預先指定或符號化的、基於物理的組件與靈活的神經網絡相結合，以兼得兩者之長。然而，一個關鍵挑戰是神經組件可能會重新學習機械部分，導致冗餘且不可解釋的模型，特別是當符號結構本身是從數據中發現時。現有基於標準 $L^2$ 正則化的方法依賴於一個投影論證，當符號組件通過稀疏發現學習時，該論證會失效，允許神經增強與符號結構重疊。我們提出了 OrthoReg（正交正則化），該方法直接懲罰符號和神經組件之間的重疊，防止符號結構被神經殘差吸收。這產生了一個互補的分解：符號部分捕捉庫所能表達的內容，而神經部分捕捉剩餘的內容。在具有部分庫不匹配的基準動態系統上，OrthoReg 改進了符號恢復和分佈外行為。
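
To make the idea of penalizing overlap between the two components concrete, here is a minimal sketch. It assumes the overlap is measured as the squared cosine between the symbolic and neural outputs evaluated on the same batch of states. That choice, the function name `overlap_penalty`, and the `eps` stabilizer are illustrative assumptions; the abstract does not specify the exact form of the OrthoReg penalty.

```python
def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between two output vectors (assumed penalty form).

    symbolic_out, neural_out: equal-length sequences of floats, e.g. the two
    components' contributions to dx/dt evaluated on the same batch of states.
    Returns a value in [0, 1]: 0 when the outputs are orthogonal, close to 1
    when one is a scalar multiple of the other.
    """
    if len(symbolic_out) != len(neural_out):
        raise ValueError("both components must be evaluated on the same batch")
    dot = sum(s * n for s, n in zip(symbolic_out, neural_out))
    norm_s = sum(s * s for s in symbolic_out)
    norm_n = sum(n * n for n in neural_out)
    return (dot * dot) / (norm_s * norm_n + eps)


def hybrid_loss(data_loss, symbolic_out, neural_out, weight=1.0):
    """Data-fit loss plus the weighted overlap penalty (hypothetical combination)."""
    return data_loss + weight * overlap_penalty(symbolic_out, neural_out)
```

In a real training loop the same expression would be written with a differentiable array library so that gradients flow into the neural component; the plain-Python version only shows the arithmetic.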

2606.19144v1 by Jingyi Zhou, Senlin Luo, Haofan Chen

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

摘要：目前的對話式人工智慧系統在語言生成、個性化和長期上下文互動方面取得了顯著進展。然而，大多數現有的方法通過孤立的組件如情感建模、記憶檢索或角色調整來建模社會行為，缺乏統一的框架來解釋穩定社會關係和社會智慧在長期人機互動中的出現。為了解決這個問題，我們提出了人機共演化動力學框架（HACD-H），這是一個將人機互動視為自組織社會認知系統的正式模型。HACD-H將情感適應、關係組織、社會記憶和個性一致性整合到一個統一的動力學框架中，並引入了包括多時間尺度社會認知、關係吸引子、信任盆地、發展階段轉變和社會認知能量動力學等原則。我們構建了一個包含約14,700次互動回合的對話數據集，並開發了一個以理論為驅動的實證評估框架。結果顯示社會認知中的時間持久性層次、穩定的關係吸引子、類相變的發展模式以及結構化的社會認知能量景觀。社會智慧與社會認知能量之間顯示出顯著的負相關（r = -0.391, p < 0.001），而互動軌跡隨著時間的推移表現出逐步的能量減少。這些發現表明，社會智慧是從長期的社會認知共演化中產生的，而不是孤立的對話能力。HACD-H為建模自適應人機社會互動和開發社會智能人工智慧系統提供了統一的理論基礎。

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

2606.19139v1 by Ramza Basharat, Muhammad Usman Ali

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

摘要：自動手寫文字識別（HTR）本質上是一項具有挑戰性的任務，當處理草寫字體時，其複雜性進一步增加。儘管對各種草寫字體已經做出了重大努力，但對烏爾都手寫文字識別（UHTR）的研究相對有限。這一研究滯後主要是由於其字體所帶來的獨特挑戰，以及基準數據集的稀缺和不可用性。因此，為了推進UHTR的研究，本研究提出了一個名為烏爾都Katib手寫數據集（UKHD）的專門真實數據集。據我們所知，這是第一個專門從歷史時期Katib所寫材料中策劃的離線烏爾都手寫文字行數據集。它涵蓋了Nastalique書法風格中各種平尖筆書寫變體。此外，還評估了不同基於CRNN的混合模型的有效性，以確定烏爾都Katib手寫識別（UKHR）的最佳架構。在分析的模型中，CNN-BGRU-CTC模型顯示出更穩健的性能，具有較低的字符錯誤率（CER）和單詞錯誤率（WER）。本研究旨在支持和鼓勵研究社群開發一個穩健的識別系統，以保護烏爾都手寫文學。

A Technical Taxonomy of LLM Agent Communication Protocols

2606.19135v1 by Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

摘要：隨著大型語言模型（LLMs）的進步，以及多代理系統旨在克服獨立代理的限制，穩健的通信協議正成為分佈式代理網絡的基本基礎設施。儘管如此，碎片化的協議格局帶來了顯著的互操作性挑戰。本研究開發了一個技術分類法來分類和分析LLM代理通信協議。遵循既定的迭代方法，我們定義了分類法的目的、元特徵和結束條件，然後對九個積極維護的開源協議進行了五次迭代，其中三次是從經驗到概念，兩次是從概念到經驗，這些協議具有可證明的採用情況。該分類法包含五個維度：對方、有效載荷、互動狀態、發現機制和模式靈活性。分類顯示出反覆出現的架構模式：所有抽樣的代理對代理協議都將混合有效載荷與會話狀態持久性結合；大多數協議支持多個預定義模式，並且有兩個在運行時協商模式，這表明了一種向模式靈活性發展的趨勢；去中心化發現仍然很少見。分析表明，短期內存在向統一代理對代理和代理對上下文（工具和數據）通信的協議的收斂壓力。然而，從長期來看，沒有單一協議能夠同時最大化多功能性、效率和可攜性。該領域更可能朝著聯邦化的分層協議堆棧發展。該框架指導協議選擇，並突出如隱私和政策執行等開放研究空白。

Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

2606.19133v1 by Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn, Kristian S. Thygesen, Mikkel N. Schmidt

Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed from lower levels of theory or rely on rotation-invariant scalar features, limiting their geometric expressiveness. We explore the use of equivariant graph neural networks for optical spectra prediction, adapting GotenNet to this task and evaluating it on multiple datasets including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and on predicting the static real permittivity, both of particular relevance for thin-film optics.

摘要：可擴展的光譜預測是高通量材料篩選在光電應用（如太陽能電池）中的一個關鍵組成部分。現有的替代模型是基於從較低理論層次計算的光譜進行訓練，或依賴於旋轉不變的標量特徵，這限制了它們的幾何表達能力。我們探索了使用等變圖神經網絡進行光譜預測，將 GotenNet 調整為此任務，並在多個數據集上進行評估，包括最近發表的 10,533 個結構的集合，這些結構的光譜是基於隨機相位近似（RPA）計算的。所提出的模型在當前的最先進技術中表現優越，尤其是在 0-8 eV 範圍內以及預測靜態實部許可率方面，這兩者對於薄膜光學特別相關。

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

2606.19121v1 by Hui Zhang, Shuren Song

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

摘要：當前針對長期 LLM 合作中概念漂移的工程直覺是用更正式的約束來換取更可靠的輸出：設計符號識別系統、在系統提示中累積防禦規則、擴展上下文窗口。我們的工程紀錄顯示，在長期設定中，這一方向可能會產生與設計意圖相反的效果。我們採用行動研究方法，在一個持續約一個月且涵蓋 391 次合作會話的實際軟體專案（Bang-v3）中，記錄並分析了這些策略的失敗過程。當符號系統超過複雜性閾值時，LLM 並不會變得更準確；相反，它們放棄了對商業語義的真正理解，退回到符號層內的自我參照推理，並生成看似內部一致但與現實物理上脫節的輸出。我們將這種失敗模式命名為「索引病」（Index Sickness），其典型表現為「幻影立法」（Phantom Legislation）。我們將其背後的原則命名為「彭原則（語義活力法則）」：承載明確目的的自然語言傳達的資訊品質遠高於符號表達。基於此，我們設計並驗證了其物理工程機制：「基線-日誌物理分離」（Baseline-Log Physical Separation）。在同一專案中，這一機制將 AI 指令的數量減少了約 75%，並且在隨後的約 150 次會話中未觀察到索引病的復發。附有雙語伴隨版本（中文）作為補充材料。

Analysing drivers and interdependencies in European electricity markets using XAI

2606.19118v1 by Antoine Pesenti, Aidan O'Sullivan

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.

摘要：電力市場本質上是複雜的系統，特徵是強非線性、高維互動和各地區之間日益增強的相互依賴性。雖然深度神經網絡（DNN）在電力價格的預測能力上表現出色，但其缺乏可解釋性限制了其對理解價格形成的基本驅動因素的實用性。本文通過將DNN模型與可解釋的人工智慧（XAI）技術相結合，來分析39個歐洲競標區域的電力價格決定因素，填補了這一空白。我們使用SHAP（SHapley Additive exPlanations）來量化特徵貢獻，並應用及擴展SSHAP，這是一個聚合框架，用於提高高維設置中的可解釋性。分析顯示，儘管可再生能源，特別是太陽能，在總發電量中的比例較低，但在價格形成中扮演著不成比例的重要角色。天然氣價格仍然是電力市場中的主導和一致驅動因素，而互聯網絡則顯著影響價格動態，突顯了歐洲電力系統之間的強相互依賴性。此外，構建了一個合成的EU範圍內電力市場，以探索完全整合市場的反事實情境，並且只有一個價格。

Towards an Agent-First Web: Redesigning the Web for AI Agents

2606.19116v1 by Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

摘要：全球資訊網建立在一個持續三十年的假設上：網路內容的主要消費者是人類。這一假設滲透到每一層；其訪問模型假定有人的訪客，其經濟學依賴於人類的注意力，而其內容則針對人類的感知。人工智慧代理作為人類與網路內容之間的中介的快速出現使這一假設失效。然而，網路通過全面封鎖、基於 CAPTCHA 的排除以及將代理訪問視為提取而非合法互動的經濟模型來抵抗代理。

本文提出在三個層面上進行原則性的重新設計。在訪問層，代表人類行動的代理應該繼承相應的訪問權限，這些權限由 HTTP 請求中的速率限制和代理識別元數據管理，類似於瀏覽器標頭，並且採用雙層架構，從同一域提供人類可讀和代理優化的內容。在經濟層面，我們提出一個基於意圖的層級框架，這一框架以代理作為人類代理的原則為基礎：代理的經濟責任反映其所代表的人類的責任。一種基於代幣的訂閱模型以代幣而非頁面瀏覽量來計量內容，並且一個委託內容經濟將 AI 內容生產與人類意圖相連接。在內容層，我們識別出認識論的遞歸，即 AI 生成的內容被代理消費以產生進一步的內容，逐步使網路知識脫離人類的真實基礎。我們提出了代理文本標記語言（ATML）、一種四層人類監督層級模型，以及一個加密來源鏈，以應對這一威脅。

這些共同構成了十項設計原則，旨在打造一個以代理為先的互聯網，在這個互聯網中，代理是第一公民，其整合需要重新協商網路的基礎社會契約，涵蓋訪問、經濟和內容。

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

2606.19111v1 by Haewoon Kwak

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

摘要：團隊科學認為領導力是有條件的：它僅在特定條件下有效，而有能力的自主團隊可能根本不需要領導力。我們對多代理 LLM 團隊提出類似的問題：在什麼可測量的條件下，過程層級的協調控制能增加價值，這些條件是否與團隊科學的預測相符？我們使用行為特徵（多數鎖定、探索、從不正確的第 0 輪共識中恢復）和每個行動的消融，因為每個控制器都是明確的行動集，而不是單一的提示。我們將三種經典的領導風格（交易型、轉型型、情境型）作為對共享行動詞彙（探索、修訂、接受、綜合）的控制器。一個匹配的控制器具有相同的行動但任意規則，其表現不比多數投票更好，因此是理論推導的規則，而不是詞彙，發揮了作用。在四種任務範疇和三個開放權重模型家族中，沒有控制器在準確性上佔優勢，正如條件觀所預測的：交易控制在所有 12 個（模型、範疇）組合中與共享的第 0 輪投票相匹配，誤差在 1.3 個百分點以內，並且只有在第 0 輪多數不可靠的組合中出現增益（llama-4-scout 社交；情境型比平坦型多 8 個百分點）。一個恢復優勢的解釋，通過四個邊界探針進行測試，表示控制器僅在第 0 輪多數不可靠、任務可恢復且無導向互動未能修復的情況下，勝過普通互動。這些區域映射到條件理論（領導替代品、路徑目標冗餘、情境準備差距），因此大體上無效的準確性結果是理論的預測，而不是控制器的失敗。我們將過程層級的協調控制視為一個需要測量和理論映射的條件，而不是一個需要超越的排行榜。

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

2606.19103v1 by Mukund Khanna, Raj Singh Yadav, Kunal Singh

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models on OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality; the Qwen-Image-Edit-2511 model achieves a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md

摘要：最近在基於指令的圖像編輯方面的進展，使得模型能夠根據自然語言指令執行複雜的視覺編輯。然而，在以產品為中心的場景中，保留產品特徵、品牌和文本元素至關重要，目前的開源和閉源模型往往難以維持這種細緻的物體身份。這一問題因缺乏符合文本保真約束的基於指令的產品圖像編輯數據集而進一步加劇，這使得它在很大程度上被視為基於指令的圖像編輯模型的隱含能力。在本研究中，我們介紹了ProductConsistency數據集，旨在改善以產品為中心的圖像編輯。我們的方法包括一個包含87,000個樣本的監督微調（SFT）數據集，用於產品編輯，一個包含869個獨特產品圖像的強化學習（RL）數據集，以及一個新的基準數據集——ProductConsistency Benchmark，以便對編輯模型進行嚴格和標準化的評估。為了指導RL訓練，我們提出了一種循環一致性獎勵，通過使用原始產品描述和從編輯圖像生成的標題之間的相似性來強化產品身份的語義保留。我們使用我們的數據集對Qwen-Image-Edit-2511和Flux.1-Kontext-dev進行微調，並在OCR和感知指標以及基於MLLM的評估中顯示出相對於基線模型的一致改進，這表明產品一致性、文本渲染和整體視覺質量更強；其中Qwen-Image-Edit-2511模型實現了字符錯誤率的5倍降低。代碼和流程可在https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md獲得。
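
The Cyclic Consistency reward described in this abstract compares the original product description with a caption generated from the edited image. The sketch below shows only the comparison step, using Jaccard overlap of word tokens as a stand-in similarity. The captioning model is out of scope here, and the function names and the choice of Jaccard similarity are assumptions rather than the authors' implementation.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cyclic_consistency_reward(original_description, edited_caption):
    """Reward in [0, 1]: token-set Jaccard similarity of the two texts.

    original_description: text describing the product before editing.
    edited_caption: caption produced (by some captioner, not shown) from the
    edited image. A higher value means more of the product identity survived.
    """
    a = tokenize(original_description)
    b = tokenize(edited_caption)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)
```

A learned text-embedding similarity would be a natural replacement for the Jaccard score; the structure of the reward (describe, edit, re-caption, compare) stays the same.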

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

2606.19079v1 by Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.

摘要：隨著參數高效微調（PEFT）的日益普及，模型生態系統中出現了單一主幹與多個任務專用適配器的配對。在這種情況下，推理時的查詢通常在沒有任務標籤的情況下到達，這要求系統自動從不斷增長且異質的適配器池中選擇最合適的適配器。現有的路由方法要麼依賴於對適配器內部的訪問，例如權重分解或基於梯度的統計，要麼需要額外的路由器訓練，這限制了在添加新適配器時的可擴展性和可移植性。我們介紹了ARIADNE，一種無需訓練、與適配器無關的動態適配器選擇路由框架，適用於推理時。ARIADNE通過從其訓練集的嵌入計算的一組質心來表示每個適配器，捕捉與該適配器相關的數據分佈。給定一個未標記的輸入，它通過測量在潛在空間中與這些質心的接近度來選擇適配器。由於路由完全在輸入嵌入空間中進行，ARIADNE與任意PEFT方法兼容，且不需要對適配器或訓練程序進行修改。主要在23個多樣的NLP任務上使用Llama 3.2 1B Instruct進行評估，ARIADNE恢復了97.44%的上限性能。在擴展到44個任務時，它實現了89.7%的平均選擇準確率，無需額外訓練或訪問適配器內部。
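
The routing idea in the ARIADNE abstract (represent each adapter by centroids of its training-set embeddings, then pick the adapter nearest to the query) can be illustrated with a small sketch. This is not the authors' implementation: the single mean centroid per adapter, the Euclidean distance, the toy 2-D embeddings, and the function names `build_centroids` and `route` are all assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_centroids(adapter_embeddings):
    """Map each adapter name to a list of centroids.

    Assumption: one mean centroid per adapter; the abstract only says
    'a set of centroids' and does not specify how they are computed.
    """
    return {name: [mean_vector(vecs)] for name, vecs in adapter_embeddings.items()}


def route(query, centroids):
    """Return the adapter whose nearest centroid is closest to the query.

    Assumption: Euclidean distance in the embedding space.
    """
    best_name, best_dist = None, math.inf
    for name, adapter_centroids in centroids.items():
        dist = min(math.dist(query, c) for c in adapter_centroids)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Toy 2-D "embeddings" standing in for real encoder outputs (hypothetical data).
toy_embeddings = {
    "sentiment": [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]],
    "summarization": [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]],
}
centroids = build_centroids(toy_embeddings)
selected = route([0.7, 0.3], centroids)
print(selected)
```

Because the routing step only touches input embeddings, adding an adapter in this sketch means adding one more entry to the centroid dictionary; no router is retrained.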

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

2606.19051v1 by Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy that partitions the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

摘要：研究方法是學術論文中知識貢獻的重要載體。研究方法的自動多標籤分類可以支持知識服務，如方法檢索、評論生成和研究智能分析。雖然現有研究主要依賴標題和摘要，但摘要通常只提供有限的方法論信息，而利用全文內容則面臨過長和信息冗餘的挑戰。因此，本文提出了一種通過根據其物理位置劃分全文內容的段落組合策略。使用來自三本代表性圖書館與信息科學期刊（JASIST、LISR 和 JDoc）的 1,954 篇全文文章的註釋語料庫，我們評估了各種段落及其組合在多個模型中的分類性能。實驗結果表明，方法論信息在全文內容中分佈不均，中後段和最後段表現出更強的區分能力。此外，將書目元數據與跨段組合策略整合有效提升了分類性能。

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

2606.19042v1 by Xhevahire Tërnava

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability bound at compile time or runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, dead-code-free binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.

摘要：在氛圍編碼這一新興的 AI 驅動範式中，一個 LLM 從自然語言提示生成整個程序，但傳統軟體工程精心構建到代碼中的變異性會發生什麼呢？為了回答這個問題，我們對 10 個氛圍編碼的 C/C++ 項目進行了探索性分析，結果顯示在工件內幾乎沒有變異性，即在編譯和運行時。所有變異性決策都在單一的新綁定時間解決，即生成時間，也就是 LLM 產生源代碼的那一刻。我們並不將此視為需要修復的缺陷，而是提出了再生變異性（Variability by Regeneration，VbR），據我們所知，這是第一個產品線方法，其中 LLM 作為推導引擎，根據聲明性規範為每個變體生成一個專門構建、沒有死代碼的二進位檔，而變體調度器則透明地將用戶請求路由到匹配的二進位檔。我們對 VbR 進行了形式化，並將其與傳統的 SPL 推導進行對比，並在 wc 產品系列上展示了其完整的管道。對於 SPL 工程而言，AI 生成軟體中的變異性應該在規範中，而不是在代碼中。

A Hybrid LSTM-Vision Transformer Architecture for Predicting HRRR Forecast Errors

2606.19026v1 by David Aaron Evans, Jay C. Rothenberger, Kara J. Sulia, Nick P. Bassill, Chris D. Thorncroft

Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

摘要：高解析度數值天氣預報（NWP）系統中的預報誤差通常與未解決的行星邊界層（PBL）過程、對流、地形引起的環流以及其他垂直結構的大氣現象有關。先前的研究顯示，長短期記憶（LSTM）網絡可以成功預測高解析度快速刷新（HRRR）模型中的預報誤差，使用的是中尺度網絡觀測數據，但我們認為性能下降與複雜的垂直大氣演變期間有關。為了解決這一限制，我們開發了一個混合LSTM-視覺Transformer（LSTM-ViT）框架，該框架將來自地面觀測的時間序列學習與來自紐約州中尺度剖面網的氣象剖面相結合。LSTM-ViT框架被訓練以預測HRRR每小時降水、10米風速和2米溫度的預報誤差，針對各個中尺度站。對於所有三個預測因子，納入剖面導出的氣象結構相較於基線LSTM架構提高了預報誤差預測技能，最大的增益發生在較短的預報提前期和增強的PBL活動期間。降水預報誤差的改善尤為明顯，LSTM-ViT框架相較於基線LSTM實現了約兩倍的預測技能提升，同時更好地捕捉了對流驅動的誤差演變並減少了與PBL過程相關的降解。這些結果表明，將時間序列學習與垂直信息的注意機制相結合，為改善運營NWP系統中的預報誤差預測提供了一條物理上有意義的途徑。我們的研究為預報員提供了有關模型偏差和預報信心的增強指導。

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

2606.19025v1 by Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoEs) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

摘要：預訓練大型語言模型（LLMs）通常需要大規模的基礎設施，並且硬體加速器緊密耦合。雖然增加模型和數據集的規模仍然是性能的主要驅動因素，但混合專家（MoEs）架構最近通過將參數數量與計算成本解耦，實現了最先進的結果。這種效率使得在受限的計算預算上訓練大型模型成為可能，但通常需要單一數據中心的高速互連。為了克服這些物理限制，最近的研究方法如DiLoCo和Photon使用低通信數據並行方法來實現跨地理分佈、弱連接數據中心的擴展。然而，這些方法存在根本性的低效：它們要求每個站點都有完整的模型副本，這會帶來禁止性的內存限制和通信開銷。在這項工作中，我們介紹了FoMoE，一個通過在工作者之間分區專家層來打破完整副本範式的系統。我們展示了FoMoE：（I）在研究的範疇內，通過部分專家複製將通信成本降低高達1.42倍，相較於高效基準和DDP降低高達45.44倍；（II）通過一種新穎的跳過令牌機制實現高達1.4倍的經驗吞吐量加速；以及（III）在訓練的代理範疇中顯示穩定的路由，並通過系統建模將通信/內存效益預測到100B規模的配置。

Sumi: Open Uniform Diffusion Language Model from Scratch

2606.19005v1 by Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

摘要：擴散模型已成為自回歸模型的一個有前途的替代方案。在這些模型中，均勻擴散語言模型（UDLMs）原則上允許在任何步驟更新任何標記，從而實現更靈活的生成。然而，目前尚未有任何UDLM從零開始在大參數規模和大標記預算下進行預訓練。自回歸建模和遮罩擴散建模已經有可用的模型在規模上供社群研究和構建；而均勻擴散則沒有。大規模的從零開始預訓練的UDLM將提供一個乾淨的參考點，以研究擴展行為、生成動態、可控性，以及與已建立的自回歸和遮罩擴散模型之間的權衡。為此，我們介紹Sumi（在日語中意為“墨水”），這是一個完全開放的7B均勻擴散語言模型，從零開始在1.5T標記上進行預訓練。Sumi在知識、推理和編碼基準上與在相似標記預算下訓練的自回歸模型表現競爭，但在常識基準上表現不佳，其中我們以教育為重的數據混合可能是主要原因。我們釋出我們的模型權重、檢查點和完整的訓練配方，包括對公開可用語料庫的數據混合的完整規範。我們希望這次釋出能使社群能夠在大規模上研究原生均勻擴散，並促進對其尚未充分理解的方面的研究。

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

2606.19004v1 by Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4x faster than baselines, reducing total cost by 1.4-6.4x while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution 512x512 and 1280x1280.

摘要：強化學習（RL）在擴散Transformer（DiTs）後訓練中的應用成本極高，需要數千個高端GPU。現有的研究探索了兩個方向來降低成本：種子探索通過選擇高對比度樣本來改善訓練收斂，但這會增加關鍵路徑的計算；而臨時GPU提供69--77\%的成本降低，但在訓練期間閒置，因為DiT的展開幾乎同時完成，這妨礙了類似大型語言模型（LLM）的展開與訓練的流水線作業。臨時中斷進一步打破了序列並行（SP）組，造成GPU拓撲的碎片化。
我們提出了Spotlight，第一個利用臨時GPU進行DiT RL後訓練的系統。Spotlight基於我們設計的兩個關鍵見解：（1）我們顯示探索可以容忍過時的模型權重，因為使用前一迭代的模型權重進行的探索保持了隨機種子的相對排名，允許探索在訓練期間在閒置的臨時GPU上運行。（2）SP重配置可以重用節點上的狀態，將組恢復的時間從幾分鐘縮短到亞秒級啟動。基於這些見解，Spotlight引入了三種技術：一種基於賭徒的探索規劃器，最大化訓練時間預算內的獎勵方差；一種彈性序列並行技術，通過持久調度器和節點內權重複製即時重配置SP組；以及一種考慮中斷的基於請求的拉取調度器，平衡負載並在中斷時提交正在進行的狀態。我們在開源RL平台ROLL上實現了Spotlight，並在Qwen-Image後訓練中進行評估。Spotlight以比基準快$4\times$的速度達到相同的目標驗證分數，將總成本降低$1.4$-$6.4\times$，同時在DeepSeek-OCR和Geneval數據集上實現了優越的圖像質量，分辨率為$512\times512$和$1280\times1280$。

Enhancing Multilingual Reasoning via Steerable Model Merging

2606.19002v1 by Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

摘要：模型合併是一種有效的技術，用於組合多語言模型和推理模型的能力。它通過對齊不同模型的特徵空間，在多語言推理任務中實現了令人鼓舞的泛化效果。然而，合併後的單一模型往往無法解決源模型之間的衝突，導致次優的性能。換句話說，通用的合併策略可能無法與不同輸入的特徵對齊，這可能需要優先考慮某些模型而非其他模型。為此，我們提出了一個可調整的模型合併（ST-Merge）框架，以調節每個源模型的貢獻。為了實現這一思想，我們引入了一種門控交叉注意機制，以自適應的方式對兩個被關注的源模型進行加權或過濾。大量實驗表明，ST-Merge在21種不同語言的四個多語言推理基準上，始終優於多個強基準。

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

2606.18996v1 by Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.

摘要：代理人越來越多地被部署在文件密集型工作流程中，在這些工作流程中，敏感的私人信息不是邊緣案例，而是日常輸入，例如，代理人預訂航班需要護照號碼。在這種情況下，代理人必須使用私人信息來準確完成任務，同時在其回應中永遠不暴露這些信息，因為它無法驗證誰實際在鍵盤上。這兩項義務之間存在根本的緊張關係。一個足夠強大的模型能夠使用私人信息來完成任務，但同樣的能力也可能使其透露這些信息。為了評估任務準確性和隱私洩漏之間的權衡，我們引入了任務完成和抵抗主動隱私提取（TRAP）。每個場景都包含一份包含私人信息的文件、一個要求代理人使用私人字段調用正確工具的任務查詢，以及一個試圖以自然語言引出相同信息的攻擊查詢。我們評估了22個涵蓋前沿專有模型和多種規模的開源模型，發現所有模型系列都表現出非平凡的洩漏，並且遵循指令的能力與洩漏率相關。現有的基於提示的防禦措施減少了洩漏，但對任務準確性造成了重大損失。提示優化未能逃避這一權衡。我們證明這一失敗並非偶然。對於任何基於softmax的模型，沒有任何軟約束防禦，例如基於提示的防禦，能夠同時實現高任務成功率和零洩漏概率。受到這一不可能結果的激勵，我們提出結構性私人字段隔離，該方法在私人字段到達模型之前用哈希鍵替換它們。這種方法在保持任務準確性的同時，大大防止了洩漏。
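
The structural private field isolation described in the TRAP abstract (private fields are replaced with hash keys before they reach the model) can be sketched as follows. This is an illustrative reading only, not the paper's implementation: the key format `<PRIV:...>`, the salt, the `isolate` and `resolve` helpers, and the toy document are all hypothetical.

```python
import hashlib


def isolate(document, private_fields, salt="demo-salt"):
    """Replace private field values with opaque keys.

    Returns the redacted document (safe to show to the model) and a vault
    mapping each key back to the real value (kept outside the model).
    """
    vault = {}
    redacted = dict(document)
    for field in private_fields:
        if field in document:
            raw = (salt + field + str(document[field])).encode()
            key = "<PRIV:" + hashlib.sha256(raw).hexdigest()[:12] + ">"
            vault[key] = document[field]
            redacted[field] = key
    return redacted, vault


def resolve(tool_args, vault):
    """Swap keys back to real values just before the tool is invoked."""
    return {name: vault.get(value, value) for name, value in tool_args.items()}


doc = {"name": "A. Traveler", "passport_number": "X1234567", "destination": "Lisbon"}
redacted, vault = isolate(doc, ["passport_number"])

# Hypothetical model output: the model only ever sees and emits the opaque key.
model_tool_args = {"passport": redacted["passport_number"], "to": redacted["destination"]}
real_args = resolve(model_tool_args, vault)
print(redacted)
print(real_args)
```

In this sketch the model can still complete the tool call, because the key is resolved outside the model, but an attack query can at most elicit the key, since the real value never enters the prompt.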

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

2606.18989v1 by Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

摘要：成語在不同語言之間轉換困難，因為它們不具組成性且表面形式的基礎較弱，使得字面映射不可靠。我們提出了 G-IdiomAlign，一個以詞彙為樞紐的基準，每個成語都由來自 Wiktionary 的英文詞彙作為支撐。我們進一步構建了一個高信心的參考對齊集，以便於可重複的評估。G-IdiomAlign 支持兩種協議：（1）一種受控的多選成語等價，帶有類型干擾項以便於錯誤歸因；以及（2）一種詞彙對比生成，對比無詞彙和有詞彙的輸入，以隔離明確語義樞紐的效果。在多種大型語言模型中，對字面翻譯的偏見是一種主要的失敗模式，特別是當目標是低資源語言時。在基於嵌入的語義代理下，詞彙在詞彙對比生成中始終能改善表現，但性能仍然適中，顯示出開放輸出空間中有相當大的提升空間。對 Qwen3-8B 的後續分析進一步表明，跨條件差異更多集中在注意力頭而非層次中，而更好的有詞彙生成與更強的詞彙支撐相吻合。

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

2606.18988v1 by Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

摘要：多模態欺騙檢測對於識別欺詐意圖至關重要，但現有的方法主要依賴於端到端的黑箱範式。這些方法缺乏可解釋性，未能提供透明的推理過程，並且難以明確捕捉到欺騙行為中固有的微妙跨模態不一致性。為了超越這些限制，我們提出了ThinkDeception，一個新穎且可解釋的多模態欺騙檢測框架。作為一項開創性工作，它將多模態大型語言模型（MLLMs）引入這一領域，將欺騙檢測從傳統的二元分類任務轉變為明確的認知推理過程。在首個經過精心註釋的逐步多模態思維鏈（CoT）數據集的支持下，我們開發了一個基礎模型ThinkDeception Base，實證驗證了模態不一致性在解碼欺騙中的關鍵作用。在此基礎上，我們的核心創新在於提出了視覺-音頻一致性群體相對策略優化（VAC--GRPO），並配備了漸進式訓練策略。與標準的GRPO不同，我們將訓練數據分為四個漸進的難度層次，引導模型通過心理學基礎的由易到難的認知過渡。通過創新性地將這一動態課程調度器與多維度、過程感知的獎勵機制和反思學習範式相結合，我們顯著提升了模型的整體推理質量。在主流基準上的廣泛實驗表明，ThinkDeception建立了新的SOTA，在檢測準確性和推理質量上顯著超越現有方法。最終，這項工作成功地推動了欺騙檢測領域朝向可解釋的多模態認知推理發展。

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

2606.18986v1 by Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in a single granularity that breaks patterns and hides exact timesteps; they also rely on a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

摘要：最近在大型語言模型（LLMs）方面的進展催生了時間序列問答（TSQA），將時間序列分析表述為自然語言問答。然而，直接將原始數值序列輸入LLMs會遭遇標記化瓶頸：字節對編碼將連續值分割為不穩定的標記，其嵌入缺乏有意義的度量結構，導致幅度、尺度和趨勢信息的喪失。先前的方法使用基於補丁的編碼器，將序列拆分為固定窗口，鎖定一種粒度，這會破壞模式並隱藏確切的時間步，通過一個很少在不同長度或取樣率的數據集之間轉移的單獨模塊。為了解決這一挑戰，我們提出了CADE（對比對齊與直接嵌入），這是一個基於兩個關鍵組件的TSQA新框架：直接時間步嵌入和語義對齊。所提出的框架通過逐點線性編碼器和MLP投影器將每個時間步直接映射到LLM嵌入空間，保留精確的索引級別訪問，同時消除補丁和填充的需求。為了進一步縮小時間序列與語言表示之間的語義差距，我們引入了一種新型的單向監督對比損失，將時間序列嵌入與固定的類別名稱文本錨點對齊。在公共Time-MQA基準上的實驗結果表明，我們的框架在六個TSQA任務中始終提高了性能，超越了開源和專有LLM基準。

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

2606.18979v1 by Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

摘要：早期檢測認知障礙依賴神經心理測試，通過評估多個認知領域來最小化主觀性。基於語音的評估可以支持診斷並改善可及性，但轉錄錯誤和省略非語言子測試（例如，運動技能）限制了準確性。超越傳統測試分數，語音衍生的特徵可以提供對認知狀態的額外見解。本研究調查德國的「Syndrom-Kurz-Test」的語音評估，這是一個標準化的癡呆篩查測試，包含口語和運動子測試。我們訓練模型，整合每個口語子測試的轉錄衍生分數和Whisper嵌入，以減少計分錯誤。為了彌補缺失的運動子測試，我們然後利用這些融合的表示來近似專家的整體評分。儘管省略了子測試，我們的模型與專家評分強烈相關，並有效且準確地區分認知狀態組。

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

2606.18976v1 by Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.

摘要：自動化評估在軟體工程教育中已經在程式碼評分和論文評分方面取得了顯著進展。然而，對於軟體架構交付物的審查，這需要分析結構完整性和需求可追溯性，尚未完全自動化。將大型語言模型（LLMs）應用於此任務需要穩健的架構，以確保技術反饋對學生來說準確且可靠。本文介紹了CAPRA（可配置架構能力報告評估），這是一個多代理LLM系統，分析軟體架構交付物以生成個性化的、符合模板的LaTeX反饋。作為核心設計選擇，CAPRA協調多個專門代理，並採用基於Python的微服務進行多模態文檔提取，利用PyMuPDF和具視覺能力的LLMs（具體為gpt-4o）來解析文本和UML圖。為了確保教育可靠性並減少幻覺，CAPRA引入了一個確定性的證據錨定步驟，通過標準化的Levenshtein距離進行模糊匹配，以及一個ConsistencyManager代理，該代理交叉驗證、去重和合併發現。系統性能使用一個結構化的八項標準二元評估分類法進行評估，涵蓋：（i）提取完整性，(ii) 特徵驗證，(iii) 問題基礎和嚴重性檢測，(iv) 建議的具體性和可追溯性，以及 (v) 模板和語調的合規性。對10份學生報告的初步實證評估顯示，CAPRA在嚴格的雙評審聚合規則下滿足了88.8%的評估標準，與人類評估者達成了中等的評審一致性（kappa = 0.582），並在略超過4分鐘內處理每份報告。雖然這些結果支持LLM支持的架構反饋的可行性，但人類監督對於主觀評估維度仍然至關重要。
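
The Evidence Anchoring step in the CAPRA abstract relies on fuzzy matching via normalized Levenshtein distance. A minimal sketch of that matching primitive is shown below; the `anchor` helper, the 0.8 threshold, the lowercasing, and the example sentences are assumptions for illustration and are not taken from the paper.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]: 1 minus distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def anchor(quote, source_sentences, threshold=0.8):
    """Return the best-matching source sentence, or None if below threshold.

    Assumption: a finding is kept only when its quoted evidence can be
    matched to the extracted document text; the threshold is hypothetical.
    """
    best = max(source_sentences, key=lambda s: similarity(quote.lower(), s.lower()))
    if similarity(quote.lower(), best.lower()) >= threshold:
        return best
    return None


sentences = [
    "The system uses a layered architecture.",
    "Requirements are traced to components.",
]
print(anchor("The system uses a layerd architecture.", sentences))
print(anchor("Completely unrelated claim", sentences))
```

The check is deterministic, so the same quote and source text always yield the same decision, which is the property the abstract attributes to the anchoring step.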

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

2606.18970v1 by Syed Mujtaba Haider, Silvia Figini

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

摘要：醫學影像分類常受到有限標記數據的限制，這促使了生成增強的需求；最近，為此目的提出了量子生成模型，並經常報告準確性提升。然而，這些說法通常基於單次訓練運行，未能匹配量子和經典生成器的參數預算，且未能描述任何好處出現的數據範疇。我們提出了一個受控基準，隔離量子生成器對腦部MRI增強的貢獻。影像被編碼進入一個KL正則化的潛在空間，在該空間中，使用變分量子生成器或參數數量幾乎相同的經典生成器（1648對1632）訓練一個帶有梯度懲罰的條件Wasserstein GAN。合成樣本被解碼並用於增強一個預訓練的分類器，涵蓋從5%到100%的標記數據比例，並在八個隨機種子上進行評估，使用配對顯著性測試（帶有多重比較修正）以及內部集多樣性和潛在分佈分析。在所有比例中，沒有增強變體顯著超越僅使用真實數據的訓練，且量子和經典生成器在統計上無法區分。任何低數據的好處表現為正則化，而非忠實的數據擴展：合成樣本在數據稀缺的地方偏離分佈並嚴重模式崩潰，且量子生成器的多樣性不比其經典對應物更高。我們釋放該協議作為醫學影像中量子生成增強嚴格評估的測試平台。
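
The benchmark abstract mentions paired significance testing with multiple-comparison correction but does not name the correction. As one common choice, the Holm step-down procedure can be sketched as below; using Holm here is an assumption, and the p-values are made-up toy numbers, not results from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, one per input p-value, indicating which
    null hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# Toy p-values for three hypothetical augmentation-vs-baseline comparisons.
toy_p = [0.01, 0.04, 0.03]
decisions = holm_reject(toy_p)
print(decisions)
```

With several data fractions and augmentation variants compared against the same baseline, a correction of this kind guards against declaring a gain significant purely because many comparisons were run.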

GraphPO: Graph-based Policy Optimization for Reasoning Models

2606.18954v1 by Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

摘要：強化學習與可驗證獎勵（RLVR）已成為提升大型推理模型能力的標準範式。RLVR 通常獨立抽樣回應並利用最終答案來優化策略。這一範式有兩個限制。首先，獨立的回應往往包含相似的中間推理步驟，導致冗餘探索和計算浪費。其次，稀疏的最終答案獎勵使得識別有用步驟變得困難。基於樹的方法部分解決了這個問題，通過共享前綴並比較來自同一前綴的分支來提供細粒度的信號。然而，樹的分支仍然是獨立擴展的。當不同的分支達到相似的推理狀態時，它們無法共享信息並重複相似的探索。此外，基於樹的方法忽略了這種分散，只在不同的分支內進行局部比較，這可能導致優勢估計的方差增高。為了解決這一挑戰，我們提出了 GraphPO（基於圖的策略優化），這是一個新穎的強化學習框架，將回合表示為有向無環圖，推理步驟作為邊，從推理路徑總結的語義狀態作為節點。GraphPO 將語義上等價的推理路徑合併為等價類，允許它們共享後綴，並將預算從冗餘擴展重新分配到多樣化探索上。此外，我們將效率優勢分配給進入邊，將正確性優勢分配給輸出邊，從而在從結果中推導過程監督的同時提高推理效率。理論表明，GraphPO 減少了優勢估計的方差並增強了推理效率。在三個 LLM 的推理和代理搜索基準上進行的實驗顯示，GraphPO 在相同的標記預算或回應預算下，始終優於基於鏈和樹的基準。
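
The graph representation described in this abstract (reasoning steps as edges, semantic states as nodes, equivalent paths merged) can be pictured with a toy sketch. This is not the authors' implementation: the `state_key` strings are hypothetical stand-ins for whatever semantic summary GraphPO computes, and the sketch neither enforces acyclicity nor computes any advantages.

```python
from collections import defaultdict

def build_rollout_graph(rollouts, root="ROOT"):
    """Merge rollouts into a graph keyed by semantic state.

    Each rollout is a list of (step_text, state_key) pairs; state_key is a
    hypothetical label for the semantic state reached after that step.
    """
    edges = defaultdict(set)  # (src_state, dst_state) -> step texts seen on that edge
    for rollout in rollouts:
        src = root
        for step_text, state_key in rollout:
            edges[(src, state_key)].add(step_text)
            src = state_key
    nodes = {root} | {dst for _, dst in edges}
    return nodes, dict(edges)

# Three made-up rollouts; two reach the same "factored" state by different wording.
rollouts = [
    [("factor x^2 - 1", "factored"), ("solve each factor", "roots")],
    [("use difference of squares", "factored"), ("solve each factor", "roots")],
    [("try x = 2", "guess"), ("factor instead", "factored")],
]
nodes, edges = build_rollout_graph(rollouts)
print(len(nodes), "states;", len(edges), "distinct edges")
```

Once paths meet at a shared node, any continuation from that node is available to all of them, which is the suffix sharing the abstract refers to.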

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

2606.18950v1 by San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination or multi-agent coordination, and when task scale increases.

摘要：現代視覺-語言模型 (VLMs) 在戰略推理方面經常遇到困難，即在競爭和合作環境中，預測和影響其他代理的行動，尤其是在不確定性下。即時戰略 (RTS) 遊戲可以成為診斷這一限制的自然測試平台，因為它們要求與盟友協調、適應對手的策略，並在部分可觀察性下進行長期規劃。然而，現有的 RTS 基準提供的評估範圍有限，缺乏系統性的能力診斷，並且在預設的場景覆蓋範圍內保持固定。為了解決這些限制，我們提出了 RTSGameBench，它基於《超越所有理性》，這是一款大型 RTS 遊戲，擁有擴展的戰場，要求比現有測試平台更廣泛的策略多樣性。該基準通過各種對戰結構的多樣化遊戲玩法提供評估，通過迷你遊戲進行診斷評估，每個迷你遊戲針對個別的戰略能力，並通過自我演變的生成框架提供可擴展的覆蓋，將自由形式的查詢轉換為新的迷你遊戲，並在後續循環中不斷改進。此外，為了使 VLMs 能夠在大型 RTS 遊戲中運行，我們提供了 RTSGameAgent，它通過具有代理記憶的有限狀態機 (FSM) 來管理單位。我們實證驗證了多個最先進的 VLMs 在對戰需要更緊密的協調、多代理協調以及任務規模增加時表現不佳。

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

2606.18947v1 by Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

摘要：生產 LLM 代理越來越依賴即時搜索，但本地搜索基礎將檢索策略、供應商選擇、證據注入、成本、延遲和生成行為捆綁在單一模型供應商邊界之下。這種耦合使得基礎難以檢查、調整、重用或移植，並可能觸發搜索引起的冗長性，破壞嚴格的輸出合約。我們提出了解耦搜索基礎 (DSG)，這是一個供應商無關的邊界，通過與 MCP 兼容的網關將基礎移出推理模型，暴露供應商路由、源感知上下文渲染、配置的後備、檢索深度控制，以及精確和語義緩存作為一級控制。在 SimpleQA、FreshQA 和 HotpotQA 上的五個前沿模型中，本地搜索在對時效敏感的 FreshQA 上表現優越，但 DSG 在控制重要時展現出更強的前沿：在 SimpleQA 上，它的準確率幾乎與本地相當 (86.1% 對 87.7%)，搜索成本降低 91%，保持簡潔的答案合約，並達到 99.4% 的熱緩存命中率，延遲降低 68%。作為大型代理工作負載的共享生產基礎層，DSG 在電子商務查詢理解 (QIU) 工作負載上匹配或稍微超過本地搜索的準確率，同時將搜索成本降低超過 98%。即時基礎最好被視為一個可優化的接口邊界，而不是固定的模型特徵。
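
The "exact plus semantic caching" control mentioned in this abstract can be sketched as a two-tier lookup. Everything here is an illustrative assumption: the class name, the 0.8 threshold, and the toy bag-of-words similarity (a production gateway would presumably use a learned embedding and an eviction policy).

```python
import math
from collections import Counter

def _embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def _cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

class TwoTierCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.exact = {}    # normalized query -> result
        self.entries = []  # (embedding, result) pairs for the semantic tier

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())

    def put(self, query, result):
        self.exact[self._key(query)] = result
        self.entries.append((_embed(query), result))

    def get(self, query):
        key = self._key(query)
        if key in self.exact:  # tier 1: exact match after normalization
            return self.exact[key], "exact"
        q = _embed(query)
        best = max(self.entries, key=lambda e: _cosine(q, e[0]), default=None)
        if best is not None and _cosine(q, best[0]) >= self.threshold:
            return best[1], "semantic"  # tier 2: close enough paraphrase
        return None, "miss"

cache = TwoTierCache()
cache.put("capital of France", "Paris")
print(cache.get("Capital  of france"))
print(cache.get("the capital of France"))
print(cache.get("weather in Tokyo"))
```

The exact tier is cheap and safe; the semantic tier trades a small risk of stale or mismatched answers for a higher hit rate, which is why the threshold is exposed as a tunable control.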

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

2606.18946v1 by Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

摘要：句子級別的 AI 生成文本檢測 (S-AGTD) 針對混合文檔，即人類和大型語言模型共同創作的文本，面臨兩個缺口：現有方法將每個句子孤立分類，忽略了句子之間的依賴關係，而現有基準則省略了最新一代生成器。我們構建了 MOSAIC，一個包含 16,000 篇混合文檔的基準，這些文檔來自 PubMed 和 XSum，由 DeepSeek-V3.2 和 Kimi K2 在嚴格的質量控制下生成，包括一個在先前基準中缺失的困惑度一致性過濾器。我們將 S-AGTD 重新構建為對文檔句子序列的結構化預測，並將其具體化為 SenFlow，將基於圖的句子間傳播與線性鏈 CRF 解碼整合在單個文檔級別的句子圖上進行處理。SenFlow 在 MOSAIC 上達到了最先進的性能，在跨域轉移的三個難度逐漸增加的協議中，平均 Macro-F1 邊際提高了 +4.15 個百分點。我們進一步發現，即使在困惑度過濾器平衡了明顯的線索後，AI 插入仍然保留了一個依賴於生成器的句子長度差距，而句子級別的檢測器仍然可以利用這一點。代碼和數據：https://github.com/luojingkun22/SenFlow
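
The linear-chain CRF decoding step named in this abstract is typically done with the Viterbi algorithm. The sketch below shows only that decoding step under made-up scores; it omits SenFlow's graph-based propagation, and the label encoding (0 = human, 1 = AI) is an assumption for the example.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y for sentence t
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            # Best previous label for ending at label y on sentence t.
            prev = max(range(n_labels), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[prev] + transitions[prev][y] + emissions[t][y])
            ptr.append(prev)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical scores: the middle sentence looks slightly human in isolation.
emissions = [[0.0, 2.0], [0.6, 0.4], [0.0, 2.0]]
transitions = [[0.5, -0.5], [-0.5, 0.5]]  # staying in a label is favored
print(viterbi(emissions, transitions))  # → [1, 1, 1]
```

Classifying each sentence independently would label the middle sentence as human; decoding over the whole sequence lets its neighbors override a weak local signal, which is the inter-sentence dependency the abstract argues for.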

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

2606.18941v1 by Pierre Dantas, Lucas Cordeiro, Waldir Junior

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

摘要：PLCopen XML 定義了兩種 IEC 61131-3 梯形圖程序的編碼格式：一種是使用 <rung> 元素的文本編碼，另一種是將梯級邏輯表示為本地 ID/refLocalId 連接的有向圖的圖形編碼。ESBMC-PLC 支持文本格式，但將來自 CONTROLLINO、Beremiz 和 OpenPLC 編輯器的圖形導出解析為空的 GOTO 中間表示，導致虛無的驗證成功。本文提出了 Graph-ESBMC-PLC，通過基於 DFS 的圖形 LD 解析器填補了這一空白。該解析器從左電源軌遍歷連接圖到每個線圈，將梯級路徑提取為布爾接觸聯接，並應用三層 I/O 推斷方案。按右電源軌的 connectionPointIn 順序排列線圈，確保 SET 線圈在 RESET 線圈之前處理，符合 IEC 掃描週期語義。圖形到 IR 的轉換保持 ESBMC 後端不變。對來自 CONTROLLINO/OpenPLC 編輯器的 3 個圖形 LD 程序的驗證顯示，所有程序都生成完整的 GOTO IR，具有非確定性輸入和梯級邏輯，而不是之前的空 IR。所有 3 個在 k=2 下的驗證時間小於 70 毫秒。11 個文本 LD 基準完全保留，沒有回歸。報告了兩個不含 LD 內容或不支持計時器語義的 Beremiz 示例作為發現的限制。文檔在 Zenodo 上（DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856）。
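
The resolver's core idea, a depth-first traversal from the left power rail to a coil that collects contacts along each path, can be sketched as follows. The dictionary layout is a simplified, hypothetical stand-in for parsed PLCopen XML (only `kind`, `var`, and `refs` are modeled); negated contacts, the I/O inference tiers, and coil ordering are left out, and the graph is assumed to be acyclic.

```python
def rung_paths(elements, coil_id):
    """Enumerate contact paths from the left power rail to one coil.

    elements maps a localId to {"kind", "var", "refs"}, where refs lists the
    refLocalId values of the elements feeding it.
    """
    # Invert the refLocalId links so we can walk from the rail toward the coil.
    succ = {eid: [] for eid in elements}
    for eid, el in elements.items():
        for ref in el["refs"]:
            succ[ref].append(eid)

    paths = []

    def dfs(node, contacts):
        el = elements[node]
        if el["kind"] == "contact":
            contacts = contacts + [el["var"]]
        if node == coil_id:
            paths.append(contacts)  # one path = one conjunction of contacts
            return
        for nxt in succ[node]:
            dfs(nxt, contacts)

    for eid, el in elements.items():
        if el["kind"] == "leftPowerRail":
            dfs(eid, [])
    return paths

# A made-up seal-in rung: (start OR seal) AND not_stop drives the motor coil.
elements = {
    1: {"kind": "leftPowerRail", "var": None, "refs": []},
    2: {"kind": "contact", "var": "start", "refs": [1]},
    3: {"kind": "contact", "var": "seal", "refs": [1]},
    4: {"kind": "contact", "var": "not_stop", "refs": [2, 3]},
    5: {"kind": "coil", "var": "motor", "refs": [4]},
}
paths = rung_paths(elements, coil_id=5)
print(" OR ".join("(" + " AND ".join(p) + ")" for p in paths))
```

Each path becomes a conjunction of its contacts and the coil is the disjunction of all paths, so a resolver that finds no path at all would produce exactly the empty logic that led to the vacuous verification results described above.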

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

2606.18936v1 by Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

摘要：大型語言模型（LLMs）越來越多地嵌入於科學人工智慧（AI4Science）工作流程中，從科學問題回答和文獻分析到實驗室規劃和自主發現。這一進展迫切需要安全基準，不僅評估科學能力，還要檢視模型是否能識別並避免在高風險科學背景下的風險。現有的AI4Science安全數據集涵蓋了幾個學科和任務格式，但未明確規定潛在的風險維度。我們介紹了SciRisk-Bench，這是一個旨在從兩個互補的角度評估AI4Science安全性的基準：明確的風險維度和科學學科。SciRisk-Bench涵蓋了7個學科、31個子學科和10個風險維度。在實驗部分，我們評估了主流LLMs和以科學為導向的LLMs在風險維度、學科和子學科方面的表現，使我們能夠細緻診斷科學模型在哪些方面仍然不安全。

2606.18932v1 by Xingchen Yan, Jian Ge, Qingtian Liu, Kevin Willis, Quanquan Hu, Jiapeng Zhu

Motivated by the observational incompleteness of intermediate-to-long-period Earth-size planets, we present TransitNet, a compact attention-augmented deep-learning framework for low-SNR transit blind searches. To enable realistic method development and objective threshold calibration under blind-search conditions, we develop a unified dataset construction, benchmarking, and threshold-selection framework. On recovery benchmarks constructed from unseen Kepler targets, TransitNet attains 95.2 percent accuracy in the challenging SNR range of 6 to 8 and outperforms both TLS and BLS, achieving ROC-AUC and PR-AP values of 0.974 and 0.982, respectively. In an injected Earth-size and sub-Earth-size transit recovery experiment, TransitNet achieves a recovery rate of 93.0 percent, substantially exceeding those of TLS (63.1 percent) and BLS (60.0 percent). In addition to detection, TransitNet provides attention-based estimates of transit windows and midpoints. On an independent evaluation set, 97.4 percent of injected transits are fully covered by the estimated transit window. Applied to real Kepler observations, the model successfully recovers all 34 selected confirmed Kepler planets, with a mean absolute transit midpoint error of 1.24 hours. The model combines a compact footprint of about 1.5 MB with high inference efficiency, yielding speed-ups of about 12 to 25 times relative to CPU-TLS and about 4 to 5 times relative to CPU-BLS. These results demonstrate that TransitNet provides an accurate, scalable, and computationally efficient framework for low-SNR transit blind searches in the tested regime and motivate its extension to longer-period Earth-size planet searches.

摘要：受到中長期地球大小行星觀測不完整性的激勵，我們提出了TransitNet，一個緊湊的注意力增強深度學習框架，用於低信噪比的過境盲搜索。為了在盲搜索條件下實現現實的算法開發和客觀的閾值校準，我們開發了一個統一的數據集構建、基準測試和閾值選擇框架。在從未見過的開普勒目標構建的恢復基準上，TransitNet在具有挑戰性的信噪比範圍6到8內達到了95.2%的準確率，並且超越了TLS和BLS，分別達到0.974和0.982的ROC-AUC和PR-AP值。在一個注入的地球大小和亞地球大小的過境恢復實驗中，TransitNet達到了93.0%的恢復率，顯著超過了TLS（63.1%）和BLS（60.0%）。除了檢測外，TransitNet還提供基於注意力的過境窗口和中點的估計。在一個獨立的評估集上，97.4%的注入過境完全被估計的過境窗口覆蓋。應用於真實的開普勒觀測，該模型成功恢復了所有34個選定的確認開普勒行星，平均絕對過境中點誤差為1.24小時。該模型結合了約1.5 MB的緊湊佔用空間和高推理效率，相對於CPU-TLS的速度提升約為12到25倍，相對於CPU-BLS的速度提升約為4到5倍。這些結果表明，TransitNet為在測試範圍內的低信噪比過境盲搜索提供了一個準確、可擴展且計算效率高的框架，並激勵其擴展到更長周期的地球大小行星搜索。

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

2606.18922v1 by Jasmine Owers, Edwin Simpson, Martha Lewis

Figurative language and negation are two areas that challenge current language models; however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

摘要：比喻語言和否定是當前語言模型面臨的兩個挑戰領域，然而，這兩者在書面和口語語言中都被廣泛使用。大型語言模型（LLMs）在日常情境中也被廣泛使用，而這些情境未必能針對特定數據集進行調整。因此，了解LLMs正確解釋包含否定和比喻語言的文本的能力至關重要。為了調查這一點，我們對現有的比喻語言數據集開發了一組新的註釋，並對該數據集進行了一系列語言模型的測試。我們發現，否定和比喻的結合可能會帶來特定的挑戰，整體表現以及不同否定類型的表現特別依賴於使用的提示風格。

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

2606.18910v1 by Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

摘要：測試時間的擴展透過連續修訂已成為增強大型語言模型（LLM）推理的強大範式。然而，標準的後訓練方法主要優化單次目標，這與多步推理動態之間存在根本的不對齊。雖然最近的研究將此視為多回合強化學習（RL），但傳統方法直接在多步軌跡上進行優化，未能進一步利用模型可以從修正中學習的中間步驟中的高質量錯誤。我們提出了一個兩階段的迭代框架，交替進行在線數據/提示增強和策略優化。通過將成功恢復軌跡中的中間步驟（“近失”答案）轉換為解耦的修訂和驗證提示，我們的方法專注於有效的答案轉換和錯誤識別的訓練。這種方法使得高效的離線數據生成成為可能，並且相比於標準的多回合強化學習，減少了長期取樣的計算開銷。在 LiveCodeBench 上，使用公開可用的測試案例作為反饋，我們觀察到相較於 RL 基準提高了 +6.5 分，相較於標準的多回合訓練提高了 +4.0 分。除了編碼之外，我們的方法在圓形打包問題上達到了之前報告的 SOTA 結果，並且使用了最小的基礎模型（4B）以及比更大規模的進化搜索系統少得多的回合次數。在真實驗證下的數學結果進一步確認了改進的修正能力。它還能推廣到如 n_queens 和 mini_sudoku 等超出分佈的約束滿足謎題，其中正確性完全由問題約束定義。代碼可在 https://github.com/yxliu02/REVES.git 獲得。

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

2606.18897v1 by Jiangnan Xia, Xuansheng Wu, Yu Yang, Xin Wang, Ninghao Liu

Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high-information-density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. These priors comprise personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.

摘要：意圖導向的推薦系統因為能夠通過建模用戶行為背後的基本動機來提高準確性和可解釋性而受到廣泛關注。大多數現有模型通過聚類或原型學習直接從用戶序列中推導出意圖。然而，它們對序列質量敏感，需要預設意圖的數量，並且缺乏明確的語義基礎。這些問題導致意圖集不完整且粗糙，限制了推薦的有效性。在本文中，我們提出了基於意圖的推薦的稀疏自編碼器（SAERec），這是一種新穎的推薦系統，能夠自動從文本語料庫中構建細粒度和可解釋的意圖空間以指導推薦。SAERec並不將文本視為輔助信號，而是將其視為意圖構建的高信息密度證據。具體而言，我們首先使用稀疏自編碼器（SAE）從大型語言模型（LLMs）的潛在空間中提取一組全面的細粒度可解釋意圖，以解開和解釋文本嵌入，從而將與意圖相關的語義與文本噪聲隔離。然後，對於每個用戶，我們從這組意圖中檢索相關意圖作為先驗，以指導推薦。它包含與用戶當前興趣相匹配的個人意圖和捕捉跨用戶共享的一般項目模式的公共意圖（例如，質量、價格）。最後，為了將檢索到的意圖整合到序列建模中，我們提出了一種多分支注意機制，該機制捕捉時間依賴性並注入個人和公共意圖信號，隨後是一個自適應融合層，用於構建最終用戶表示以進行推薦。在公共數據集上的廣泛實驗顯示，SAERec的優越性，持續超越最先進的基準，同時提供人類可理解的解釋。

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

2606.18893v1 by Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.

摘要：多模態情感-原因配對提取（MECPE）需要對候選配對具有可靠的配對信心。現有的配對評分器通常使用有效候選者的配對級交叉熵，這主要是獨立地處理連結。這使得競爭原因之間的相對信心幾何形狀受到約束，允許金標配對與困難的負樣本保持接近或依賴偶然的非金標上下文。我們將這種脆弱性研究為配對信心的脆弱性，並提出RPCL（穩健配對信心學習），這是一個僅用於訓練的配對信心學習框架。RPCL鼓勵配對信心既具辨別性又穩定：金標配對通過信心差異邊際約束與行級困難負樣本分開，並且乾淨的配對預測與來自一個部分損壞的視圖的預測對齊，其中非金標上下文話語表示部分損壞。在推理時，原始的乾淨配對評分器和解碼管道保持不變。在ECF、MECAD和MEC4上，RPCL在完整的文本-音頻-視頻設置中將三種種子均值配對F1提高了2.58到2.83個百分點，並且在所有三個數據集上提高了均值配對AUPRC。診斷分析進一步顯示出更大的金標-負樣本信心差距和較低的邊際違規嚴重性。這些結果表明，明確塑造配對信心是一種有效的MECPE訓練策略。
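
The confidence-difference margin constraint against row-wise hard negatives can be made concrete with a hinge-style penalty. The abstract does not give RPCL's exact loss, so the form below (mean hinge on the gap between each gold pair and the highest-scoring non-gold pair in the same row) and the margin value are assumptions for illustration only.

```python
def row_margin_loss(scores, gold, margin=0.2):
    """Hinge penalty between each gold pair and its hardest row-wise negative.

    scores[i][j] : confidence that utterance j is a cause of emotion utterance i
    gold         : set of (i, j) gold emotion-cause pairs
    """
    losses = []
    for i, j in sorted(gold):
        negatives = [s for k, s in enumerate(scores[i]) if (i, k) not in gold]
        if not negatives:
            continue  # nothing to separate from in this row
        hardest = max(negatives)
        # Zero once the gold pair leads the hardest negative by at least `margin`.
        losses.append(max(0.0, margin - (scores[i][j] - hardest)))
    return sum(losses) / len(losses) if losses else 0.0

# Made-up confidences: row 0 has a hard negative close to the gold pair.
scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
]
gold = {(0, 0), (1, 1)}
print(row_margin_loss(scores, gold))
```

Pair-level cross entropy would already consider both gold pairs reasonably well scored here; the margin term still penalizes row 0 because a competing cause sits only 0.1 below the gold pair.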

Skill-Guided Continuation Distillation for GUI Agents

2606.18890v1 by Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

摘要：改善 GUI 代理通常依賴於專家軌跡上的行為複製。然而，隨著當前政策偏離專家政策，它不可避免地在閉環執行過程中遇到政策引起的非軌跡狀態，即那些落在專家軌跡之外的狀態。由於專家軌跡對這些未見狀態沒有提供示範，因此這些狀態無法獲得有效的監督，使得政策無法選擇正確的行動。為了填補這一監督空白，我們提出了技能引導的延續蒸餾（SGCD），這是一個迭代自我改進的框架。SGCD 首先在沒有技能引導的情況下運行普通政策幾步，以達到現實的非軌跡狀態。從這些狀態中，技能引導的政策然後完成任務並產生成功的延續，這些延續與專家軌跡混合，以提供對政策引起的非軌跡狀態的監督。這些技能來自於成功和失敗的回合，包含延續計劃、關鍵目標、失敗陷阱和成功標準。在 OSWorld-Verified 上，SGCD 將三個基礎模型的成功率從低於 30% 提高到超過 50%，展示了其有效性和普遍性。

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

2606.18889v1 by Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.

摘要：基於文本的遠程醫療越來越依賴輕量級的患者反饋，然而，這種反饋主要反映的是感知的溝通質量，而非醫療準確性。我們介紹了一個由LM指導的反事實推薦管道，該管道發現並精煉可解釋的溝通特徵，如語調、個性化、可行性和全面性，以解決患者的擔憂，而不干擾醫療內容。這些特徵與患者-醫生互動的元數據一起用來估計正面反饋。在推理時，系統搜索低成本的序數特徵變化，並推薦預測能增加正面反饋概率的最小溝通變化，同時獨立審核模型測試這些增益是否超越選擇模型的範疇。在互動中，推薦在獨立審核者下產生了平均+6.41%的預測正面反饋概率增益，並且93.31%的推薦是非負的。這些結果表明，小的、可解釋的溝通變化可以捕捉到大多數預測增益，同時保持醫生對醫療推理和最終措辭的控制。
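
The inference-time search over low-cost ordinal feature changes can be sketched as a small bounded enumeration. The feature names come from the abstract, but the 0 to 4 scale, the change budget, the tie-breaking rule, and the toy logistic predictor are all assumptions made for this example; the paper's selection model and auditor models are not reproduced here.

```python
import math
from itertools import product

def recommend(features, predict, max_level=4, budget=2):
    """Find the ordinal upgrade within budget that most raises predicted feedback.

    features : dict of feature name -> ordinal level (0..max_level)
    predict  : callable returning P(positive feedback) for a feature dict
    budget   : maximum total number of one-level increases considered
    Returns (gain, cost, candidate_features).
    """
    names = sorted(features)
    base = predict(features)
    best = (0.0, 0, dict(features))
    for deltas in product(range(budget + 1), repeat=len(names)):
        cost = sum(deltas)
        if cost == 0 or cost > budget:
            continue
        cand = {n: min(max_level, features[n] + d) for n, d in zip(names, deltas)}
        gain = predict(cand) - base
        # Prefer the larger gain; on ties keep the smaller change.
        if gain > best[0] + 1e-12 or (abs(gain - best[0]) <= 1e-12 and cost < best[1]):
            best = (gain, cost, cand)
    return best

def toy_predict(f):
    """Hypothetical logistic model of positive feedback (weights are made up)."""
    z = -2.0 + 0.2 * f["tone"] + 0.6 * f["personalization"] + 0.3 * f["actionability"]
    return 1.0 / (1.0 + math.exp(-z))

example_features = {"tone": 2, "personalization": 1, "actionability": 3}
gain, cost, suggestion = recommend(example_features, toy_predict)
print(suggestion, round(gain, 3))
```

Capping the budget keeps the recommendation small enough for a doctor to act on, and scoring the same candidate with a separate auditor model (not shown) is what would guard against gains that exist only in the selection model.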

2606.18888v1 by Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

摘要：在部分可觀察環境中的導航對於自主代理來說是一項重大挑戰，這需要在未知環境中以有限的感知信息進行有效的決策。基於信念的方法，特別是那些使用神經網絡來近似信念空間的方法，通常無法捕捉信念空間固有的多模態性，尤其是在具有感知別名的高維情況下。雖然生成模型提供了一個引人注目的替代方案，但它們通常需要大量數據或專家示範，並且缺乏明確的長期規劃機制。在本文中，我們介紹了BeliefDiffusion，一個結合生成和規劃優勢的新框架。BeliefDiffusion利用擴散模型來明確描述多模態信念分佈，並利用模型預測控制（MPC）來同時進行前瞻性規劃。它由兩個步驟組成：(1) 根據觀察歷史想像合理的環境配置和 (2) 在聚合配置中規劃有效的導航策略。通過在合成地圖環境中的廣泛實驗，我們證明BeliefDiffusion在導航成功率和路徑效率上顯著優於無模型強化學習基準和其他生成方法。我們的結果驗證了在規劃中明確納入多模態信念表示能夠實現更穩健的部分可觀察環境導航。

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

2606.18882v1 by Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.

摘要：這項工作探討了針對回歸任務應用領域轉移感知神經網絡，以估算在變化操作條件下旋轉軸上的不平衡質量。實驗數據是從一個測試裝置中收集的，該裝置中一根主軸配備有承載不平衡質量的法蘭，以不同的轉速驅動，而一根次軸則可以選擇性啟動以引入領域差異。不平衡質量被定位在固定的徑向距離，系統的動態響應是使用三軸加速度計記錄的。質量估算的逆問題是在一個領域適應框架內進行公式化的，其中網絡使用最大均值差異策略進行訓練，以對齊源和目標分佈之間的特徵表示。結果顯示，明確處理領域轉移在提高預測準確性方面的有效性，特別是在系統的物理行為和領域差異的來源未完全了解且超出訓練條件時。這些發現突顯了領域轉移感知模型在結構健康監測中進行回歸任務的潛力。
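
The maximum mean discrepancy (MMD) used here as the alignment criterion can be computed directly from two sets of feature vectors. The sketch below uses the standard biased estimator with an RBF kernel; the kernel choice, the `gamma` value, and the toy vectors are assumptions, since the abstract does not state the configuration used in the paper.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))
    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2 * mean_kernel(source, target)
    )

# Made-up features: the target domain is the source shifted along one axis.
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
target = [[1.0, 0.1], [1.2, -0.1], [0.9, 0.0]]
print(mmd2(source, source), mmd2(source, target))
```

During training, a term like this would be added to the regression loss so that features from the two operating conditions become harder to tell apart while the mass estimate stays accurate on the labeled source data.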

Efficient Financial Language Understanding via Distillation with Synthetic Data

2606.18875v1 by Wen-Fong, Huang, Edwin Simpson

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

摘要：大型指令跟隨模型強大但部署成本高，特別是在金融領域，因為標註數據受到保密性和專家標註成本的限制。我們提出了一個通過合成數據進行金融情感分析的高效框架，將知識從大型指令調整的教師模型轉移到緊湊的學生模型。該框架設計用於低資源條件，其中一小組真實範例由人工收集和標註。然後，該框架對範例進行聚類，並利用這些聚類選擇種子，以通過結構化的少量提示生成合成範例。實驗表明，基於聚類的種子選擇比隨機抽樣產生更具代表性的合成數據，使緊湊模型在最小監督下實現強大性能。值得注意的是，在更複雜和噪聲較多的文本領域，基於完整合成種子語料庫訓練的緊湊模型甚至超越了教師模型，同時在正式文本上仍保持競爭力。該框架為在金融自然語言處理中以最小的人力標註努力實現資源高效的領域適應提供了一條實用的途徑。
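
The clustering-based seed selection described above can be sketched as follows. This is a minimal illustration, assuming plain k-means over precomputed example embeddings and picking the example nearest each centroid as the seed; the abstract does not state which clustering algorithm or selection rule the authors use, and all names and values here are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic initialisation (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

def select_seeds(points, k):
    """For each cluster, return the index of the example closest to its centroid."""
    centroids, assignment = kmeans(points, k)
    seeds = []
    for c in range(k):
        member_ids = [i for i, a in enumerate(assignment) if a == c]
        if member_ids:
            seeds.append(min(member_ids,
                             key=lambda i: squared_distance(points[i], centroids[c])))
    return seeds

# Hypothetical 2-D embeddings of six labelled examples forming two groups.
embeddings = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
seed_indices = select_seeds(embeddings, k=2)
```

The selected examples would then serve as few-shot demonstrations in the generation prompt, one seed per cluster, so that the synthetic data covers each region of the real data.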

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

2606.18874v1 by Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

摘要：AI 系統可以越來越多地自動化科學工作流程，但將先前證據、生成的想法、實驗和最終主張聯繫起來的推理通常仍隱含在模型推斷中。在此，我們介紹 Xcientist，一個將研究綜合和實驗驗證外部化為可檢查的、受合同約束的過程的研究工具。Xcientist 將文獻證據、想法狀態、實施計劃、消融記錄和修復痕跡組織為持久的研究文物，以便生成的機制可以在不失去其證據基礎的情況下進行基礎化、執行、測試和修訂。我們將主張漂移確定為自動化研究的一種失效模式，其中可運行的文物不再支持最初聲稱的機制。在無需訓練的記憶系統、圖結構的交通預測和多尺度物理知識驅動的神經網絡中，Xcientist 保留了從問題表述到機制設計、驗證和有限修訂的可追溯軌跡。這些結果表明，AI 科學家應該不僅根據他們的最終文物進行評估，還應根據他們的綜合和驗證過程是否保持可歸因、可檢查和科學負責進行評估。

Scaling Learning-based AEB with Massive Unlabeled Data

2606.18864v1 by Xiangyu Wang, Yang Zhan, Mengxiang Hao, Chuanchuan Zhong, Yansong Jia, Junjie Zhang, Yu Han, Xin Jiang, Zhen Cao, Ying Wang, Yulun Song, Zhitao Xu

This paper studies how to scale learning-based automatic emergency braking (AEB) with massive unlabeled fleet data under production constraints. Our approach is based on meta-feedback semi-supervised learning (MF-SSL), where a teacher generates pseudo labels for unlabeled driving data and is updated using a small labeled anchor set as safety-critical feedback. In production, anchor ambiguity and labeled-unlabeled mismatch can amplify systematic pseudo-label errors, leading to spurious triggers. We propose a stabilized MF-SSL framework with (i) Noise-Aware Decoupling, which removes ambiguity-prone anchors from the teacher's supervised update path, and (ii) kinematics-gated pseudo-labeling with a teacher conflict penalty to suppress mismatch-induced risk hallucinations on unlabeled data while maintaining broad coverage. Extensive experiments show consistent gains as unlabeled data scale from 1M to 1B windows, improving safety while keeping comfort stable. The 1B-trained student model is deployed to hundreds of thousands of vehicles and validated over $10^9$ km of driving, achieving a positive-to-false activation ratio exceeding 100:1 and a 35% improvement in accident-free driving mileage over a production rule-only baseline.

摘要：這篇論文研究如何在生產限制下，利用大量未標記的車隊數據來擴展基於學習的自動緊急制動（AEB）。我們的方法基於元反饋半監督學習（MF-SSL），其中一個教師為未標記的駕駛數據生成偽標籤，並使用一小組標記的錨點集作為安全關鍵的反饋進行更新。在生產中，錨點的模糊性和標記-未標記的不匹配可能會放大系統性的偽標籤錯誤，導致虛假觸發。我們提出了一個穩定的MF-SSL框架，具有（i）噪聲感知解耦，這會從教師的監督更新路徑中移除易受模糊影響的錨點，以及（ii）運動學門控偽標籤生成，並引入教師衝突懲罰，以抑制對未標記數據的不匹配引起的風險幻覺，同時保持廣泛的覆蓋範圍。大量實驗顯示，當未標記數據從1M擴展到1B窗口時，性能穩定提升，改善安全性，同時保持舒適度穩定。經過1B訓練的學生模型已部署到數十萬輛車輛，並在超過 $10^9$ 公里的駕駛中進行驗證，實現了超過100:1的正向到虛假啟動比率，並在無事故駕駛里程上較僅依賴生產規則的基準提高了35%。

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

2606.18861v1 by Xinze Zhang

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

摘要：重建可供模擬使用的關節物體數位雙胞胎，仍然受到兩個持續存在的缺口的限制：(i) 部件級幾何重建與運動參數估計相互解耦，以及 (ii) 恢復的模型經常違反基本的動態不變性，如能量守恆，導致在物理模擬器中重播 URDF 時出現漂移。我們提出了 KinemaForge，一個基於約束的管道，從短的 RGB-D 序列中共同推斷部件級形狀、關節拓撲和關節參數，並將結果與基於可微剛體動力學構建的能量一致性驗證器進行驗證。該管道引入了三個組件：一個運動約束圖，將關節-部件的關聯編碼為軟邊；一個可微的螺旋軸求解器，通過 Featherstone 的關節體算法從渲染的觀察結果反向傳播到關節參數；以及一個能量殘差損失，對重建模型的非物理自由反應進行懲罰。在五個 PartNet-Mobility 類別和一個內部 RGB-D 基準測試中，KinemaForge 將平均關節軸誤差從 4.52 度降低到 2.83 度（-37.4%），相較於最強的幾何基線（PARIS），以及從 5.30 度降低到 2.83 度（-46.6%），相較於基於互動的 Ditto 基線，並在 50 秒的滾動中將長期模擬漂移降低 64%（與 PARIS 相比），產生的 URDF 在我們的初步評估中，其閉環操作成功率比 Ditto 提高了 14.6 個百分點。代碼和重建數據將在接受後發布。
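
The relative reductions quoted above follow directly from the reported joint-axis errors. As a quick check using only the numbers given in the abstract:

```latex
\[
\frac{4.52^\circ - 2.83^\circ}{4.52^\circ} \approx 0.374 \quad (\text{vs. PARIS}),
\qquad
\frac{5.30^\circ - 2.83^\circ}{5.30^\circ} \approx 0.466 \quad (\text{vs. Ditto}).
\]
```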

Approximate Structured Diffusion for Sequence Labelling

2606.18856v1 by Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

Sequence labelling, a core task of Natural Language Processing (NLP), consists in assigning each token of an input sentence a label. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (e.g., label bigrams) which can limit their expressivity and hurt performance when long-range dependencies are required. We show we can leverage diffusion to train a CRF conditioned on an entire label sequence, with the caveat that the condition is on a noisy version of labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy with a 16.5% error reduction for POS-tagging.

摘要：序列標註，作為自然語言處理（NLP）的核心任務，旨在為輸入句子的每個標記分配一個標籤。從機器學習的角度來看，序列標註通常被視為由神經網絡參數化的線性鏈條條件隨機場（CRF）。雖然這種方法提供了良好的實證結果，但CRF假設有限的決策範圍（例如標籤二元組），這可能限制其表達能力，並在需要長距離依賴時影響性能。我們展示了如何利用擴散來訓練一個以整個標籤序列為條件的CRF，但需注意條件是基於標籤的噪聲版本。我們的實驗表明，這種方法結合近似CRF推斷，提高了標籤準確性，對於詞性標註減少了16.5%的錯誤率。
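
To make the "finite decision span" of a linear-chain CRF concrete, the sketch below shows standard Viterbi decoding, where the sequence score is a sum of per-token emission scores and label-bigram transition scores. It illustrates the baseline CRF only, not the diffusion-based method of the paper; the score tables are made-up toy values.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of the label bigram (p, y)
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for y in range(num_labels):
            # Best previous label for ending in y at position t.
            best_prev = max(range(num_labels),
                            key=lambda p: scores[p] + transitions[p][y])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][y]
                              + emissions[t][y])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda y: scores[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores for a 3-token sentence and 2 labels.
emissions = [[2.0, 0.0], [0.0, 0.5], [0.0, 1.0]]
sticky = [[0.0, -2.0], [-2.0, 0.0]]   # penalise label changes
neutral = [[0.0, 0.0], [0.0, 0.0]]    # no bigram preference

sticky_path = viterbi(emissions, sticky)
neutral_path = viterbi(emissions, neutral)
```

Because each transition score only sees two adjacent labels, any dependency between distant labels has to be carried through the emission scores, which is the limitation the abstract points to.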

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

2606.18850v1 by Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

摘要：抽象摘要在促進對科學文獻的有效理解中扮演著至關重要的角色，但它本質上需要語言流暢性和事實忠實性。現有的方法往往無法調和這兩個要求。抽取式方法依賴於僵化的句子拼接，這會破壞宏觀層面的邏輯一致性，而基於大型語言模型（LLM）的生成方法，儘管在語言流暢性上表現出色，但在事實一致性方面卻有限。在本研究中，我們提出了ScholarSum，一種層次反思圖形基礎框架，模擬學生-教師的寫作過程，以實現流暢且忠實的科學摘要。ScholarSum首先通過將文檔劃分為語義上連貫的單元，將其組織成層次知識圖，這些多層社群結構捕捉了全球邏輯和宏觀主題。在這一全球結構的指導下，學生生成初步草稿，然後通過細緻的證據檢索進行精煉。為了確保事實一致性，類似教師的審閱者隨後反覆檢查草稿，識別不支持的內容，並促使針對性的重新檢索和重寫，直到摘要達到嚴格的質量標準。大量實驗表明，ScholarSum在完整性和忠實性方面顯著超越了以往的基準。我們的代碼可在 https://github.com/Xiaoyu-Tao/ScholarSum 獲得。

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

2606.18847v1 by Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

摘要：為了在真實的家庭中長時間協助人類，具身代理必須記住用戶的日常活動、世界狀態和過去的互動。現有的長期記憶基準主要評估以語言為中心的檢索和問題回答，而具身基準則通常專注於短期任務執行，並未測試在動態環境中使用長期記憶。我們介紹了WorldLines，一個以項目為驅動的長期具身家庭協助基準。它構建了包含對話、行動、執行反饋、物體和設備狀態變化的時間延展家庭痕跡，並將其轉換為與證據相關的樣本，用於記憶問答和具身任務規劃。我們進一步提出了ObsMem，一個以觀察者為基礎的記憶框架，維護可見性意識的記憶和行動原生狀態軌跡，以便進行狀態意識的決策。實驗顯示在部分可觀察性、被覆蓋的世界狀態以及將長期記憶轉化為具身計劃方面存在持續的挑戰，而ObsMem則為這種設定提供了更強的參考架構。

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

2606.18837v1 by Hehai Lin, Qi Yang, Chengwei Qin

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

摘要：大型語言模型（LLM）基礎的自動多代理系統（MAS）生成已成為應對複雜任務的重要前沿。然而，現有方法面臨模型能力與經驗保留之間的困境。推理時的 MAS 利用凍結的前沿 LLM，但在沒有從過去經驗中學習的情況下重複相同的搜索。相反，訓練時的 MAS 通過梯度更新內化經驗，但受到較小模型低能力上限的限制，並且難以擴展到大型前沿 LLM。為了填補這一空白，我們提出了 Skill-MAS，一條新穎的第三條路徑，通過將高層次的編排能力概念化為可演變的元技能，將經驗保留與參數更新解耦。Skill-MAS 通過閉環優化循環來精煉這一架構知識：（1）多軌跡回放在當前元技能下為每個任務採樣行為分佈；（2）選擇性反思自適應地選擇優先任務，並應用分層對比分析將系統經驗提煉為可泛化的策略級原則。跨越四個複雜基準和四個不同 LLM 的廣泛實驗表明，Skill-MAS 不僅實現了顯著的性能提升，還保持了有利的成本性能權衡。進一步分析顯示，演變的元技能具有高度的穩健性，並在未見任務和不同 LLM 之間表現出強大的可轉移性。

Target-confidence Recourse Using tSeTlin machines: TRUST

2606.18832v1 by K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

摘要：反事實解釋廣泛用於提供高風險決策系統中的算法補救措施。大多數現有方法尋求對輸入進行最小改變，以翻轉模型的決策。然而，決策者往往不僅依賴預測標籤，還依賴置信閾值和風險邊際。剛好跨越決策邊界的反事實在噪聲或模型變化下可能是脆弱和不穩定的。在本文中，我們提出了使用 tSeTlin 機器的目標置信補救（TRUST），這是一個框架，使用者明確指定補救所需的預測置信度。TRUST 不是生成反事實然後評估置信度，而是直接尋找滿足用戶定義的置信目標的最小變更，從而使補救選項在成本、置信度和穩健性方面進行比較。我們使用結合貝葉斯優化的概率 Tsetlin 機器（PTM）來實現 TRUST。PTM 的概率子句結構將預測置信度與決策規則的穩定性聯繫起來。我們展示了滿足相同規則的反事實仍然可能在可靠性上有顯著差異，這取決於它們滿足這些規則的安全程度，從而揭示決策是否由穩健或脆弱的子句激活支持。在合成和現實世界數據集上的實驗表明，目標置信反事實產生的補救措施比傳統基於邊界的方法更穩健且可解釋。在多個基準測試中，TRUST 在保持低補救成本的同時實現了完美的穩健性，包括在 Haberman 數據集上以 0.92 置信度達到 0.10 的 L2 距離。通過明確控制置信度並揭示規則級別的穩定性，TRUST 為高風險決策支持提供了可行的補救措施。
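
The difference between a boundary-crossing counterfactual and a target-confidence one can be illustrated with a much simpler model than the one in the paper. The sketch below swaps in a logistic classifier, for which the smallest L2 change reaching a given confidence has a closed form; it does not use a Probabilistic Tsetlin Machine or Bayesian optimization, and the weights, bias, and input are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def target_confidence_recourse(x, weights, bias, target_confidence):
    """Smallest L2 change to x so that a logistic model reaches the target confidence."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    gap = logit(target_confidence) - score
    if gap <= 0:
        return list(x), 0.0  # already confident enough, no change needed
    norm_sq = sum(w * w for w in weights)
    # Move along the weight vector, the steepest direction for the score.
    counterfactual = [v + gap * w / norm_sq for v, w in zip(x, weights)]
    return counterfactual, gap / math.sqrt(norm_sq)

weights, bias, x = [1.0, 2.0], -1.0, [0.0, 0.0]   # illustrative values only
boundary_cf, boundary_cost = target_confidence_recourse(x, weights, bias, 0.5)
confident_cf, confident_cost = target_confidence_recourse(x, weights, bias, 0.92)
reached = sigmoid(sum(w * v for w, v in zip(weights, confident_cf)) + bias)
```

The boundary counterfactual is cheaper but sits exactly on the decision threshold, so any small perturbation can flip it back; the 0.92-confidence counterfactual costs more and leaves a margin, which is the trade-off the abstract describes.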

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

2606.18831v1 by Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

摘要：長期推理是大型語言模型的一項基本能力，特別是在它們作為必須對長期軌跡進行推理的自主代理時。強化學習（RL）最近已成為改善這一能力的主導範式，但現有的研究主要集中在獎勵工程上，而多樣化的訓練數據仍然稀缺。我們從數據中心的角度重新審視這個問題，並展示僅僅依靠一個簡單但有效的數據配方，結合一個最小的基於結果的GRPO設置，就足以顯著改善長期推理。我們的配方針對三個互補的任務家族——檢索、多證據綜合和推理——為此我們構建並策劃了八個數據集，總計約14K個示例。在三個模型（Qwen3-4B/8B/30B-A3B）上的實驗顯示，在七個長期推理基準上平均增益為+7.2/+3.2/+6.4分，超過了先前的RL訓練集。我們進一步證明，這些增益可以轉移到代理任務上，在一個經過我們數據配方調整的模型上持續進行RL訓練，使GAIA提高了+4.8分，BrowseComp提高了+7.0分。我們將釋放我們的數據集，以促進未來的研究。

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

2606.18829v1 by Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

摘要：記憶基準測試對於 LLM 代理大多假設為單用戶環境，導致醫院、工作場所、校園和家庭的共享助手研究不足。在這些部署中，多個主體寫入共同的記憶池並根據不同的角色、範疇和關係進行查詢，因此記憶質量需要治理以及回憶。我們介紹了 GateMem，一個針對多主體共享記憶代理的基準。GateMem 共同評估合法長期請求的效用，包含狀態更新、跨上下文授權邊界的訪問控制，以及在明確刪除請求後面向代理的主動遺忘。它涵蓋醫療、辦公、教育和家庭領域，具有長篇多方情節、增量記憶注入、隱藏檢查點、結構化評判和洩漏目標註釋。在多樣的基準線和主幹模型中，沒有任何方法能同時實現強大的效用、穩健的訪問控制和可靠的遺忘。長上下文提示通常在高標記成本下產生最佳治理分數，而基於檢索和外部記憶的方法則降低成本，但仍然洩漏未經授權或已刪除的信息。這些結果顯示，目前的記憶代理仍然遠未達到可靠的共享機構部署。

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

2606.18828v1 by Chenghao Xu

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.

摘要：傳統方法將智慧置於代理中，無論是作為學習的策略還是搜索程序。我們則將智慧置於空間本身：一個場景在配置流形上誘導出一個黎曼度量，而行動則簡化為遵循該度量的測地線，而不是調用單獨的規劃器或碰撞檢查器。一個單一的編碼器-路由器網絡通過三組互補的參數實現這一理念——框架參數用於定向生成器，調製參數用於控制它們的空間傳播，以及基本係數用於確定它們的強度。這些組通過共享的半群-疊加機制結合，產生一個單一的黎曼度量場，形成一個緊湊的架構，其幾何形狀隨場景的複雜性自然縮放。在單一的兩障礙場景上進行訓練後，該模型在未見過的障礙配置上展示了強大的零樣本泛化，碰撞安全路徑成本與穿透障礙的路徑成本之間有著數量級的區別。

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

2606.18820v1 by Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen

Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information--action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information--action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

摘要：序列決策問題通常表現出信息和決策靈活性的非對稱演變：隨著決策週期的展開，代理人獲得更豐富的信息，同時可行的行動因操作截止、承諾或資源限制而過期。標準的馬爾可夫決策過程（MDP）公式通常將這一結構簡化為階段依賴的狀態描述和行動掩碼，從而掩蓋了決定哪些決策是緊急的、哪些可以延遲的嵌套信息--行動非對稱性。我們引入了成熟馬爾可夫決策過程（MMDPs），這是一種圍繞這一信息--行動非對稱性構建的公式。我們通過過期行動優先原則來描述其一個關鍵後果，該原則確定了必須在下一階段之前解決的行動。受到這一結構的啟發，我們開發了一個結構感知的強化學習框架，具有階段感知的政策設計、過期行動抽象以及增強學習與蒸餾的搜索。對於一個受控的多供應商補貨問題、日益複雜的簡化現金管理環境和一個生產規模的模擬器的實驗顯示，明確建模這種非對稱性提高了學習效率，並隨著決策問題的擴大而變得越來越有價值。

SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface

2606.18816v1 by Gourav Siddhad, Yogesh Kumar Meena

Hybrid brain-computer interfaces (BCIs) that integrate motor imagery (MI) and steady-state visual evoked potentials (SSVEP) provide high-dimensional neural decoding but typically exceed the computational limits of embedded hardware. To address this, we propose SwitchBraidNet, a compact EEG classification architecture designed for low-power deployment. The model employs a dual-path temporal braid to extract multiscale oscillatory features, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for direct band-power encoding. Furthermore, through systematic quantisation-aware training on the OpenBMI dataset, we compared SwitchBraidNet against four established baselines across FP32, FP16, and INT8 precisions. Experimental results demonstrate superior efficiency and performance, achieving MI accuracy of 69.49% (FP16), SSVEP accuracy of 93.48% (FP32), and a hybrid information transfer rate of 64.82 bits/min (FP16). With an INT8 footprint of only 3.03 KB, SwitchBraidNet maintains high accuracy across varying numerical precisions, demonstrating its suitability for low-power embedded BCI deployment.

摘要：混合腦機介面（BCI）整合了運動意象（MI）和穩態視覺誘發電位（SSVEP），提供高維度的神經解碼，但通常超出了嵌入式硬體的計算限制。為了解決這個問題，我們提出了 SwitchBraidNet，一種為低功耗部署設計的緊湊型 EEG 分類架構。該模型採用了雙通道時間編織來提取多尺度振盪特徵，適應性擠壓和激勵空間開關用於電極閘控，以及一個對數方差讀出層用於直接帶功率編碼。此外，通過對 OpenBMI 數據集進行系統的量化感知訓練，我們將 SwitchBraidNet 與四個已建立的基準進行了比較，涵蓋 FP32、FP16 和 INT8 精度。實驗結果顯示出更高的效率和性能，達到 MI 準確率 69.49%（FP16）、SSVEP 準確率 93.48%（FP32），以及混合信息傳輸速率 64.82 位/分鐘（FP16）。SwitchBraidNet 在僅有 3.03 KB 的 INT8 足跡下，保持了在不同數值精度下的高準確率，顯示出其適合用於低功耗嵌入式 BCI 部署。
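
The log-variance readout mentioned above can be sketched in a few lines. For a zero-mean band-passed signal, variance over time is proportional to band power, so taking its logarithm gives a compact power feature per channel. The function name, the `eps` stabiliser, and the toy signals are assumptions for illustration, not details from the paper.

```python
import math

def log_variance_readout(feature_maps, eps=1e-6):
    """Map each channel's time series to the log of its variance (a band-power proxy)."""
    features = []
    for series in feature_maps:
        mean = sum(series) / len(series)
        variance = sum((v - mean) ** 2 for v in series) / len(series)
        features.append(math.log(variance + eps))
    return features

# Two hypothetical channels: a weak and a strong oscillation.
weak = [0.1, -0.1, 0.1, -0.1]
strong = [1.0, -1.0, 1.0, -1.0]
features = log_variance_readout([weak, strong])
```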

Reinforcement Learning Foundation Models Should Already Be A Thing

2606.18812v1 by Abdelrahman Zighem, Jill-Jênn Vie

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

摘要：語言和視覺的基礎模型由互聯網規模的數據驅動，而結構化領域（表格預測、時間序列預測、圖學習、強化學習）則不然。替代品是合成數據，這將負擔從數據收集轉移到先驗設計。對於許多結構化任務，這樣的先驗已經存在：TabPFN及其後續版本使用在合成貝葉斯先驗上預訓練的Transformer來解決表格分類問題。我們提出兩點。首先，強化學習是明顯的缺口：對合成MDP的採樣與對合成表格數據集的採樣同樣可行，但沒有任何上下文強化學習工作將先驗設計視為主要目標。其次，MDP允許固定大小的充分統計量，與觀察到的情節無關且呈表格形狀，這使得它們直接適合用於表格基礎模型的基於注意力的架構，並用策略頭替代監督目標。這些共同定義了強化學習基礎模型的議程。作為概念驗證，我們完全在合成MDP上訓練一個模型，並顯示在沒有特定任務調整的情況下，它能夠在上下文中解決保留的表格基準，無論是在線還是離線：在線時，所需的情節數量遠少於UCB-VI和表格Q學習；離線時，與VI-LCB相當。
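
To illustrate what a fixed-size, tabular statistic of an MDP might look like, the sketch below accumulates transition counts and reward sums. The abstract does not say which statistic the authors use; counts and reward sums are one standard choice for tabular MDPs and are assumed here, as are the function name and the toy transitions.

```python
def summarize_transitions(transitions, num_states, num_actions):
    """Accumulate visit counts and reward sums into fixed-size tables.

    transitions: iterable of (state, action, reward, next_state) tuples.
    The table sizes depend only on the number of states and actions,
    not on how many episodes have been observed.
    """
    counts = [[[0] * num_states for _ in range(num_actions)]
              for _ in range(num_states)]
    reward_sums = [[0.0] * num_actions for _ in range(num_states)]
    for state, action, reward, next_state in transitions:
        counts[state][action][next_state] += 1
        reward_sums[state][action] += reward
    return counts, reward_sums

# Hypothetical experience from a 3-state, 2-action MDP.
episode = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (0, 1, 0.0, 1)]
counts, reward_sums = summarize_transitions(episode, num_states=3, num_actions=2)
```

Each (state, action) pair then corresponds to one row of a table, which is the shape that attention-based tabular models already consume.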

Rescaling MLM-Head for Neural Sparse Retrieval

2606.18811v1 by Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

摘要：學習稀疏檢索（LSR）模型，如SPLADE，傳統上使用BERT風格的掩碼語言模型作為主幹編碼器。自然的期望是，將BERT替換為更強大的預訓練編碼器應該能提高檢索效果。然而，我們發現，在標準的SPLADE訓練配方下，擁有較大MLM-head L2範數的主幹可能會遭遇性能下降，甚至發生訓練崩潰。我們將這一失敗識別為MLM head中的尺度不匹配：SPLADE直接使用MLM-head輸出來構建稀疏詞彙表示，查詢-文檔相關性是通過這些表示的未標準化點積計算的。因此，膨脹的MLM-head尺度可能會放大稀疏激活，扭曲匹配分數，並在常見的訓練設置下使對比訓練不穩定。為了解決這個問題，我們引入了一個簡單的初始化時修正，該修正在SPLADE訓練之前通過一個常數因子重新縮放MLM-head投影。這一零成本的調整改善了訓練穩定性，而不修改模型架構或訓練目標。在內域和外域的檢索基準中，這一簡單的修正顯著改善了大型範數主幹，如ModernBERT和Ettin，將不穩定的訓練過程轉變為具有競爭力的稀疏檢索器。在幾個設置中，修正後的模型進一步匹配或超越了經典的BERT-SPLADE基線。這些發現表明，將預訓練編碼器適應於LSR的瓶頸不僅僅是編碼器的容量，而是用於構建稀疏詞彙表示的MLM-head尺度的校準。
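
The effect of MLM-head scale on sparse activations can be seen in a toy example. The sketch below uses the SPLADE-style term weighting (max over tokens of log(1 + ReLU(logit))) and multiplies the head projection by a constant before computing it. The matrix values and the factor 0.25 are arbitrary illustrations; the abstract does not say how the authors choose the constant.

```python
import math

def mlm_logits(hidden, head_weights):
    """Project one token's hidden state onto the vocabulary (bias omitted for brevity)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head_weights]

def splade_activations(token_logits):
    """SPLADE-style term weights: max over tokens of log(1 + ReLU(logit))."""
    vocab_size = len(token_logits[0])
    return [max(math.log1p(max(logits[j], 0.0)) for logits in token_logits)
            for j in range(vocab_size)]

def rescale_head(head_weights, factor):
    """Initialisation-time correction: multiply the projection by a constant."""
    return [[factor * w for w in row] for row in head_weights]

head = [[4.0, 0.0], [0.0, 8.0], [-2.0, 1.0]]   # vocab size 3, hidden size 2
hidden_states = [[1.0, 0.5], [0.5, 1.0]]        # two tokens
factor = 0.25                                   # illustrative constant only

def term_weights(weights):
    return splade_activations([mlm_logits(h, weights) for h in hidden_states])

before = term_weights(head)
after = term_weights(rescale_head(head, factor))
# Unnormalised dot product of the text with itself, before and after rescaling.
score_before = sum(w * w for w in before)
score_after = sum(w * w for w in after)
```

Because relevance is an unnormalised dot product, shrinking the head scale directly shrinks the matching scores, which is the lever the correction relies on.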

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

2606.18810v1 by Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.

摘要：強化學習與可驗證獎勵（RLVR）在訓練大型語言模型（LLMs）以解決推理任務方面推動了顯著的進展，但代表性的方法如 GRPO 對所有標記分配均勻的信用，浪費了在常規標記上的梯度，同時對關鍵推理步驟的信用評估不足。現有的標記級信用分配方法需要超出模型自身回合的資源。GRPO 的變體依賴於過程獎勵模型或真實答案。知識蒸餾通過每個標記的偏差分配信用，但需要外部教師（在政策蒸餾）或特權信息（在政策自我蒸餾）。然而，這些依賴限制了在純 RLVR 設定中的適用性。我們觀察到，將模型條件化於其自身的驗證軌跡會在原始分佈和條件分佈之間產生可測量的每標記 KL 散度，並證明從由驗證軌跡構建的自我教師中進行蒸餾會導致在存在多個驗證軌跡時無法實現的加權平均解。我們提出了 SC-GRPO（自我條件化 GRPO），它使用前面提到的 KL 散度作為 GRPO 梯度的乘法權重。在跨越數學、代碼和代理任務的五個基準測試中，SC-GRPO 始終比 GRPO 高出 8.1%，比 DAPO 高出 5.9%，並且在 OOD 性能上更強。此外，SC-GRPO 的性能高於 OPD。
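
The idea of using a per-token KL divergence as a multiplicative weight can be sketched as follows. This is only a schematic reading of the abstract: the direction of the KL, the normalisation to mean 1, and the toy distributions are all assumptions, and the actual SC-GRPO objective may differ.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_weights(original_dists, conditioned_dists):
    """Per-token KL(conditioned || original), normalised so the weights average to 1."""
    kls = [kl_divergence(c, o) for o, c in zip(original_dists, conditioned_dists)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0.0:
        return [1.0] * len(kls)  # fall back to uniform credit
    return [kl / mean_kl for kl in kls]

def weighted_advantages(advantage, weights):
    """Scale a sequence-level advantage token by token."""
    return [advantage * w for w in weights]

# Hypothetical next-token distributions at three positions (vocabulary of size 2).
original = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]

weights = token_weights(original, conditioned)
token_advantages = weighted_advantages(1.0, weights)
```

Positions where conditioning on a verified solution changes the prediction most receive the largest share of the advantage, while positions it leaves unchanged receive none.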

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

2606.18803v1 by Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

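Summary: Bringing large language models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines are still dominated by structured numerical features, yet decisive behavioral signals (for example, a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. Scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints that are rarely addressed together: on a platform with millions of daily orders, the logs exceed any LLM's context window by orders of magnitude; most users are long-tail users with too few interactions for per-user profiling; and superficially fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that implements utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them with a lightweight downstream utility proxy, iteratively refines the best candidates, and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test, including +0.47% GMV, +0.33% completion rate, and -0.82% cancel-before-accept rate.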

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

2606.18801v1 by Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.

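Summary: With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has become a key technology for global information access. MLIR lets users retrieve semantically relevant documents from multilingual text collections with a single-language query. However, recent multilingual dense retrieval models often show a strong preference for documents written in the same language as the query. This causes severe language bias: top-ranked results are dominated by documents in specific languages even when documents in other languages are more semantically relevant. To address this, we propose SHIFT, a training-free method applied at the indexing stage. Specifically, SHIFT uses parallel translation pairs to estimate, for each target language, a relative language vector with respect to a source language. SHIFT then corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and a variety of dense retrieval models confirms that SHIFT effectively mitigates language bias and improves MLIR performance.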
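The index-side correction described for SHIFT can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code. The abstract does not say how the relative language vector is estimated, so the mean embedding difference over parallel pairs used here is an assumption, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def relative_language_vector(source_embs, target_embs):
    """Estimate a target language's offset relative to the source language.

    source_embs[i] and target_embs[i] are embeddings of the same sentence in
    the source and target language (a parallel translation pair). Averaging
    the pairwise differences is an assumption of this sketch.
    """
    diffs = [[t - s for s, t in zip(src, tgt)]
             for src, tgt in zip(source_embs, target_embs)]
    return mean_vector(diffs)


def shift_document(doc_emb, language_vector):
    """Subtract the language offset from a document embedding at indexing time."""
    return [d - v for d, v in zip(doc_emb, language_vector)]
```

Because the correction is applied once per document while building the index, query encoding and the retrieval model itself stay unchanged, which is what makes the approach training-free.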

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

2606.18797v1 by Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

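Summary: Reliable evaluation of radiology reports requires strict clinical accuracy, because omitting critical findings or mischaracterizing radiographic observations directly affects patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although large language models (LLMs) have rich medical knowledge, they also struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance in terms of detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators in one-pass and two-pass settings, we identify a widespread discrimination bias: models detect errors effectively but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more expensive two-pass setting does not consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest that one-pass trained metrics are the practical choice for cost-sensitive deployment, while two-pass inference is best reserved for settings where the D-R balance is critical. We will release the dataset and metric.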

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

2606.18788v1 by Jaward Sesay, Yue Yu, Börje F. Karlsson

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

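Summary: Teaching machines to emulate natural handwriting styles remains an open challenge, because it requires synthesizing stroke sequences that vary dynamically in shape, texture, pressure, and script, not only across individuals but also within a single person's handwriting. Attempts at this challenge have mostly explored deep learning methods in online and offline settings. However, these methods are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing style through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that synthesizes natural handwriting sequences directly in Scalable Vector Graphics (SVG) format without style-specific training. The agent uses a large reasoning model to geometrically analyze and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on text provided in conversational or non-conversational mode, together with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multilingual handwriting synthesis, and generation of complex handwritten math and science expressions show substantial performance gains, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models while providing a more efficient, controllable, and generalizable synthesis method.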

RedactionBench

2606.18782v1 by Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

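Summary: Large language models are increasingly applied in sensitive domains that require redaction of personally identifiable information (PII). Although redacting PII is a prerequisite for data cleaning, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, which fundamentally distinguishes redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark of 200 diverse documents across 11 domains, most of them seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and neutralizes shallow formatting choices, such as different masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity-extraction small language models, and frontier models equipped with agentic tools show that contextual redaction remains an unsolved problem. A human evaluation with more than 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators reach consensus with the target labels for mandatory redactions (89.4%) and safe text preservation (94.1%), but fail to agree on contextual redactions (47.7%). This variance shows the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluation.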

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

2606.18781v1 by Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

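Summary: Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding, before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while keeping the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields a lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.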
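The chunk-then-aggregate idea and the notion of evidence dilution can be illustrated as follows. This is a sketch under stated assumptions, not the DICE method itself. The abstract does not specify the aggregation rule or the exact EDI formula, so mean pooling and the "strongest chunk similarity minus document similarity" gap below are assumptions, and the function names are hypothetical. Chunk embeddings are passed in directly, so any frozen encoder could supply them.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate_chunks(chunk_vecs):
    """Collapse independently encoded chunks into one document vector (mean pooling)."""
    n = len(chunk_vecs)
    return [sum(column) / n for column in zip(*chunk_vecs)]


def evidence_gap(query_vec, doc_vec, chunk_vecs):
    """How far the document vector falls below the strongest single chunk.

    A positive value means some chunk matches the query better than the
    single document vector does. This gap is an assumed stand-in for EDI.
    """
    strongest = max(cosine(query_vec, c) for c in chunk_vecs)
    return strongest - cosine(query_vec, doc_vec)
```

The output of `aggregate_chunks` is still a single vector per document, so the standard one-query-one-document ranking interface is unchanged.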

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

2606.18780v1 by Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

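Summary: Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains limited by severe data scarcity. Although data augmentation is a promising remedy, existing methods are held back by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we propose Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA builds structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which combines a Universal Adapter for shared semantics with Task-Specific Adapters to generate diverse yet constraint-compliant text samples. For image synthesis, SAMA uses an Anchor-Preserving Diffusion mechanism with anchor-weighted prompts and latent conditioning to preserve critical semantic anchors while diversifying visual contexts. To remove the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on cross-modal consistency and anchor fidelity. Extensive experiments on benchmark datasets for MNER, MRE, and MEE show that SAMA consistently outperforms state-of-the-art augmentation baselines in both fully supervised and low-resource settings, highlighting its versatility, robustness, and effectiveness.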

Private Learning with Public Feature Conditioning

2606.18773v1 by Shuli Jiang, Walid Krichene, Nicolas Mayoraz

We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features -- common in applications such as recommendation and advertising systems. While such label-DP or semi-sensitive-feature settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for Cond-DP directly from public features, enabling provably faster convergence than DPSGD in private linear regression without incurring additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.

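Summary: We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features, which is common in applications such as recommendation and advertising systems. Although such label-DP or semi-sensitive-feature settings have mainly been explored for classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that exploits the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often have rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix directly from public features, enabling provably faster convergence than DPSGD in private linear regression at no additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines under label DP across a wide range of datasets and model architectures, showing strong and robust performance in practice.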
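A conditioned DPSGD update of the kind the Cond-DP abstract describes might look like the sketch below. This is an assumption-laden illustration rather than the paper's algorithm. The abstract does not state where the conditioning matrix enters the update, so applying it to the already clipped and noised average gradient is an assumption here, and all names are hypothetical. With an identity conditioning matrix the step reduces to a plain DPSGD step, which matches the special case the abstract mentions.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def conditioned_dp_step(weights, per_example_grads, conditioner,
                        lr, clip_norm, noise_multiplier, rng):
    """One noisy, clipped gradient step, preconditioned by `conditioner`.

    `conditioner` is assumed to be computed from public features only, so
    multiplying the privatized gradient by it is post-processing.
    """
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    n = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise calibrated to the clipping norm, as in standard DPSGD.
    noisy_mean = [(s + rng.gauss(0.0, noise_multiplier * clip_norm)) / n
                  for s in summed]
    direction = matvec(conditioner, noisy_mean)
    return [w - lr * d for w, d in zip(weights, direction)]
```

A typical call would pass `rng=random.Random(seed)` so that the noise draw is reproducible in experiments.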

Output Vector Editing for Memorization Mitigation in Large Language Models

2606.18767v1 by Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7x gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at about 14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

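Summary: Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we focus on OLMo-7B, whose open weights and pretraining corpus make systematic mining possible, and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7x gap over zero ablation on the same located neurons shows that the suppression comes from the output-vector edit and not from localization alone. Four edit modes span a range from aggressive suppression to minimal redirection; as an ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary: about 14% of sequences cannot be reached by MLP-only editing. Although these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, which positions attention as a complementary fallback rather than a primary mechanism. The ordering of edit modes and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than model family.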

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

2606.18747v1 by Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

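Summary: Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (for example, pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, gesture generation remains challenging because it relies on expert-authored animations, which leads to rigid behaviors that are impractical in dynamic and diverse environments. Machine learning approaches, in turn, often struggle to capture perceived naturalness, and the difficulty grows with the number of degrees of freedom. Producing expressive robot gestures therefore requires a system that adapts to the environment while respecting social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, opening new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. Although this baseline makes gesture generation flexible, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that fine-tunes gesture generation based on user evaluations, using an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generation capabilities, producing more expressive, relevant, and fluid movements.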

What Must Generalist Agents Remember?

2606.18746v1 by Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

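Summary: This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely on current state observations alone and must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.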

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

2606.18733v1 by Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.

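Summary: Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, which makes them prone to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only repository evidence from before $T_0$ to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: once forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In a study of 80 repositories, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use the validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset spanning 61 repositories from a task-generation snapshot, instead of replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.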

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

2606.18728v1 by Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

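Summary: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and earlier legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, and a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. A total of 18,992 ratings from 217 evaluators with legal backgrounds confirm that LegalWorld trajectories are procedurally faithful and role-consistent, and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.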

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

2606.18726v1 by Fang Wang, Ernesto Damiani

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.

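Summary: Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge takes the form of full event sequence generation, whereas existing work mainly addresses component tasks such as next-activity, remaining-time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self-attention, and injects process topology through graph-grounded cross-attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event-level and sequence-level attributes in a single pass, followed by Viterbi-style graph-constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than locally run, instruction-prompted LLM baselines. GGATN performs strongly on sequence similarity, Damerau-Levenshtein similarity, bigram-based control-flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence-level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.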
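The "Viterbi-style graph-constrained decoding" step can be illustrated with a generic dynamic program over per-position activity scores. This is a textbook-style sketch, not GGATN's decoder. The input format (one score dictionary per position, a successor-set dictionary for the process graph, and a designated end activity) is an assumption made for the example.

```python
def constrained_viterbi(scores, allowed, end_activity):
    """Highest-scoring activity path that only uses transitions present in the graph.

    scores[t] maps each candidate activity at position t to a log-score.
    allowed[a] is the set of activities that may directly follow a.
    Returns the best feasible path ending in end_activity, or None if
    no feasible path exists.
    """
    best = [dict(scores[0])]   # best[t][a]: best total score of a path ending in a at t
    back = [{}]                # back[t][a]: predecessor of a on that best path
    for t in range(1, len(scores)):
        best.append({})
        back.append({})
        for activity, score in scores[t].items():
            candidates = [(best[t - 1][prev], prev) for prev in best[t - 1]
                          if activity in allowed.get(prev, set())]
            if candidates:
                prev_score, prev = max(candidates)
                best[t][activity] = prev_score + score
                back[t][activity] = prev
    if end_activity not in best[-1]:
        return None
    # Follow back-pointers from the explicit end activity to the start.
    path = [end_activity]
    for t in range(len(scores) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Because infeasible transitions are never entered into the table, the decoded path cannot contain a step that the process graph forbids, even when an unconstrained per-position argmax would prefer it.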

Medical explainable AI

Publish Date	Title	Authors	Homepage	Code
2026-06-17	Explaining Attention with Program Synthesis	Amiri Hayes et.al.	2606.19317v1	null
2026-06-17	A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers	Keran Wang et.al.	2606.19247v1	null
2026-06-17	The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot	Anselm Haak et.al.	2606.19197v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et.al.	2606.19174v1	null
2026-06-17	Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction	Jingyi Zhou et.al.	2606.19144v1	null
2026-06-17	Analysing drivers and interdependencies in European electricity markets using XAI	Antoine Pesenti et.al.	2606.19118v1	null
2026-06-17	APT: Atomic Physical Transitions for Causal Video-Language Understanding	Shang Wu et.al.	2606.18586v1	null
2026-06-17	DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models	Patrick Cooper et.al.	2606.18557v1	null
2026-06-16	PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization	Arshia Ilaty et.al.	2606.18518v1	null
2026-06-16	From Specification to Execution: AI Assisted Scientific Workflow Management	Komal Thareja et.al.	2606.18425v1	null
2026-06-16	RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills	Weizhi Zhang et.al.	2606.18203v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications	Divyansh Srivastava et.al.	2606.18068v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation	Ido Nitzan Hidekel et.al.	2606.18024v1	null
2026-06-16	LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling	Jian Yang et.al.	2606.18023v1	null
2026-06-16	A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease	Antonio Scardace et.al.	2606.17867v1	null
2026-06-16	Conservation Laws for Modern Neural Architectures	Viet-Hoang Tran et.al.	2606.17816v1	null
2026-06-16	EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent	Zeyao Du et.al.	2606.17698v1	null
2026-06-16	From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs	Siyue Chen et.al.	2606.17648v1	null
2026-06-16	SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches	Wencan Zhang et.al.	2606.17646v1	null
2026-06-16	Offline Preference-Based Trajectory Evaluation	Fernando Diaz et.al.	2606.17541v1	null
2026-06-16	Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing	Kexin Chen et.al.	2606.17478v1	null
2026-06-16	Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation	Xinyu Qin et.al.	2606.17405v1	null
2026-06-15	SpeechDx: A Multi-Task Benchmark for Clinical Speech AI	Sejal Bhalla et.al.	2606.17339v1	null
2026-06-15	Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data	Kareem Amin et.al.	2606.16952v1	null
2026-06-15	Demystifying Variance in Circuit Discovery of LLMs	Frank Zhengqing Wu et.al.	2606.16920v1	null
2026-06-15	Symbolic Informalization: Fluent, Productive, Multilingual	Aarne Ranta et.al.	2606.16893v1	null
2026-06-15	Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering	Sanjay Basu et.al.	2606.16890v1	null
2026-06-15	Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies	Ke Liu et.al.	2606.16721v1	null
2026-06-15	Is Your Trajectory Displacement Safe in Long-tail?	Qiao Sun et.al.	2606.16313v1	null
2026-06-15	PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents	Zhenbang Du et.al.	2606.16215v1	null
2026-06-15	Embedded Arena: Iterative Optimization via Hardware Feedback	Zhihan Zhang et.al.	2606.16190v1	null
2026-06-15	LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis	Minh-Ha Nguyen et.al.	2606.16149v1	null
2026-06-15	XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models	Yupei Li et.al.	2606.16137v1	null
2026-06-14	SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity	Yifan Mo et.al.	2606.16003v1	null
2026-06-14	Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking	Utshab Kumar Ghosh et.al.	2606.15998v1	null
2026-06-14	DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts	Zijian Carl Ma et.al.	2606.15931v1	null
2026-06-14	DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing	Artyom Mazur et.al.	2606.15796v1	null
2026-06-14	Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning	Dong Hyun Jeong et.al.	2606.15767v1	null
2026-06-14	InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset	Zhenyu Yu et.al.	2606.15730v1	null
2026-06-14	AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan	Mohammed Fasha et.al.	2606.15709v1	null
2026-06-14	Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models	Aoyu Zhang et.al.	2606.17102v1	null
2026-06-14	Is Code Better Than Language for Algorithmic Reasoning	Terry Tong et.al.	2606.15589v1	null
2026-06-14	Service-Induced Congestion in Memory-Constrained LLM Serving	Ruicheng Ao et.al.	2606.15555v1	null
2026-06-13	Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering	Zaifu Zhan et.al.	2606.15419v1	null
2026-06-13	APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents	Ya-Chuan Chen et.al.	2606.15363v1	null
2026-06-13	Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes	Mohamed Bayan Kmainasi et.al.	2606.15307v1	null
2026-06-13	Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings	Weihao Gao et.al.	2606.15176v1	null
2026-06-12	CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification	Rafi Ahamed et.al.	2606.14686v1	null
2026-06-12	A Definition of Good Explanations and the Challenges Explaining LLM Outputs	Louis Mahon et.al.	2606.14838v1	null
2026-06-12	Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models	Ravi Ranjan et.al.	2606.14647v1	null
2026-06-12	Fodor and Pylyshyn's Systematicity Challenge Still Stands	Michael Goodale et.al.	2606.14512v1	null
2026-06-12	Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport	Paula Joy B. Martinez et.al.	2606.14157v1	null
2026-06-12	Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage	Xiaoran Yan et.al.	2606.14123v1	null
2026-06-11	How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?	Julia Romero et.al.	2606.13896v1	null
2026-06-11	Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography	Louis Chen et.al.	2606.13839v1	null
2026-06-11	ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages	Tanmoy Kanti Halder et.al.	2606.13572v1	null
2026-06-11	Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation	Aruna Dey et.al.	2606.13556v2	null
2026-06-11	Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video	Abubakar Hamisu Kamagata et.al.	2606.13302v1	null
2026-06-11	Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints	Omar Alshahrani et.al.	2606.13211v1	null
2026-06-11	Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation	Elena S. Kozachok et.al.	2606.13135v1	null
2026-06-11	Zero-source LLM Hallucination Detection with Human-like Criteria Probing	Jiahao Yang et.al.	2606.12900v1	null
2026-06-11	PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent	Junfeng Guo Heng Huang et.al.	2606.12896v1	null
2026-06-11	Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata	Daniel Soliman et.al.	2606.12824v1	null
2026-06-10	LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data	Yifan Gao et.al.	2606.12699v1	null
2026-06-10	Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy	Kai Standvoss et.al.	2606.12346v1	null
2026-06-10	Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification	Veerendhra Kumar Dangeti et.al.	2606.12252v1	null
2026-06-10	Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization	Xinhai Zou et.al.	2606.12251v1	null
2026-06-10	Towards Responsibly Non-Compliant Machines	Marija Slavkovik et.al.	2606.12147v1	null
2026-06-10	Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders	Gleb Gerasimov et.al.	2606.12138v1	null
2026-06-10	Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation	Minh-Khoi Pham et.al.	2606.12006v1	null
2026-06-10	Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability	Kuo-En Hung et.al.	2606.11930v2	null
2026-06-10	Beyond representational alignment with brain-guided language models for robust reasoning	Mingqing Xiao et.al.	2606.11893v1	null
2026-06-10	Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task	Qianyu Yao et.al.	2606.11830v1	null
2026-06-10	Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data	Boris-Stephan Rauchmann et.al.	2606.11794v1	null
2026-06-10	Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF	Bastien Dussard et.al.	2606.17073v1	null
2026-06-10	MedCTA: A Benchmark for Clinical Tool Agents	Tajamul Ashraf et.al.	2606.11702v1	null
2026-06-10	Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics	Igor Itkin et.al.	2606.12476v2	null
2026-06-09	Can AI Agents Synthesize Scientific Conclusions?	Hayoung Jung et.al.	2606.11337v1	null
2026-06-09	Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News	Pooja Prajod et.al.	2606.11116v1	null
2026-06-09	FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model	Mahmood Alzubaidi et.al.	2606.11106v1	null
2026-06-09	Superficial Beliefs in LLM Decision-Making	Gabriel Freedman et.al.	2606.11016v1	null
2026-06-09	Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill	Junchang Zheng et.al.	2606.11007v1	null
2026-06-09	Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions	Kiarash Rezaei et.al.	2606.10942v1	null
2026-06-09	What Do Deepfake Speech Detectors Actually Hear?	Vojtěch Staněk et.al.	2606.10912v1	null
2026-06-09	Accelerating NeurASP with vectorization and caching	Alexander Philipp Rader et.al.	2606.10787v1	null
2026-06-09	From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning	Huong Nguyen et.al.	2606.10595v1	null
2026-06-09	Towards Critical Branching Mechanism in Recurrent Neural Networks	Feixiang Ren et.al.	2606.10384v1	null
2026-06-09	Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction	Buxin Su et.al.	2606.10279v1	null
2026-06-08	Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community	Lin Li et.al.	2606.10159v1	null
2026-06-08	XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems	Hamza Riaz et.al.	2606.14766v1	null
2026-06-08	Hybrid Robustness Verification for Spatio-Temporal Neural Networks	Sherwin Varghese et.al.	2606.09746v1	null
2026-06-08	Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery	Suraj Biswas et.al.	2606.09672v1	null
2026-06-08	Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data	Yinyu Huang et.al.	2606.09671v1	null
2026-06-08	Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions	Tom Beyer et.al.	2606.09568v1	null
2026-06-08	Capacity, Not Format: Rethinking Structured Reasoning Failures	Hengxin Fan et.al.	2606.09410v1	null
2026-06-08	Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability	Emirhan Bilgiç et.al.	2606.14758v1	null
2026-06-08	TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs	Hyeongwon Jang et.al.	2606.09030v1	null
2026-06-08	Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin	Hanyang Li et.al.	2606.09012v1	null

Abstracts

Explaining Attention with Program Synthesis

2606.19317v1 by Amiri Hayes, Belinda Li, Jacob Andreas

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank the programs according to how well the final set predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
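
The abstract does not say how the Intersection-over-Union similarity is computed, so the following is only a minimal sketch of one plausible reading: both attention matrices are binarized at a threshold and the overlapping cells are compared. The `previous_token_program` surrogate, the `threshold` value, and the toy `observed` matrix are all illustrative assumptions, not the authors' code.

```python
from typing import List

Matrix = List[List[float]]

def previous_token_program(tokens: List[str]) -> Matrix:
    """Hypothetical surrogate program: each token attends to the token before it
    (the first token attends to itself)."""
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pattern[i][max(i - 1, 0)] = 1.0
    return pattern

def iou(predicted: Matrix, observed: Matrix, threshold: float = 0.5) -> float:
    """Intersection-over-Union of the cells whose weight reaches `threshold`.
    The thresholding rule is an assumption, not taken from the paper."""
    intersection = union = 0
    for pred_row, obs_row in zip(predicted, observed):
        for p, o in zip(pred_row, obs_row):
            p_on, o_on = p >= threshold, o >= threshold
            intersection += p_on and o_on
            union += p_on or o_on
    return intersection / union if union else 1.0

tokens = ["the", "cat", "sat"]
# Toy attention weights standing in for a real head's output.
observed = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
]
score = iou(previous_token_program(tokens), observed)
```

Here the surrogate matches the head on the first two rows and misses on the third, so two of the four active cells overlap.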

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

2606.19247v1 by Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers -- also referred to as the "invisible second patients" -- experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

2606.19197v1 by Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

Abduction is a central approach to explaining missing entailments from a knowledge base: it provides a hypothesis that, if added to the knowledge base, would make the missing entailment hold. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature-restrictions and minimality in size and of introduced conflicts. Naturally, hypotheses that satisfy more than one of these properties or combine a property with an optimality criterion would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for EL_bot under brave and AR semantics. Our main observation is that requiring additional properties of hypotheses often does not increase complexity.

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $\tau$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub: https://github.com/13204942/SonoRate.
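
To make the agreement statistics concrete, here is a minimal pure-Python sketch of Kendall's $\tau$ (the tau-a variant, assuming no tied ranks) and a top-1 agreement rate. The function names and the example rankings are hypothetical. The pipeline's own implementation is not described in the abstract and may well use a statistics library instead.

```python
from itertools import combinations
from typing import Sequence

def kendall_tau(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Kendall's tau-a for two rankings of the same items, assuming no ties."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        direction = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

def top1_agreement(rankings: Sequence[Sequence[int]]) -> float:
    """Fraction of raters whose rank-1 model is the most frequently chosen rank-1 model."""
    winners = [list(r).index(1) for r in rankings]
    most_common = max(set(winners), key=winners.count)
    return winners.count(most_common) / len(winners)

# Hypothetical ranks (1 = best) that raters gave to four blinded models.
expert = [1, 2, 3, 4]
generalist = [2, 1, 3, 4]
non_expert = [1, 3, 2, 4]

tau = kendall_tau(expert, generalist)
agreement = top1_agreement([expert, generalist, non_expert])
```

With four models there are six pairs. The expert and generalist disagree on one pair only, so five are concordant and one discordant.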

2606.19144v1 by Jingyi Zhou, Senlin Luo, Haofan Chen

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

Analysing drivers and interdependencies in European electricity markets using XAI

2606.19118v1 by Antoine Pesenti, Aidan O'Sullivan

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.
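
As an illustration of what a SHAP attribution measures, the sketch below computes exact Shapley values by enumerating feature orderings against a fixed baseline. This brute-force enumeration is not the SHAP library or the SSHAP aggregation used in the paper. It is only feasible for a handful of features, and the `toy_price` model and its coefficients are invented for the example.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict

def shapley_values(model: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features switch from baseline to instance."""
    features = list(instance)
    totals = {name: 0.0 for name in features}
    for order in permutations(features):
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    n_orderings = factorial(len(features))
    return {name: total / n_orderings for name, total in totals.items()}

def toy_price(x: Dict[str, float]) -> float:
    # Hypothetical price model: gas raises the price, solar lowers it,
    # and the interaction term makes solar matter more when gas is expensive.
    return 2.0 * x["gas"] - 1.0 * x["solar"] - 0.5 * x["gas"] * x["solar"]

phi = shapley_values(
    toy_price,
    instance={"gas": 4.0, "solar": 2.0},
    baseline={"gas": 0.0, "solar": 0.0},
)
```

The attributions sum to the difference between the model output at the instance and at the baseline, which is the additivity property that makes SHAP values convenient to aggregate across features or zones.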

APT: Atomic Physical Transitions for Causal Video-Language Understanding

2606.18586v1 by Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

2606.18557v1 by Patrick Cooper, Alvaro Velasquez

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.
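
The abstract describes rendering-robust evaluation only as the "worst case over four surface renderings". One plausible reading, sketched below as an assumption, is that an instance counts as solved only if the model answers it correctly under every rendering. The rendering names and outcomes are hypothetical.

```python
from typing import Dict, List

def rendering_robust_accuracy(results: Dict[str, List[bool]]) -> float:
    """An instance counts as correct only if it is correct under every rendering."""
    per_rendering = list(results.values())
    if not per_rendering:
        raise ValueError("need at least one rendering")
    n_instances = len(per_rendering[0])
    if n_instances == 0 or any(len(r) != n_instances for r in per_rendering):
        raise ValueError("each rendering needs the same non-zero number of instances")
    robust = sum(all(outcomes) for outcomes in zip(*per_rendering))
    return robust / n_instances

# Hypothetical per-instance correctness for four surface renderings of four instances.
results = {
    "prose":  [True, True,  False, True],
    "formal": [True, False, False, True],
    "table":  [True, True,  True,  True],
    "code":   [True, True,  False, False],
}
robust_accuracy = rendering_robust_accuracy(results)
best_single = max(sum(r) / len(r) for r in results.values())
```

Under this reading the robust score can sit well below the best single-rendering score, which matches the kind of gap the abstract reports between 65% and 23.5%.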

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

2606.18518v1 by Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.
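
The Augmented Lagrangian Method named above can be illustrated on a one-dimensional toy problem. In this sketch, `f(x) = (x - 3)^2` stands in for a utility loss and `g(x) = x - 1 <= 0` for a privacy constraint. Both functions, the penalty `rho`, and the step sizes are assumptions chosen for the example and have nothing to do with PSyGenTAB's actual objective or constraints.

```python
def augmented_lagrangian(rho: float = 10.0, outer_steps: int = 20,
                         inner_steps: int = 200, lr: float = 0.05):
    """Minimize f(x) = (x - 3)^2 subject to g(x) = x - 1 <= 0."""
    x, lam = 0.0, 0.0
    for _ in range(outer_steps):
        # Inner loop: gradient descent on the augmented Lagrangian
        # f(x) + (max(0, lam + rho * g(x))^2 - lam^2) / (2 * rho).
        for _ in range(inner_steps):
            active = max(0.0, lam + rho * (x - 1.0))
            grad = 2.0 * (x - 3.0) + active
            x -= lr * grad
        # Outer loop: multiplier update, clipped at zero so the
        # inequality multiplier stays non-negative.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam

x_opt, multiplier = augmented_lagrangian()
```

The unconstrained minimum is at `x = 3`, but the constraint holds the solution at `x = 1`. The multiplier converges to 4, the value at which the gradient of the loss is exactly balanced by the constraint.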

From Specification to Execution: AI Assisted Scientific Workflow Management

2606.18425v1 by Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
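
The idea of validating a workflow specification before any code is generated can be sketched with a dependency check. The example below assumes a spec is just a mapping from job name to the jobs it depends on. The job names and `validate_spec` are hypothetical, and the sketch does not use Pegasus or MCP APIs.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

def validate_spec(spec: Dict[str, Set[str]]) -> List[str]:
    """Return a valid execution order, or raise ValueError if the spec is malformed."""
    unknown = {dep for deps in spec.values() for dep in deps} - set(spec)
    if unknown:
        raise ValueError(f"undeclared dependencies: {sorted(unknown)}")
    try:
        # TopologicalSorter expects node -> predecessors, which is how the spec is written.
        return list(TopologicalSorter(spec).static_order())
    except CycleError as err:
        raise ValueError(f"dependency cycle: {err.args[1]}") from err

# Hypothetical one-round federated learning spec: two sites train in parallel,
# then their updates are aggregated.
spec = {
    "init_model": set(),
    "train_site_a": {"init_model"},
    "train_site_b": {"init_model"},
    "aggregate": {"train_site_a", "train_site_b"},
}
order = validate_spec(spec)
```

Rejecting cycles and undeclared jobs at the specification stage means such errors surface before a workflow with thousands of jobs is ever submitted.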

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

2606.18203v1 by Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
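
The routing-then-scoring idea can be sketched roughly as follows. This is a minimal illustration, not the RubricsTree implementation: the topic-overlap router stands in for the paper's context-aware router, and the rubric names, tags, and weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    name: str
    topics: frozenset  # hypothetical routing tags, not from the paper
    weight: float      # hypothetical weight; the paper derives weights automatically


def route(rubrics, query_topics):
    """Keep only rubrics whose tags overlap the query's topics."""
    return [r for r in rubrics if r.topics & query_topics]


def score(active, verdicts):
    """Weighted fraction of active rubrics judged True (missing verdicts count as False)."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    return sum(r.weight for r in active if verdicts.get(r.name, False)) / total


RUBRICS = [
    Rubric("cites_user_sleep_data", frozenset({"sleep"}), 2.0),
    Rubric("flags_red_flag_symptom", frozenset({"cardio"}), 3.0),
    Rubric("avoids_definitive_diagnosis", frozenset({"sleep", "cardio"}), 1.0),
]

# A sleep-related query activates two of the three rubrics.
active = route(RUBRICS, frozenset({"sleep"}))
result = score(active, {"cites_user_sleep_data": True,
                        "avoids_definitive_diagnosis": False})
```

Because each rubric is a Boolean check, a low score can be traced back to the specific rubrics that failed, which is what makes this style of evaluation auditable.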

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

2606.18068v1 by Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
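
The two gates can be sketched as follows. This is only an illustration under stated assumptions: the K sampled diagnoses are assumed to be already grouped into meaning clusters (the abstract does not say how), and the entropy threshold of 0.7 nats and the action names are hypothetical.

```python
import math
from collections import Counter

OLDCARTS = ("onset", "location", "duration", "character",
            "aggravating_alleviating", "radiation", "timing", "severity")


def completeness(collected):
    """Fraction of OLDCARTS dimensions that have a non-empty value."""
    filled = sum(1 for d in OLDCARTS if collected.get(d))
    return filled / len(OLDCARTS)


def semantic_entropy(cluster_labels):
    """Shannon entropy (nats) of the empirical distribution over meaning clusters."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def gate(collected, cluster_labels, max_entropy=0.7):
    """Deterministic routing: history first, then the uncertainty check."""
    if completeness(collected) < 1.0:
        return "ask_more"   # block the diagnostic transition
    if semantic_entropy(cluster_labels) > max_entropy:
        return "escalate"   # samples diverge: intercept before delivery
    return "deliver"


full_history = {d: "recorded" for d in OLDCARTS}
decision = gate(full_history, ["mi", "mi", "mi", "mi", "gerd"])
```

With K=5 samples, four agreeing diagnoses give an entropy of about 0.50 nats, while a 3/1/1 split gives about 0.95 nats, so a threshold between those values separates the two cases in this toy setup.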

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.
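
To make "deterministic feature extraction code" concrete, here is a hand-written sketch of the kind of function such a pipeline might produce. It is not taken from ScaFE: the redness feature, the function names, and the idea of comparing a scar region against surrounding skin are illustrative assumptions, loosely motivated by the vascularity item of the Vancouver Scar Scale.

```python
def redness_index(pixels):
    """Mean of r / (r + g + b) over (r, g, b) pixels, skipping pure-black pixels."""
    ratios = [r / (r + g + b) for r, g, b in pixels if r + g + b > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def relative_redness(scar_pixels, skin_pixels):
    """How much redder the scar region is than the surrounding skin (hypothetical feature)."""
    return redness_index(scar_pixels) - redness_index(skin_pixels)


# Toy pixel lists; in practice these would come from a segmented image.
scar = [(200, 50, 50), (180, 60, 60)]
skin = [(100, 100, 100), (120, 120, 120)]
feature = relative_redness(scar, skin)
```

A handful of scalar features like this one can then feed a small classifier, which is where the data-efficiency and interpretability claims come from: the images stay local and each input to the classifier has a clinical reading.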

Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

2606.18024v1 by Ido Nitzan Hidekel, Dan Raviv

Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
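
The claim that the predictor is exact for a frozen backbone with a linear head can be checked on a toy example. The sketch below covers only a single gradient step on a squared loss, which is a simplifying assumption of this example and not the paper's general formula: for f(x) = φ(x)·θ, one step with learning rate η changes old-task predictions by -η K_old,new r_new, where K_old,new = Φ_old Φ_newᵀ and r_new is the new-task residual. All numbers are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_forgetting(phi_old, phi_new, residual_new, lr):
    """Kernel-side prediction, computed before any update: -lr * K_old,new @ r_new."""
    return [-lr * sum(dot(po, pn) * r for pn, r in zip(phi_new, residual_new))
            for po in phi_old]


def actual_forgetting(theta, phi_old, phi_new, y_new, lr):
    """Take one gradient step on 0.5 * sum((phi . theta - y)^2) and measure old-task drift."""
    residual = [dot(p, theta) - y for p, y in zip(phi_new, y_new)]
    grad = [sum(r * p[k] for p, r in zip(phi_new, residual))
            for k in range(len(theta))]
    theta_after = [t - lr * g for t, g in zip(theta, grad)]
    return [dot(p, theta_after) - dot(p, theta) for p in phi_old]


theta = [0.5, -1.0, 2.0]                       # linear-head parameters
phi_old = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0]]  # frozen features, old task
phi_new = [[1.0, 1.0, 0.0], [0.5, -0.5, 1.0]]  # frozen features, new task
y_new = [1.0, 0.0]
lr = 0.1

residual_new = [dot(p, theta) - y for p, y in zip(phi_new, y_new)]
predicted = predict_forgetting(phi_old, phi_new, residual_new, lr)
actual = actual_forgetting(theta, phi_old, phi_new, y_new, lr)
```

The two vectors agree up to floating-point error because the model is linear in θ. With nonlinear adapters the same expression would only hold to first order, which matches the abstract's description of a local NTK approximation.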

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

2606.18023v1 by Jian Yang, Shawn Guo, Wei Zhang, Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu, Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao, Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain-cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

2606.17867v1 by Antonio Scardace, Daniele Ravì

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research, which aim to integrate molecular, structural, clinical, and genetic biomarkers to enhance disease characterization, the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

Conservation Laws for Modern Neural Architectures

2606.17816v1 by Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen

Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
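
For readers unfamiliar with conservation laws in gradient flow, the classical linear-network case (not one of this paper's new results) is easy to check numerically. For a scalar two-layer model f(x) = a·b·x with squared loss, gradient flow gives da/dt = -g·b and db/dt = -g·a with g = dL/d(ab), so d(a² - b²)/dt = -2abg + 2abg = 0. The sketch below uses discrete gradient descent, where each step multiplies a² - b² by (1 - η²g²), so the quantity is conserved only up to O(η²) drift. The toy values are arbitrary.

```python
def train(a, b, x, y, lr, steps):
    """Gradient descent on 0.5 * (a*b*x - y)**2; returns final weights and the a^2 - b^2 trajectory."""
    invariants = [a * a - b * b]
    for _ in range(steps):
        g = (a * b * x - y) * x                  # dL/d(ab)
        a, b = a - lr * g * b, b - lr * g * a    # simultaneous update of both layers
        invariants.append(a * a - b * b)
    return a, b, invariants


a, b, inv = train(a=1.5, b=0.5, x=1.0, y=2.0, lr=1e-3, steps=5000)
drift = max(abs(v - inv[0]) for v in inv)
```

The loss is driven close to zero while a² - b² stays near its initial value of 2.0. In other words, the training dynamics select a particular solution among the many (a, b) pairs with a·b = 2, which is the link to implicit bias mentioned in the abstract.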

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

2606.17698v1 by Zeyao Du, Tong Li, Haibo Zhang

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.
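
Source-tagged grading can be pictured with a small data structure. This is a hypothetical sketch, not the benchmark's schema: the three source names follow the abstract (query, profile, clarification), but the class, the field names, and the example requirements are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    text: str
    source: str       # "query", "profile", or "clarification"
    satisfied: bool   # verdict for the product the agent committed to


def satisfaction_by_source(requirements):
    """Fraction of satisfied requirements per source."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in requirements:
        totals[r.source] += 1
        hits[r.source] += int(r.satisfied)
    return {s: hits[s] / totals[s] for s in totals}


def missed(requirements):
    """Attribute each failure to a requirement and the source it came from."""
    return [(r.source, r.text) for r in requirements if not r.satisfied]


task = [
    Requirement("price under budget", "query", True),
    Requirement("fragrance-free", "profile", True),
    Requirement("vegan materials", "profile", False),
    Requirement("fits carry-on luggage", "clarification", False),
]
rates = satisfaction_by_source(task)
failures = missed(task)
```

Aggregating `rates` over many tasks would give the kind of visible-versus-hidden comparison the abstract reports, while `failures` explains an individual wrong choice.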

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

2606.17648v1 by Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

2606.17646v1 by Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive: coherent with user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. Evaluating on face expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

Offline Preference-Based Trajectory Evaluation

2606.17541v1 by Fernando Diaz

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties. This creates substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
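
A toy comparison shows why terminal-success metrics tie so often. The sketch assumes each trajectory is summarized as a per-step progress curve in [0, 1] of equal length, with 1.0 meaning success, and uses pointwise dominance as the preference. That dominance rule is an assumption made here for illustration; the paper's preference definition may differ.

```python
def success_preference(a, b):
    """Compare only the terminal outcome: +1 if a wins, -1 if b wins, 0 for a tie."""
    return int(a[-1] >= 1.0) - int(b[-1] >= 1.0)


def trajectory_preference(a, b):
    """Pointwise dominance over equal-length progress curves; crossing curves tie."""
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and not b_ahead:
        return 1
    if b_ahead and not a_ahead:
        return -1
    return 0


fast = [0.5, 1.0, 1.0]       # succeeds at step 2
slow = [0.2, 0.6, 1.0]       # succeeds at step 3
stalled = [0.0, 0.1, 0.1]    # fails with little progress
partial = [0.0, 0.3, 0.6]    # fails but gets most of the way

pairs = [(fast, slow), (partial, stalled)]
success_ties = sum(success_preference(a, b) == 0 for a, b in pairs)
trajectory_ties = sum(trajectory_preference(a, b) == 0 for a, b in pairs)
```

Both pairs tie under the success metric (two successes, two failures), yet the trajectory-aware comparison separates them. Fewer ties means more informative comparisons per benchmark instance.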

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

2606.17478v1 by Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
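
The "simple threshold ensembles" mentioned above could look like the following sketch, which flags an example when any monitor's score crosses its threshold. The monitor names, scores, and thresholds are made-up toy values, and the OR rule is one plausible reading of the abstract, not a confirmed detail of the paper.

```python
def or_ensemble(scores, thresholds):
    """Flag an example when any listed monitor's score reaches its threshold."""
    return any(scores[m] >= t for m, t in thresholds.items())


def count_missed(examples, thresholds):
    """Count deceptive examples that none of the listed monitors flags."""
    return sum(1 for ex in examples
               if ex["deceptive"] and not or_ensemble(ex["scores"], thresholds))


examples = [
    {"deceptive": True,  "scores": {"text": 0.9, "activation": 0.2}},
    {"deceptive": True,  "scores": {"text": 0.3, "activation": 0.8}},
    {"deceptive": True,  "scores": {"text": 0.1, "activation": 0.1}},
    {"deceptive": False, "scores": {"text": 0.2, "activation": 0.1}},
]

missed_text_only = count_missed(examples, {"text": 0.5})
missed_combined = count_missed(examples, {"text": 0.5, "activation": 0.5})
```

An OR ensemble can only lower the miss count relative to any single member, at the price of potentially more false positives, so thresholds would normally be tuned on held-out data.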

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

2606.17405v1 by Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.
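
The safety layer described above (a rule-based block plus escalation on model disagreement) might be sketched as below. Everything here is hypothetical: the function name, the use of the population standard deviation across model estimates as the disagreement measure, and the 0.2 limit are illustrative assumptions rather than the paper's design.

```python
from statistics import pstdev


def recommend(candidates, contraindicated, model_scores, disagreement_limit=0.2):
    """Return (treatment, route), where route is 'auto' or 'clinician_review'."""
    # Rule-based gate: contraindicated treatments are never considered.
    allowed = [t for t in candidates if t not in contraindicated]
    if not allowed:
        return None, "clinician_review"
    # Pick the treatment with the highest mean estimated benefit across models.
    best = max(allowed, key=lambda t: sum(model_scores[t]) / len(model_scores[t]))
    # Strong disagreement between models sends the case to a clinician.
    if pstdev(model_scores[best]) > disagreement_limit:
        return best, "clinician_review"
    return best, "auto"


scores = {"A": [0.90, 0.90], "B": [0.60, 0.62], "C": [0.45, 0.55]}
choice = recommend(["A", "B", "C"], contraindicated={"A"}, model_scores=scores)
```

Treatment A has the best scores but is blocked by the rule, so the sketch falls back to B and routes it automatically because the two estimates agree closely.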

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

2606.17339v1 by Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

2606.16952v1 by Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between "true disclosures", where the system directly reproduces a user's information, and "phantom disclosures", where the system incidentally generates a user's data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training: only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.
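
The train-versus-holdout comparison can be illustrated with an exact binomial test. This is a simplified stand-in for the paper's procedure, under assumptions stated here: records are split at random into training and holdout sets, a "hit" is a synthetic record matching a real one, and under the zero-learning baseline each hit is equally likely to land on any real record, so the number of training hits given the total follows a binomial with p = n_train / (n_train + n_holdout). Function names and the significance level are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def audit(train_hits, holdout_hits, n_train, n_holdout, alpha=0.05):
    """One-sided test: are training records disclosed more often than holdout records?"""
    total = train_hits + holdout_hits
    p_null = n_train / (n_train + n_holdout)   # zero-learning baseline
    p_value = binomial_tail(train_hits, total, p_null)
    return p_value, p_value < alpha


# Holdout hits estimate the "phantom" rate; a large excess in training hits suggests true disclosure.
p_leaky, reject_leaky = audit(train_hits=30, holdout_hits=10, n_train=1000, n_holdout=1000)
p_clean, reject_clean = audit(train_hits=12, holdout_hits=10, n_train=1000, n_holdout=1000)
```

The holdout set plays the role of a control group: matches against it cannot come from memorization, so they calibrate how often coincidental matches occur. Testing against a DP bound instead of zero-learning would replace `p_null` with the largest train-hit probability that the bound allows.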

Demystifying Variance in Circuit Discovery of LLMs

2606.16920v1 by Frank Zhengqing Wu, Francesco Tonin, Volkan Cevher

Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model's behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.

摘要：電路發現是機械解釋性中的一項關鍵技術，用以確定對於執行特定任務至關重要的模型組件。儘管當前最先進的方法（EAP-IG）在（不）忠實度的指標上表現良好，但它卻存在相當大的變異性。這包括重抽樣變異性，即當我們用來自同一分佈的新數據批次進行探測時，電路會發生變化；重新措辭變異性，即當提示被重新措辭時，發現的電路會發生變化；以及樣本級變異性，即具有低人口不忠實度的電路在個別樣本中顯示出不忠實度的大幅波動。
本文研究這些變異的根源。我們證明了CEAP，我們的新電路發現方法，改進了EAP-IG並具有理論保證，能夠顯著減少重抽樣變異性。我們進一步顯示，重新措辭變異性產生的原因是，具有不同模板的提示往往會激活模型中的不同電路。這使我們認為，找到一個綜合的電路來解釋和控制模型在任務上的行為可能是具有挑戰性的，因為這可以用無數模板來表達，這暗示著大型語言模型可能天生難以引導。我們顯示，稀疏性，雖然被聲稱能形成更緊湊和可解釋的任務電路，但未能解決這一問題。關於樣本級變異性，我們認為這在很大程度上是良性的：極差的不忠實度分數往往源於不忠實度的定義，而不是測量電路的缺陷。我們顯示，不忠實度的大小受到選擇性貢獻縮放的影響，這是一種神經機制，解釋了有時觀察到的極差分數。

Symbolic Informalization: Fluent, Productive, Multilingual

2606.16893v1 by Aarne Ranta

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

摘要：符號非正式化使得正式數學能夠可靠地轉換為自然語言。它有潛力使機器檢查的內容在人類可讀的情況下不失精確性。在傳統的證明系統使用中，符號非正式化將有限的語法糖機制概括為數學的普通語言。在由人工智慧和自動形式化構建證明的環境中，符號非正式化可以解釋究竟構建了什麼。本文概述了項目Informath，旨在展示符號非正式化如何在合理的開發努力下產生流暢的文本，並處理多種正式和自然語言。Informath基於一種跨語言架構，其中Dedukti作為不同證明系統（Agda、Lean、Rocq）之間的樞紐，而語法框架（GF）則負責不同自然語言中的語言正確性和變化。

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

2606.16890v1 by Sanjay Basu

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

摘要：聚合準確性基準隱藏了大型語言模型在電子健康紀錄（EHR）問題回答上失敗的系統性結構：需要更多推理步驟的問題產生不成比例的錯誤。基於對Transformer組合性限制的理論結果，我們引入了一個預先指定的跳數分類法——回答來自EHR的臨床問題所需的不同推理步驟數量——作為模型失敗的原則性預測指標。我們對313個臨床醫生生成的MedAlign EHR問題-回答對進行了標註，涵蓋了四個跳數級別，並在模型內部消融（claude-sonnet-4-6，零樣本與擴展思考）和跨架構複製（gpt-4o和gpt-5.4-2026-03-05，零樣本）中評估了301個問題。所有三個模型，涵蓋了兩個提供者和兩個OpenAI世代（GPT-4和GPT-5），都顯示出隨著跳數的增加準確性單調下降：Claude Sonnet零樣本從30.6%（跳=1）下降到17.6%（跳=4）（Cochran-Armitage z=-2.30，p=0.011；每跳的OR 0.72，95% CI [0.56,0.92]，p=0.008）；GPT-4o複製了這一點（37.8%下降至14.7%；OR 0.58 [0.45,0.75]，p<0.001）；而gpt-5.4-2026-03-05確認了這一點（37.8%下降至23.5%；OR 0.80 [0.66,0.98]，p=0.027）。一項預先指定的上下文充分性審計顯示，高跳數問題並未因EHR截斷而受到差異性劣勢（在跳數2-4時可回答率為93-95%，而在跳=1時為79%），因此下降反映了組合推理的困難。擴展思考並未顯著平坦化三種推理條件下的準確性-深度曲線，且思考標記的使用隨著跳數的增加而增長（r=0.31，p<0.0001），與預測的O(k)計算需求一致。因此，跳數成為一個理論驅動的、跨架構的預測指標，用於大型語言模型在EHR問題回答上的錯誤，對臨床AI的部署風險分層具有直接的影響。
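
The trend statistic quoted above (Cochran-Armitage z) can be reproduced on any table of per-hop outcomes. The sketch below is a minimal, standard-library version of the test, assuming linear scores equal to the hop level and a one-sided lower-tail p-value; the reported pair z=-2.30, p=0.011 matches a one-sided normal tail. The per-hop counts in the example are hypothetical, chosen only for illustration, and are not the paper's data.

```python
import math

def cochran_armitage(successes, totals, scores):
    """Cochran-Armitage test for a linear trend in proportions.

    successes[i] / totals[i] is the observed accuracy in group i,
    scores[i] is the ordinal score of that group (here: hop count).
    Returns (z, one-sided lower-tail p-value).
    """
    n_total = sum(totals)
    p_bar = sum(successes) / n_total  # pooled success rate
    # Score-weighted deviation of successes from their pooled expectation.
    t_stat = sum(t * (r - n * p_bar)
                 for r, n, t in zip(successes, totals, scores))
    s1 = sum(n * t for n, t in zip(totals, scores))
    s2 = sum(n * t * t for n, t in zip(totals, scores))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n_total)
    z = t_stat / math.sqrt(variance)
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # Phi(z)
    return z, p_lower

# Hypothetical counts per hop level (NOT taken from the paper).
correct = [30, 25, 20, 15]
asked = [100, 100, 100, 100]
hops = [1, 2, 3, 4]
z, p = cochran_armitage(correct, asked, hops)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

A negative z indicates accuracy falling as hop count rises; the odds ratio per hop reported in the abstract comes from a separate logistic regression that this sketch does not cover.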

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

2606.16721v1 by Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception--dynamics--planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

摘要：醫療診斷和治療是動態過程，患者狀態隨時間演變，臨床干預改變未來結果。儘管當前的醫療人工智慧可以檢測疾病、估算風險並生成報告，但許多系統仍然返回靜態標籤或分數，對於疾病如何進展或替代干預如何改變其軌跡提供有限的洞察。醫療世界模型將人工智慧中的世界模型概念應用於醫療保健，通過學習患者狀態動態的內部模擬器。它們的長期目標是幫助臨床醫生預測惡化、比較治療條件下的未來並為個別患者量身定制護理。然而，相關工作仍然分散在基礎模型、縱向建模、疾病模擬、治療效果估算、強化學習和數位雙胞胎之間。為了彌補這一差距，本綜述概述了一個推進醫療人工智慧的路線圖，從孤立的診斷和預測轉向模擬疾病演變並支持干預決策的醫療世界模型。這個路線圖圍繞三個相互關聯的能力組織：患者狀態構建、臨床動態建模和干預決策支持。在代表性系統中，這一比較突顯了每個能力的貢獻，以及如何將部分組件整合到更成熟的感知--動態--規劃系統中。最後，我們確定了將可行的推廣轉變為臨床有用模擬器所面臨的挑戰。相關文獻可在 https://github.com/1999kevin/awesome_medical_world_models 獲得。

Is Your Trajectory Displacement Safe in Long-tail?

2606.16313v1 by Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.

摘要：長尾場景仍然是自動駕駛評估的一個主要瓶頸，即使數據集的規模增長了幾個量級。現有的評估管道很少同時符合人類對齊、安全意識、可驗證和可解釋的要求：閉環指標在強大的規劃者之間往往會飽和，而無結構的人類評分在沒有精心設計的協議下可能會產生噪音。我們將規劃評估定義為額外威脅檢測：給定一個規劃者的軌跡和一個專家參考，該規劃者的位移是否引入了新的不安全駕駛行為？我們提出了FluidTest，一個具有三個組件的評估管道：一個用於可靠人類註釋的成對WebUI協議；一個包含32種語義威脅的分類法，並附有基於證據的決策圖；以及一個具有反思的三代理驗證系統，以提高精確性和可審計性。在WOD-E2E數據集上的實驗顯示，FluidTest在訓練過的註釋者之間產生了一致的標籤，並在65%的Poutine軌跡和51%的RAP軌跡中識別了額外的威脅。這些結果表明，儘管Rater Feedback Scores (RFS)高且Average Displacement Error (ADE)低，最先進的規劃者仍然可能出現實質性的安全相關失敗。更多詳細信息、指導和代碼可在 https://fluidtest.web.app 獲得。

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

2606.16215v1 by Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

摘要：多回合工具使用代理必須進行推理、調用工具並適應多次互動回合中的觀察。訓練後這類代理是具有挑戰性的，因為強化學習通常面臨稀疏獎勵和弱信用分配的問題，儘管它與僅提示的推理設置相匹配，而在專家痕跡上進行監督微調則提供了密集的過程監督，但可能會使模型過度約束於固定的軌跡。為了解決這個問題，我們提出了PACT，一種用於多回合工具使用代理的特權痕跡共同訓練框架。其關鍵思想是僅在訓練期間將專家痕跡用作優化信號，而不是在回合期間提供提示。PACT保持回合生成僅基於提示，然後利用專家痕跡通過兩個互補信號來指導優化：一個基於痕跡的強化學習替代品，該替代品在專家痕跡上下文中評估僅基於提示的回合，以及一個組件感知的SFT損失，該損失以逐漸減弱的強度監督推理前綴和工具調用。為了減少對僅訓練痕跡上下文的過度依賴，PACT進一步引入了僅基於提示的錨定。我們還提供了一個潛在痕跡視圖，該視圖將兩個基於痕跡的目標連接起來，並解釋了專家痕跡如何在不被用於回合生成的情況下指導優化。在FTRL、BFCL和ToolHop上的實驗顯示，PACT始終優於強大的基於SFT和RL的基線，突顯了特權痕跡共同訓練在多回合工具使用學習中的價值。

Embedded Arena: Iterative Optimization via Hardware Feedback

2606.16190v1 by Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu, Chun-Cheng Chang, Jiayi Shao, Yujia Liu, Emmanuel Azuh Mensah, Edward Wang, Kurtis Heimerl, Gregory D. Abowd, Shwetak Patel, Natasha Jaques, Vikram Iyer

Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware -- compiling, flashing, and measuring on real hardware -- to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with <3.3% accuracy loss and 400x for audio with <6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

摘要：嵌入式設備從野生動物監測站到臨床可穿戴設備，由於延遲、通信或隱私限制，需要進行本地 AI 推斷。對於異構微控制器 (MCUs) 優化模型需要同時滿足對記憶體、功耗和溫度的嚴格物理限制，同時保持準確性，這是一個多維優化，今天由專家手動執行。我們詢問一個 LLM 代理是否能夠自主導航這個複雜的多輪流程，並受到實際硬體反饋的指導，並介紹一個硬體在迴路的代理競技場，在這裡代理反覆精煉模型和韌體——在實際硬體上編譯、燒錄和測量——以實現閉環優化。前沿模型，包括 Claude Opus 4.7 和 Gemini 3.1 Pro，在沒有硬體反饋的情況下完全失敗（0% 部署成功），而我們的硬體在迴路的公式在三次迭代內實現了第一次成功部署，並且在七次迭代內可以超越人類專家的結果。這種代理協同優化為視覺模型實現了 250 倍壓縮，準確性損失小於 3.3%，音頻則實現了 400 倍壓縮，特徵錯誤率損失小於 6%，使得通過太陽能收集在商業 MCU 上實現無電池操作。我們在兩個現實世界系統中展示了實際影響：一個麋鹿檢測相機陷阱（96.7% 準確率）和一個語音轉錄可穿戴設備（8.44% 特徵錯誤率）用於兒童發展研究。

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

2606.16149v1 by Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare diseases (prevalence below 1 in 1,000,000), with ultra-rare shares of approximately 45% and 52.8%, respectively. On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

摘要：大多數醫療人工智慧系統透過擴展額外的機器來提高性能：更多的微調數據、更多的代理和/或更大的檢索數據庫。然而，在罕見疾病診斷中，這種擴展可能會產生難以部署、審核和維護的系統。我們詢問是否可以通過擴展單個人工智慧代理的推理鏈來實現最先進的診斷性能：通過人類與人工智慧的合作開發診斷政策來引導它，並使用免費的生物醫學工具進行增強。我們介紹了LiteOdyssey，一個輕量級的罕見疾病診斷框架，通過臨床遺傳學工作流程引導推理語言模型。這個框架是通過人類反饋的政策迭代（PIHF）開發的，並使用對公共生物醫學工具的動態訪問。在兩個具有挑戰性的基準上，LiteOdyssey在僅提供患者臨床特徵的情況下實現了最先進的性能，整體疾病的Recall@1為59.3%，涵蓋了1,243個LIRICAL（n = 370）和PhenoPacket Store（n = 873）的案例。這兩個基準中超罕見疾病的比例很高（流行率低於1/1,000,000，超罕見的比例分別約為45%和52.8%）。在更具挑戰性的PhenoPacket子集上，因為因果疾病未在我們的稀有映射管道中映射到Orphanet，LiteOdyssey實現了60.7%的Recall@1，而同一基準模型（GPT-5.4）在未使用工具的情況下僅為10.7%。這一性能是在沒有微調、多代理集成或大型案例檢索數據庫的情況下實現的。還觀察到以下增益：在開發期間從未見過的案例上、在一個真實世界罕見疾病患者的私有隊列上，以及在一個較小的開放權重模型上。LiteOdyssey暗示了朝向準確性高、易於部署且對醫生審查更透明的罕見疾病人工智慧系統的發展道路。

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

2606.16137v1 by Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions but harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

摘要：語音深偽檢測（SDD）系統需要可信的解釋以進行可靠的決策。現有的解釋方式主要分為兩類。傳統的可解釋人工智慧（XAI），例如基於梯度的歸因，產生與模型決策緊密相關的低階歸因信號，且比自然語言解釋更難以被人理解。與此同時，基於大型語言模型（LLM）的解釋生成常常因缺乏啟發式證據和任務特定的監督而產生通用且缺乏根據的描述，這源於SDD的有限根據解釋數據集。因此，我們提出了一個無需訓練的解釋框架，將XAI證據與多模態LLM整合，以生成有根據且具體的解釋。使用PartialSpoof數據集，我們構建了一個有根據的解釋數據集，並顯示使用XAI的方法使內部準確率提高了超過45\%，這一結果通過人類評估和可信度檢查得到了驗證。

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

2606.16003v1 by Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, Jiahuan Pei

This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and human-aligned evaluation. To this end, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical- and syntactic-based similarity, while struggling with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting challenges in using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.

摘要：這項研究探討大型語言模型（LLMs）從科學文本生成數學方程式的能力。先前的研究面臨著非結構化基礎、多方程式依賴和人類對齊評估的挑戰。為此，我們構建了一個AI研究論文數據集，將上下文段落與真實方程式和變量描述配對。我們開發了一個可解釋的方程式生成工作流程，並在多種開源和閉源的LLM骨幹上進行評估。我們引入了一個評估協議，結合自動指標、基於LLM的評分標準和人類判斷，以評估準確性、可解釋性和人類-LLM對齊。結果顯示，LLMs在詞彙和句法基礎的相似性上表現中等，但在語義準確性上表現不佳。LLM基礎的評估與人類判斷之間的比較顯示出有限的對齊，突顯了使用LLMs評估方程式質量的挑戰。這些發現為改進方程式生成模型和開發更可靠的科學文本評估方法提供了見解。我們提供了代碼和數據以便重現。

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

2606.15998v1 by Utshab Kumar Ghosh, Shubham Chatterjee

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show that this assumption is insufficient and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($κ\approx 0$), while OER operationalizations agree substantially ($κ\approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.

摘要：實體感知文件檢索使用與查詢相關的實體作為排名信號，假設語義上相關的實體也是有用的檢索信號。我們顯示這一假設是不充分的，並解釋原因。與術語不同，術語是基於真實觀察的，而實體鏈接是由不完美的鏈接器產生的假設：如果鏈接器在相關和不相關的文檔中隨意觸發，則一個實體可以在主題上是中心的，但卻不提供任何區分信號。我們將此正式化為概念實體相關性（CER）與可觀察實體相關性（OER）之間的區別——CER是指一個實體是否與查詢在主題上相關，而OER是指其在集合中的觀察存在是否能區分相關和不相關的文檔。在四個集合和標註來源中，包括人類實體判斷，CER和OER的協議接近隨機（$κ\approx 0$），而OER的操作化則顯著一致（$κ\approx 0.5$），確認CER作為系統性異常。基於CER的監督選擇在主題上合理但區分能力較弱的實體，在某些集合中修剪不到4%的不相關文檔。將監督與OER對齊可將不相關文檔的修剪提高至10倍，並使開放世界MAP相較於BM25提高0.051。我們的研究結果促使從概念性實體相關性轉向可觀察的實體相關性在實體感知檢索中的應用。
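
The agreement figures above ($κ\approx 0$ versus $κ\approx 0.5$) are Cohen's kappa values, which correct raw agreement for the agreement expected by chance. The following is a minimal sketch of that statistic for two label sequences. The CER/OER labels in the example are hypothetical, constructed so that raw agreement is 50% yet kappa is exactly zero.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(freq_a[c] * freq_b[c]
                   for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:  # both sequences use a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical per-entity judgments: 1 = relevant, 0 = not relevant.
cer_labels = [1, 1, 0, 0]  # conceptually relevant?
oer_labels = [1, 0, 1, 0]  # observably discriminative?
kappa = cohens_kappa(cer_labels, oer_labels)
print(kappa)
```

Here the two labelings agree on half of the entities, which is exactly what their marginals predict by chance, so kappa is zero even though raw agreement looks moderate.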

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

2606.15931v1 by Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim 2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus $7$-$10\%$ for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

摘要：歷史醫療檔案和傳統醫藥在藥物發現方面具有巨大的潛力，並且仍然是當前藥物開發的主要來源。然而，前本體論的散文和特有的分類法阻礙了數據的標準化和醫療現代化，以便用於當前的生物醫學管道。此外，現有的 LLM 代理系統，無論是工具調用、檢索增強還是代理深度研究，都無法將這類文本轉換為可驗證的藥物發現線索。我們通過 DeepRoot 彌補了這一空白，這是一個多代理 LLM 系統，聯合構建和利用經過驗證的知識圖譜，顯示出基礎和推理——通常被混淆——是系統可以組合的可分離軸，用於治療推理。應用於《神農本草經》，DeepRoot 恢復了 $10$ 個 $21$ 個保留的化合物-疾病治療對，在 R@$20$ 下的表現為 $47.6\%$（相比之下，原始語料庫 LLM 為 $4.8\%$，隨機約 $2.4\%$），並在推理質量的 LLM 作為評審的審計中超越了基準 LLM 和直接工具調用訪問相同 API 的 LLM，這些 API 是 DeepRoot 自身查詢的。使用工具的 LLM 在 $87\%$ 的主張上出現幻覺，而 DeepRoot 的比例為 7-10%。僅圖譜推理的幻覺為 $0\%$，但在推理連貫性上排名最低；DeepRoot KG+LLM 是唯一在兩個軸上都獲勝的條件，指向系統性挖掘和重新利用歷史醫療知識的路徑。
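
The R@20 figure above follows directly from the stated counts: 10 recovered pairs out of 21 held-out pairs is 47.6%. The sketch below shows one common way to compute recall at k from ranked candidate lists and checks that arithmetic. The compound and disease identifiers are hypothetical placeholders. The 4.8% baseline equals what a single hit out of 21 would give, but the abstract does not state that count, so treat it as an inference.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold answer appears among the top-k candidates.

    rankings: dict mapping a query to its ranked list of candidates.
    gold:     dict mapping a query to its single held-out answer.
    """
    hits = sum(1 for query, answer in gold.items()
               if answer in rankings.get(query, [])[:k])
    return hits / len(gold)

# Hypothetical toy example (identifiers are placeholders, not real data).
rankings = {
    "compound_a": ["disease_3", "disease_1", "disease_2"],
    "compound_b": ["disease_2", "disease_4", "disease_5"],
}
gold = {"compound_a": "disease_1", "compound_b": "disease_9"}
toy_recall = recall_at_k(rankings, gold, k=2)

# Arithmetic behind the reported percentages.
deeproot_pct = round(100 * 10 / 21, 1)    # 10 of 21 pairs recovered
single_hit_pct = round(100 * 1 / 21, 1)   # what one hit out of 21 would give
print(toy_recall, deeproot_pct, single_hit_pct)
```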

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

2606.15796v1 by Artyom Mazur, Nina Konovalova, Aibek Alanov

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

摘要：機械解釋性旨在通過將模型計算分解為可解釋的特徵和電路來解釋神經網絡的行為。雖然基於轉碼器的電路追蹤最近使大型語言模型的詳細因果分析成為可能，但用於圖像生成的多模態擴散Transformer仍然相對不透明。我們仍然缺乏理解語義信息如何在去噪步驟中傳播以及文本和圖像表示如何在雙流MM-DiT架構中互動的工具。現有方法僅提供部分見解：注意力圖揭示了標記互動的有限視圖，而稀疏自編碼器可以發現可解釋的特徵，但不直接顯示這些特徵如何通過非線性MLP層轉換和組合。在這項工作中，我們將基於轉碼器的電路追蹤擴展到多模態擴散Transformer。我們訓練了時間步驟條件的轉碼器，這些轉碼器忠實地近似FLUX.1[schnell]中的MLP子層的輸入輸出行為。通過用轉碼器替換MLP並線性化剩餘計算，我們獲得了精確的特徵到特徵的歸因，並恢復了緊湊的可解釋電路。經驗上，我們的轉碼器在稀疏性與忠實性之間的權衡上與稀疏自編碼器相當或略勝一籌。所得到的電路揭示了屬性綁定和跨流語義傳播的機制，並為系統生成錯誤提供因果解釋。此外，基於電路的干預在精確性和有效性上顯著優於標準的SAE基礎引導。我們的結果表明，基於轉碼器的電路分析對於最先進的擴散Transformer是可行的，並提供了一個強大的框架來理解和控制多模態生成模型。代碼可在 https://github.com/Artalmaz31/DifFRACT 獲得。

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

2606.15767v1 by Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji

Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.

摘要：理解深度神經網絡何時以及為何存在不確定性，對於在安全關鍵領域部署可靠的機器學習系統至關重要。雖然現有的不確定性量化方法提供了模型信心的標量度量，但對於輸入的哪些空間區域對不同類型的不確定性有貢獻的洞察有限。我們提出了一種新穎的可視化框架，不確定性激活圖（UAM），它將證據深度學習（EDL）與全梯度類別激活映射（FullGrad）結合，以生成可解釋的空間不確定性激活圖。我們的方法區分了兩種基本的不確定性類型：空虛，代表缺乏證據，和不和諧，捕捉競爭假設之間的衝突證據。通過利用FullGrad的完整梯度分解特性和主觀邏輯的原則性不確定性量化，我們的方法產生了理論上有根據的可視化，突顯出負責模型不確定性的特定圖像區域。通過這個框架，空虛和不和諧激活圖是通過計算信念加權的歸因來生成的，從而能夠識別模型缺乏知識的地方以及遇到模糊證據的地方。在多個基準數據集上的廣泛評估表明，所提出的框架有效地解決了不確定性量化與可解釋性之間的關鍵差距，提供了直觀的視覺反饋，以評估模型在複雜視覺識別任務中的可靠性。
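
The two uncertainty types named above have closed-form definitions in Subjective Logic, so the scalar quantities behind the maps can be sketched directly. The code below assumes the usual Evidential Deep Learning setup, in which a network outputs non-negative per-class evidence and the Dirichlet strength is the sum of evidence plus the number of classes. It computes only the per-input vacuity and dissonance values; the spatial attribution step with FullGrad is not shown.

```python
def vacuity_and_dissonance(evidence):
    """Subjective-Logic vacuity and dissonance from per-class evidence.

    evidence: list of non-negative evidence values, one per class.
    Returns (vacuity, dissonance).
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes       # Dirichlet strength S
    belief = [e / strength for e in evidence]    # belief mass per class
    vacuity = num_classes / strength             # mass left unassigned

    def balance(b_j, b_k):
        # Relative balance of two belief masses: 1 if equal, 0 if either is zero.
        if b_j == 0 or b_k == 0:
            return 0.0
        return 1.0 - abs(b_j - b_k) / (b_j + b_k)

    dissonance = 0.0
    for k, b_k in enumerate(belief):
        others = [b for j, b in enumerate(belief) if j != k]
        total_other = sum(others)
        if total_other > 0:
            weighted = sum(b_j * balance(b_j, b_k) for b_j in others)
            dissonance += b_k * weighted / total_other
    return vacuity, dissonance

# No evidence at all: maximal vacuity, no conflict.
print(vacuity_and_dissonance([0.0, 0.0, 0.0]))
# Strong but evenly split evidence for two classes: low vacuity, high dissonance.
print(vacuity_and_dissonance([50.0, 50.0, 0.0]))
```

The two example inputs separate the cases the abstract contrasts: the model lacks knowledge in the first and faces conflicting evidence in the second.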

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

2606.15730v1 by Zhenyu Yu

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.

摘要：後門去學習旨在從已部署的模型中移除惡意觸發行為，同時保持清潔效用。我們研究了無更新推斷時間設置，其中模型參數保持不變。首先，我們審核了在預言者配對清潔和觸發特徵下的一個常見投影假設。投影主要在 BadNets 上成功，並在 CIFAR-10 ResNet-18 上留下 WaNet、Blended 和 SIG 的 ASR 分別為 0.683、0.888 和 0.941。這一失敗無法通過光譜緊湊性、空間局部性或子空間不對齊來解釋。它是由涉及目標邊際、目標邏輯下降和非目標邏輯上升的邏輯三元組差距預測的。我們隨後介紹了 InstantForget，一種清潔校準的門控重置，它通過馬哈拉諾比斯分數標記異常特徵，並僅將標記的特徵移向中性非目標表示。在保留的觸發驗證中選擇一個固定的操作點，InstantForget 在四個非自適應的 CIFAR-10 觸發器中將平均 ASR 降低至 0.071，而不需要觸發樣本或在部署時更新參數。它還達到了 0.981 的檢測 AUROC，並轉移到八個測試骨幹中的六個。報告中在 WaNet、ModelNet10 點混合、兩個骨幹幾何和自適應特徵緊湊性攻擊下的失敗定義了該方法的範疇。
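
The gating idea described above (score each feature against clean statistics, then move only flagged features toward a neutral representation) can be illustrated with a small standard-library sketch. This is not the authors' implementation. It assumes a diagonal covariance estimated from clean features, a hand-picked threshold, and the clean mean as a stand-in for the neutral non-target representation. The names `fit_clean_stats`, `gated_reset`, and `strength` are hypothetical.

```python
import math

def fit_clean_stats(clean_features, eps=1e-6):
    """Per-dimension mean and variance of clean features (diagonal covariance)."""
    n, dim = len(clean_features), len(clean_features[0])
    mean = [sum(f[i] for f in clean_features) / n for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in clean_features) / n + eps
           for i in range(dim)]
    return mean, var

def mahalanobis(feature, mean, var):
    """Mahalanobis distance under the diagonal-covariance assumption."""
    return math.sqrt(sum((x - m) ** 2 / v
                         for x, m, v in zip(feature, mean, var)))

def gated_reset(feature, mean, var, threshold, neutral, strength=1.0):
    """Return (feature, flagged); only flagged features are moved toward `neutral`."""
    if mahalanobis(feature, mean, var) <= threshold:
        return list(feature), False  # looks clean: leave untouched
    moved = [x + strength * (t - x) for x, t in zip(feature, neutral)]
    return moved, True

# Hypothetical 2-D clean features used for calibration.
clean = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mean, var = fit_clean_stats(clean)
neutral = mean  # assumption: clean mean stands in for the neutral target

kept, kept_flag = gated_reset([0.6, 0.4], mean, var, threshold=3.0, neutral=neutral)
reset, reset_flag = gated_reset([5.0, 5.0], mean, var, threshold=3.0, neutral=neutral)
print(kept, kept_flag, reset, reset_flag)
```

Because the model weights are never touched, the only tunable parts in this sketch are the threshold and the reset strength, which mirrors the update-free setting the abstract describes.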

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

2606.15709v1 by Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

Jordan faces severe water scarcity, with 50% of the water it produces lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

摘要：約旦面臨嚴重的水資源短缺，50\% 的水產量因漏水、盜竊和計量問題而損失，這也被稱為非收入水 (NRW)。傳統的反應性方法已被證明不足以持續減少 NRW。本文提出了一個智能框架，整合了 EPANET 水力模型、數位雙胞胎技術、SCADA 系統和基於大型語言模型 (LLM) 的 AI 代理，用於持續的網絡監控和自適應決策。該系統結合了實時數據流和基於物理的模擬來檢測異常，採用檢索增強生成 (RAG) 進行政策解釋，並通過函數調用進行網絡控制。概念驗證實施使用 EPYT 和離線 LLM（llama3.1:8b 通過 Ollama）在 1,164 個接頭的安曼區網絡上驗證了技術可行性。該系統展示了自動化的水力模擬、基於流量的異常檢測，與水分配區 (DZ) 實踐相一致，並生成 AI 健康報告，響應時間在 2 分鐘以內且無 API 成本。爆裂檢測依賴於本地流量異常分析：一個 30.1~L/s 的模擬漏水在 15 根管道中產生可測量的流量重分配，標記出一個 15 接頭的集群，定位了爆裂點——確認與水分配區 (DZ) 監控實踐的一致性。該框架通過分階段實施，適應約旦的間歇性供應模式和有限的自動化，為水資源短缺的地區提供了利用智能自動化減少 NRW 和提高運營效率的可擴展途徑。

Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

2606.17102v1 by Aoyu Zhang, Dongping Liu, Luyao Zhang

Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative -- from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

摘要：量子計算承諾在科學和工業領域帶來變革性的進展，然而，支持這些計算的物理硬體對公眾來說仍然是隱形的：量子處理器在接近絕對零度的密封稀釋冰箱內運行，使得直接觀察變得不可能。量子計算日益增長的社會影響與公眾可視化能力之間的這種「想像差距」代表了量子素養和勞動力發展的一個重要障礙。我們提出了量子影院（Quantum Cinema），這是一個開源的基於瀏覽器的互動應用程序，通過使用生成性世界模型將隱形的量子硬體轉化為可探索的電影體驗，來縮小這一差距。量子影院引導用戶通過四幕敘事——從量子糾纏的基礎諾貝爾獎科學，通過對三種主要量子計算架構（捕獲離子、中性原子和超導系統）的精選視頻介紹，進入使隱形量子現象可觀察的沉浸式三維生成世界，最後到基於真實量子設備規格的互動雷達圖比較。所有三維環境均使用WorldLabs的生成性世界模型平台生成，並基於來自亞馬遜網絡服務（AWS）Braket量子硬體的精選指標進行科學驗證。量子影院不需要安裝、不需要專業硬體，也不需要量子計算背景。它旨在服務兩個不同的社群：尋求複製或擴展該平台的學者和開發者，以及尋求直觀工具以向不同受眾解釋量子硬體的教育工作者、研究人員和科學傳播者。本文描述了系統架構、生成性世界模型管道、兩個社群的使用案例以及未來工作的方向。

Is Code Better Than Language for Algorithmic Reasoning

2606.15589v1 by Terry Tong, Yu Feng, Surbhi Goel, Dan Roth

For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by 31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence that the performance gains require reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.

摘要：對於工具增強的語言模型，將自然語言推理與代碼執行管道進行比較是困難的，因為這種比較會改變中間表示和執行機制。我們通過一個中間介入來分離這些因素：模型將其推理表達為可執行代碼，語言模型在上下文中模擬該代碼以產生答案。在一個包含40個任務的可驗證算法基準上，確定性代碼執行的表現超過自然語言推理31.6個百分點。我們觀察到，中間介入與自然語言推理並沒有實質性區別（+0.15個百分點）。這些結果表明，在我們評估的設置中，僅改變中間表示並不能解釋工具使用的優勢，提供了性能增益需要可靠外部執行的證據。我們用一個簡單的統計決策理論模型來形式化這一直覺，該模型描述了在我們的解耦追蹤生成/執行體系中，何時執行主導端到端風險。我們使用重建介入來驗證我們的理論，該介入利用代理語言模型從代碼表示推斷自然語言推理的追蹤，恢復了與原始自然語言推理管道相當的性能。所有實驗均在 https://github.com/TerryTong-Git/ToolProj 上進行。

Service-Induced Congestion in Memory-Constrained LLM Serving

2606.15555v1 by Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.

摘要：在大型語言模型（LLM）服務中，每個請求在服務過程中隨著每個生成的標記而累積持久的圖形處理單元（GPU）記憶體，以增長其鍵值快取。在高併發的情況下，總記憶體使用量因此隨時間內生性地增加：服務過程本身創造了未來的容量壓力。當記憶體容量被超過時，系統會驅逐活躍請求，丟棄快取狀態並在稍後重新啟動它們，這會浪費計算資源並降低吞吐量。我們開發了一個記憶體受限的 LLM 推理的離散時間動態模型，捕捉在持續批處理下的接納、記憶體增長和驅逐。在飽和輸入範疇中，系統同時接納無驅逐的固定點和帶有驅逐的極限週期。對於同質工作負載，我們顯示無驅逐的平衡是不穩定的，並且除了勒貝格測度為零的精確捕獲集外，系統收斂到一個唯一的最壞情況極限週期，該週期在這個例外集之外是漸近穩定的，吞吐量損失高達 50%。對於異質工作負載，我們在兩類共同輸入設置中證明了一個穩定性準則，並解釋生存多項式機制如何推廣到多個類別和異質輸入長度。在輸入主導的縮放範疇下，互質的解碼長度穩定了無驅逐的平衡，而非互質的長度則創造了驅動不穩定性的同步模式。這些結果表徵了工作負載異質性何時使完成不同步並幫助穩定記憶體受限的服務。更廣泛地說，我們將服務引起的擁堵識別為一種結構性不穩定機制，並推導出維持高吞吐量的調度設計原則。

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

2606.15419v1 by Zaifu Zhan, Shuang Zhou, Rui Zhang

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

摘要：目標：提升大型語言模型（LLMs）在醫學問題回答（MedQA）中的準確性、可解釋性和穩健性。
方法：我們設計了一種多代理同行評審推理方法，其中多個LLM代理獨立生成思考鏈推理及候選答案，然後作為同行評審來評估彼此的推理在事實正確性和邏輯合理性方面的表現。評分最高的推理鏈被選中以產生最終答案。實驗使用了五個最先進的LLM（Llama-3.1-8B、Qwen2.5-7B、Phi-4、DeepSeek-LLM-7B、GPT-oss-20B）在三個基準數據集上進行：HeadQA、MedQA-USMLE和PubMedQA。性能與單模型思考鏈推理和基於思考鏈的多數投票進行比較。
結果：同行評審推理始終超越了兩個基準。最佳模型組合在數據集上達到了平均準確率0.820，超過了最強的單一模型（0.777）和多數投票集成（最高達到0.789）。該方法在參與模型數量增加時也能有效擴展，而同行評估可靠地區分了高質量和低質量的推理鏈。
結論：所提出的多代理同行評審推理方法使LLMs能夠同時作為解決者和評估者，在MedQA中產生了卓越的表現。通過強調推理質量而非僅僅是答案一致性，這種方法提高了準確性、可解釋性和穩健性，為可信的生物醫學AI系統提供了有前景的方向。
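
The selection step described in this abstract (pick the highest-rated reasoning chain rather than the most common answer) can be sketched as below. This is a minimal illustration, not the authors' implementation: the agent names, the toy scores, the mean-score aggregation, and the exclusion of self-review are all assumptions made for the example.

```python
from collections import Counter

def majority_vote(candidates):
    """Baseline: return the most frequent answer among the agents."""
    return Counter(candidates.values()).most_common(1)[0][0]

def peer_review_select(candidates, scores):
    """Return the answer whose reasoning chain has the highest mean peer score.

    candidates: dict mapping agent name -> answer.
    scores: dict mapping (reviewer, author) -> score for the author's chain.
    Self-reviews are ignored (an assumption of this sketch).
    """
    def mean_peer_score(author):
        peer = [s for (reviewer, a), s in scores.items()
                if a == author and reviewer != author]
        return sum(peer) / len(peer)

    best_author = max(candidates, key=mean_peer_score)
    return candidates[best_author]

# Hypothetical toy data: two agents agree on "B", but the third agent's
# chain for "C" is rated highest by its peers.
candidates = {"m1": "B", "m2": "B", "m3": "C"}
scores = {
    ("m2", "m1"): 0.3, ("m3", "m1"): 0.4,
    ("m1", "m2"): 0.4, ("m3", "m2"): 0.5,
    ("m1", "m3"): 0.9, ("m2", "m3"): 0.8,
}

voted = majority_vote(candidates)
selected = peer_review_select(candidates, scores)
```

In this toy case the two strategies disagree, which is the situation where rating reasoning quality can differ from counting answer agreement.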

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

2606.15363v1 by Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

摘要：自我改進的 AI 代理已成為一個關鍵的研究前沿：這些系統根據累積的操作經驗修改自己的提示、工作流程和決策規則。最先進的 Self-Harness 框架 [1] 通過挖掘失敗集群和修補代理鞍具，在 Terminal-Bench-2.0 上實現了 14--21% 的改進。然而，Self-Harness 只優化了一個維度——提示鞍具——而行為原則和工作流程拓撲保持不變。我們提出了 APEX（自適應原則提取），這是一個三層共演化框架，同時演化： (L1) 通過失敗模式修補來改進鞍具，(L2) 通過成功追蹤蒸餾行為原則 [2]，以及 (L3) 通過基於結構適應度的選擇來改變代理工作流程拓撲 [6]。我們在 Joe [13] 上實現了 APEX，這是一個基於 NVIDIA Nemotron 構建的生產級超 AI 代理，設計為 NVIDIA 代理挑戰 2026 的邊緣 AI 代理工廠，管理一個使用 114 個在 18 天內收集的真實任務追蹤的 15 節點計算集群。在單次進化運行中，APEX 實現了 0.570 的 APEX 健康分數（比基線 0.300 提高 90%），蒸餾出 6 個新穎可重用的原則，並選擇了一個研究優先的工作流程拓撲，得分 0.900（提高 20%）。我們的結果表明，多維共演化顯著優於單軸鞍具優化，僅在本地 qwen2.5-coder:32b 實例上消耗 4 次 LLM 調用（約 270 秒）的成本。

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

2606.15307v1 by Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

摘要：仇恨和宣傳性迷因利用圖像與文本之間的相互作用來傳達單一模態無法單獨揭示的有害意圖。儘管基於思考的多模態大型語言模型（MLLMs）在視覺-語言理解方面取得了進展，但其在迷因內容審核中的應用仍然未被充分探討。我們提出了一種基於強化學習的後訓練方法，通過任務特定的獎勵和群體相對策略優化（GRPO）來提高基於思考的MLLMs的分類性能和基於參考的解釋質量。具體來說，我們（i）對現成的MLLMs在英語和阿拉伯語基準上進行系統的實證研究，以理解仇恨和宣傳性迷因，（ii）通過蒸餾和多LLM的細粒度宣傳註釋，擴展現有的迷因數據集，並提供弱監督的思考鏈（CoT）推理，（iii）引入一種基於GRPO的目標，並進行思考長度正則化，該目標共同優化分類準確性和解釋質量，以及（iv）利用基於共識的偽標籤研究在未標記迷因上的自我監督GRPO。在仇恨迷因和ArMeme基準上的實驗顯示，我們的方法在FHM準確性（最多提高+2.1%，從79.9%提升至82.0%）和ArMeme宏F1（最多提高+7.6點，從0.536提升至0.612，附帶解釋；與原始ArMeme基準相比提高6.1）上均優於先前報告的結果，同時生成自然語言解釋。在ArMeme上，序列分類基準在原始準確性方面仍然更強，而我們的方法則提供了更平衡的每類性能以及解釋。我們公開發布了我們的代碼、數據擴展和評估資源。

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

2606.15176v1 by Weihao Gao

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

摘要：超聲影像因其低成本和可攜性而成為全球最廣泛採用的醫療模式，但人工智慧（AI）的應用仍受限於對GPU加速模型的依賴，形成了一種結構性悖論，即「智慧」的成本超過影像設備本身的成本。在此，我們展示了UltraSeg的系統性調整和廣泛評估，這是一種最初為結腸鏡息肉分割而開發的超輕量架構，現已針對十個公共數據集進行了針對即時超聲（POCUS）的工程設計，涵蓋六個解剖部位（乳腺、甲狀腺、腎臟、頸動脈、胎兒和小動物腫瘤）。我們系統性地驗證了超聲領域中的兩個變體：UltraSeg-130K（0.13M參數）在單核心CPU上達到89.7 FPS，在翻新移動設備上達到34.8 FPS，而UltraSeg-500K（0.5M參數）在CPU上提供44.6 FPS，在移動設備上提供16.1 FPS。UltraSeg-500K的Dice性能與31M參數的UNet相匹配或超過，並在平均性能上接近105M參數的TransUNet，並在外部驗證集（UDIAT，DDTI）上展現出優越的零樣本跨數據集泛化能力。通過實現無GPU依賴的臨床級分割，這項工作使AI成本與超聲可及性相符，使高級診斷在資源有限的環境中變得可用。

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

2606.14686v1 by Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

Cotton is an economically important crop worldwide because the textile industry depends heavily on it, so the precise identification and detection of cotton leaf disease is crucial for economic stability. The development goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. With this goal, we evaluated multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset includes seven classes (six disease classes and one healthy class), collected under various field conditions reflecting real-world challenges. Among these pretrained models, DenseNet201 achieved the highest classification accuracy, 98%. To enhance model reliability and interpretability, we implemented techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM), occlusion sensitivity analysis, and adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model's capabilities in real-life agriculture. This paper shows the deep learning model's capability to classify the disease in real-life cotton disease management situations.

摘要：全球而言，棉花是一種經濟效益極高的作物，因為紡織業在很大程度上依賴於它。因此，準確識別和檢測棉葉病對於經濟穩定至關重要。“CottonLeafVision”的發展目標是準確分類和檢測棉葉病。為了實現這一目標，我們評估了多個預訓練的深度卷積神經網絡，包括DenseNet201、InceptionV3和VGG19，這些網絡在一個公開可用的棉葉病影像數據集上進行了測試。這個影像數據集包括七個類別，其中六個是病害類別，一個是健康類別，這些數據是在各種田間條件下收集的，反映了現實世界的挑戰。在這些預訓練模型中，使用DenseNet201時，我們達到了98%的最高分類準確率。為了增強模型的可靠性和可解釋性，我們實施了不同的技術和方法，如梯度加權類別激活映射（Grad-CAM）、遮蔽敏感性分析和對抗訓練，以提高模型的抗噪聲能力。最後，我們開發了一個原型，以便在現實農業中利用模型的能力。本文展示了深度學習模型在現實棉花病害管理情境中分類疾病的能力。

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

2606.14838v1 by Louis Mahon, Elliot Ford, Callum Hackett

How to define a good explanation is a long-standing philosophical debate that has recently attracted renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good explanations of AI systems, we must first understand what good explanations are. In this paper we propose a definition inspired by the notion of counterfactual explanations; however, we argue that one must also take into account the interlocutor's prior beliefs in each fact that could be offered in an explanation. We explore the ramifications of this definition for AI explainability and, in particular, why it is difficult to produce good explanations for LLM outputs.

摘要：如何定義一個好的解釋是一個長期存在的哲學辯論，最近在人工智慧輸出方面重新引起了興趣。可解釋性對於許多情境中的人工智慧採用至關重要，但為了產生良好的人工智慧系統解釋，我們必須首先了解什麼是好的解釋。在本文中，我們提出了一個受反事實解釋概念啟發的定義，然而我們主張還必須考慮對話者在每個可以提供的解釋事實中的先前信念。我們探討了這一定義對人工智慧可解釋性的影響，特別是為什麼大型語言模型的輸出難以產生良好的解釋。

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

2606.14647v1 by Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.

摘要：基於Transformer的自動語音識別 (ASR) 模型，例如 Whisper，具有很高的準確性，但其預測仍然難以解釋。現有的可解釋人工智慧 (XAI) 方法往往缺乏忠實性和精確的時間定位。我們提出了以熵引導注意力進行忠實解釋的聆聽模型 (LEAF-X)，這是一個針對基於Transformer的 ASR 的模型內在 XAI 框架。LEAF-X 結合了熵引導的注意力加權、多層注意力展開和可選的因果消融，以識別低熵、高影響力的頭部和層，產生稀疏的標記到幀的歸因。與基於擾動的解釋器或原始注意力圖不同，LEAF-X 利用編碼器-解碼器和增強語音的僅解碼器模型的內部結構來生成更好反映模型計算的解釋。結果顯示，忠實性提高了 32%，局部性/稀疏性增強了 35-39%，並且歸因最為穩定，支持更透明和可審計的 ASR。
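
One way to picture entropy-guided attention weighting is to give heads with sharper (lower-entropy) attention more say when heads are combined. The sketch below is only an illustration of that idea: the abstract does not state LEAF-X's actual weighting rule, so the softmax over negative mean entropies, the `temperature` parameter, and the toy attention maps are assumptions.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def head_weights(attn, temperature=1.0):
    """Weight each head by a softmax over its negative mean row entropy.

    attn: list of heads; each head is a list of rows, and each row is a
    token-to-frame attention distribution. Lower entropy gives a larger weight.
    """
    mean_entropy = [sum(entropy(row) for row in head) / len(head) for head in attn]
    logits = [-h / temperature for h in mean_entropy]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def combine_heads(attn, weights):
    """Weighted average of the heads' attention maps."""
    n_rows, n_cols = len(attn[0]), len(attn[0][0])
    return [[sum(w * head[i][j] for w, head in zip(weights, attn))
             for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical toy input: one focused head and one diffuse head,
# two tokens attending over three audio frames.
focused = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]
diffuse = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
weights = head_weights([focused, diffuse])
combined = combine_heads([focused, diffuse], weights)
```

Because the weights sum to one, each row of `combined` remains a probability distribution over frames, and the focused head dominates it.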

Fodor and Pylyshyn's Systematicity Challenge Still Stands

2606.14512v1 by Michael Goodale, Salvador Mascarenhas

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

摘要：最近神經網絡在生成類似人類語言方面的成功引起了認知科學界的重大關注，許多研究者認為，關於人類認知的經典難題和對人工智慧的挑戰正被神經網絡解決。一個值得注意的案例是由Jerry Fodor和Zenon Pylyshyn提出的系統性論點，該論點認為人類表現出系統性的雙條件依賴性。例如，某人能理解句子「約翰看見瑪麗」，當且僅當他們理解句子「瑪麗看見約翰」。符號系統解釋了語言和思維的這種系統性，而神經網絡則沒有提供直接的解釋。幾篇最近的文章主張，這一挑戰現在已經被神經網絡所克服。特別是，Brenden Lake和Marco Baroni主張，他們的組合性元學習協議與人類的系統性相匹配，甚至可能解釋了人類的系統性。我們證明這些結論為時過早。在其他結果中，我們發現他們的模型在學習與訓練數據相比稍微偏離分佈的規則時遇到了困難。此外，該模型在許多分佈內的問題上表現得不系統。我們得出結論，Fodor和Pylyshyn對神經網絡的挑戰仍未得到滿足。

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

2606.14157v1 by Paula Joy B. Martinez

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $λ^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.

摘要：城市透過混合的公私設施網絡提供基本服務，包括學校、診所、交通提供者和補貼服務點。在這些系統中，規劃者通常觀察家庭的去向，但並不瞭解它們在距離、價格和機構可及性等因素之間權衡的潛在成本函數。我們通過菲律賓的學校選擇研究這一城市問題，該國最大的國家教育補貼旨在將學習者從擁擠的公立學校引導到參與的私立學校。將學校之間的入學流視為一種熵最優運輸計劃，我們使用兩個互補的逆最優運輸模型恢復潛在的選擇成本：一個可解釋的距離帶模型，帶有補貼項，以及一個通過可微分的Sinkhorn前向傳遞訓練的神經成本模型。應用於最人口稠密地區的283,016次學習者出行，涵蓋23,820個觀察流，該框架估算出一個補貼等效距離$λ^{(k)}$，該距離被解釋為補貼抵消的感知旅行成本的公里數。這一案例展示了如何將行政來源-目的地數據轉化為可解釋的規劃指標，以便進行考慮可及性的補貼設計、設施選址和城市服務分配。
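
The entropic optimal transport plan mentioned in this abstract is commonly computed with Sinkhorn iterations. The sketch below shows that standard forward pass in plain Python on a two-by-two toy problem. The cost form (distance minus a subsidy offset `lam` for subsidized destinations), the numbers, and the regularization strength `eps` are illustrative assumptions, not the paper's fitted model.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT plan between marginals a (origins) and b (destinations).

    Alternately rescales rows and columns of K = exp(-cost / eps) so that the
    plan diag(u) K diag(v) matches both marginals.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical toy setup: two origin schools, two destination schools,
# the second destination participates in the subsidy.
dist = [[1.0, 3.0], [2.0, 1.0]]   # travel distance in km (made up)
subsidized = [0, 1]
lam = 1.0                          # assumed subsidy-equivalent distance in km
cost = [[dist[i][j] - lam * subsidized[j] for j in range(2)] for i in range(2)]

origin_mass = [0.6, 0.4]
dest_mass = [0.5, 0.5]
plan = sinkhorn(cost, origin_mass, dest_mass)
```

An inverse optimal transport method runs this direction in reverse: it adjusts the cost parameters (here only `lam`) until the resulting plan matches the observed flows.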

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

2606.14123v1 by Xiaoran Yan, Cheng Tang, Atsushi Shimada

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.

摘要：已部署的知識追蹤模型在訓練後通常會被凍結，但由於主幹架構中每個項目的表達能力有限，以及部署後項目屬性的變化，導致系統性的每項 logit 偏差，從而降低預測質量。全域事後校準器如 Platt 標定、溫度標定和保序回歸改善了概率估計，但對於 AUC 測量的區分能力則沒有改變。這種 AUC 不變性是單調分數轉換的結構性結果；恢復被孤立的區分能力需要對項目身份進行條件化。我們提出了 SLC（狀態空間 logit 校正），通過 Laplace/IRLS 將二元觀察轉換為高斯偽觀察，通過卡爾曼平滑器應用經驗貝葉斯收縮，並擬合偏移 Platt 連結。狀態空間的公式還產生了一個可檢測性界限，描述了伯努利信息底線，解釋了為什麼在當前數據密度下，時間追蹤沒有提供任何好處。在四個數據集、五個主幹和三個隨機種子中，SLC 在所有四個數據集上改善了 AUC，在三個數據集上改善了 NLL，且優勢集中在稀疏項目上。跨領域的控制表明，當部署的主幹留下實體級別的偏差時，這種現象也可能在教育之外出現。
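
The AUC-invariance claim can be checked on a toy example: a single monotone map applied to every score (temperature scaling here) cannot change the ranking, while a correction that depends on item identity can. The records and the per-item offset below are hand-picked for illustration. The offset is not an empirical-Bayes estimate, and nothing here reproduces SLC itself.

```python
import math

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy records (item_id, logit, label); item "A" is scored systematically too high.
records = [("A", 2.0, 1), ("A", 1.5, 0), ("A", 1.0, 0),
           ("B", 0.5, 1), ("B", 0.0, 1), ("B", -1.0, 0)]
labels = [y for _, _, y in records]

raw = [sigmoid(z) for _, z, _ in records]

# Global temperature scaling: one monotone map shared by all items.
temperature_scaled = [sigmoid(z / 2.0) for _, z, _ in records]

# Per-item logit offset (hand-picked, hypothetical): depends on item identity.
item_offset = {"A": -1.5, "B": 0.0}
item_corrected = [sigmoid(z + item_offset[item]) for item, z, _ in records]

auc_raw = auc(raw, labels)
auc_temp = auc(temperature_scaled, labels)
auc_item = auc(item_corrected, labels)
```

Here `auc_raw` and `auc_temp` coincide because the ordering of scores is untouched, whereas `auc_item` is higher because removing item A's upward bias reorders examples across items.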

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

2606.13896v1 by Julia Romero, Qin Lv, Morteza Karimzadeh

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.

摘要：自我監督的地理空間基礎模型（GeoFMs）從遙感數據中學習可轉移的表示，但其下游行為難以特徵化。我們研究了六個代表性的GeoFMs，涵蓋了聯合嵌入、重建和多模態預訓練類別，並在不同標籤可用性和下游管道下評估分類、回歸和分割基準的轉移。我們發現模型排名在不同任務和適應設置中有所變化。逐層探測顯示，在大多數情況下，與任務相關的信息在中間的Transformer塊中比在最終層嵌入中更易於獲取，並且GeoFMs表現出明顯的深度特徵。在PASTIS和Sen1Floods11的分割案例研究中，下游適應設置如解碼器設計和微調可能與GeoFM的選擇一樣具有影響力，而標準的密集預測頭可能與GeoFMs在深度上組織信息的方式不太一致。最後，對案例研究的CKA分析顯示，微調並不會在深度上均勻地重寫GeoFMs，最強的變化集中在ViT塊的第一個線性層。這些結果有助於解釋為什麼GeoFM的排名在基準之間會變化，並促使更具表示意識的評估和適應策略。

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

2606.13839v1 by Louis Chen, Torbjörn E. M. Nordling

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque--a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e. trustworthy rPPG XAI.

摘要：遠程光電容積描記法（rPPG）Transformer在基準測試中實現了低心率誤差，但它們的決策仍然不透明——隨著rPPG向臨床心率估計的發展，這成為一個日益關注的問題。現有的rPPG可解釋人工智慧（XAI）主要依賴於定性的熱圖檢查，缺乏定量的真實性指標或基於生理學的驗證，這使得視覺上的合理性與可審計的證據之間存在差距。我們針對這一差距進行了研究。首先，我們將四種歸因方法（原始注意力、展開、流動、超越直覺）適應於RhythmFormer的雙層路由注意力，並進行top-$k$選擇。其次，我們引入了一種皮膚覆蓋度指標，量化有多少歸因質量落在皮膚區域上。第三，我們將SaCo真實性係數從其原始分類設置調整為rPPG回歸，通過使用原始和擾動預測的rPPG波形之間的平均絕對誤差（MAE）作為擾動影響。應用這些工具，我們量化了在稀疏top-$k$路由下的多跳洩漏效應：注意力展開和流動幾乎完全恢復了個別精煉注意力層明確設置為零的連接。超越直覺通過其值投影加權的展開和梯度支持的掩碼來減輕這一問題，在評估的UBFC-rPPG方法中達到了最高的中位數精煉皮膚覆蓋度（$0.83$對比$0.57$的普通展開）和真實性（$F=0.92$）。需要在多樣化數據集和模型變體上進行驗證。一項關於低SaCo異常值的案例研究進一步顯示，一旦替換了人為產生的區域，所有四種方法都能一致恢復，這表明在這一示例案例中，歸因家族之間的SaCo行為是一致的。總體而言，這些指標使rPPG的XAI朝著可審計的數字證據邁進，關於空間對齊和擾動真實性，即值得信賴的rPPG XAI。
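
A skin coverage score of the kind described here can be read as the share of attribution mass that falls inside a skin mask. The sketch below assumes a per-pixel attribution map and a binary mask of the same shape, and takes absolute values so that signed attributions do not cancel. Both choices are assumptions, since the abstract does not give the exact definition.

```python
def skin_coverage(attribution, skin_mask):
    """Fraction of total absolute attribution mass that lies on skin pixels.

    attribution: 2D list of floats (one value per pixel or patch).
    skin_mask: 2D list of 0/1 flags with the same shape (1 = skin).
    Returns 0.0 when the attribution map is all zeros.
    """
    total = 0.0
    on_skin = 0.0
    for attr_row, mask_row in zip(attribution, skin_mask):
        for value, is_skin in zip(attr_row, mask_row):
            mass = abs(value)
            total += mass
            if is_skin:
                on_skin += mass
    return on_skin / total if total > 0 else 0.0

# Hypothetical 2x2 example: 0.4 + 0.3 of the 1.0 total mass is on skin.
attribution = [[0.1, 0.4],
               [0.3, 0.2]]
skin_mask = [[0, 1],
             [1, 0]]
coverage = skin_coverage(attribution, skin_mask)
```

On this toy input the score is roughly 0.7, and a map concentrated entirely on background would score 0.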

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

2606.13572v1 by Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

摘要：多模態大型語言模型（MLLMs）在一般領域中顯示出有希望的推理能力，但在專業環境中，如醫療保健，尤其是在多語言和低資源的情境下，其表現仍然有限。這一差距在像印度農村這樣的地區尤為重要，患者經常用母語印地語表達複雜的醫療問題，並依賴醫療影像等多模態輸入。現有以英語為中心的 MLLMs 難以支持這類用例，限制了公平獲得 AI 驅動的醫療協助。為了解決這一挑戰，我們介紹了 ArogyaBodha，這是一個大型多語言多模態醫療問答數據集，該數據集由八個異質來源構建，涵蓋 31 個身體系統、六種影像模態和 21 個臨床領域，涉及英語和七種主要的印度語言。我們進一步提出了 ArogyaSutra，一個基於演員-評論家的多代理框架，將工具基礎與雙重記憶機制整合，用於逐步的、具推理意識的決策，並利用存儲的演員-評論家模擬軌跡進行蒸餾。實驗表明，我們的數據集和框架提高了所有印地語言的多語言醫療推理準確性，消融實驗驗證了每個組件的貢獻。源代碼和數據集可在以下位置獲得：https://iitp-cse.github.io/ ArogyaSutra/

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

2606.13556v2 by Aruna Dey, Suraj Biswas

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)G-hat_genomic + [1-w(t)]P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.

摘要：個性化健康人工智慧系統面臨一個根本的冷啟動問題：生理解釋的機器學習模型需要數週的個體行為數據，才能區分憲法變異與環境驅動的偏差。我們提出了一個基於因果推斷和貝葉斯先驗設計的解決方案。個體的基因組特徵作為外生的遺傳錨點——一個基於領域知識的個性化先驗，固定於受孕時，不受逆因果影響，並且在收集到任何行為觀察之前就已可用。這個錨點初始化了一個貝葉斯信念狀態，該狀態關於個體的生理設置點 G-hat = mu + sum(beta_i * g_i)，其中 beta_i 是 GWAS 衍生的效應大小，g_i 是風險等位基因數量。每個進來的生理測量 P 產生一個非憲法偏差 delta = P - G-hat，這將環境和狀態所造成的信號與憲法固定的基線分開。隨著行為數據的累積，先驗根據 G-hat_t = w(t)G-hat_genomic + [1-w(t)]P-bar_t 衰減，從基因組主導的推斷過渡到經驗基線主導的推斷。同樣觀察到的 HRV 為 55 毫秒，對於一個其先驗預測 80 毫秒的人，產生了一個抑制假設，而對於一個其先驗預測 30 毫秒的人，則產生了一個增強假設——這種反轉在沒有個性化錨點的情況下是不可能的。我們在六個生理領域內發展這一架構，根據證據強度對基因組先驗進行分級，明確區分穩健重複的錨點（FTO、FADS1/2、FKBP5）與有爭議的候選基因（SLC6A4、MAOA、DRD2）。我們解決了關聯、孟德爾隨機化和個體標記因果之間的推斷邊界，並定義了四個部署約束：證據分級的先驗、動態衰減、祖先匹配的效應大小，以及歸因而非確定性輸出。
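
The set-point formula, the deviation, and the decaying prior described above can be traced with a short sketch. The abstract does not specify the weight function w(t), so the exponential half-life form used here is an assumption, as are the function names and the toy effect sizes.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))

def deviation(p, g_hat):
    """delta = P - G-hat: the part of a measurement not explained by the set point."""
    return p - g_hat

def blended_set_point(g_hat_genomic, p_bar_t, t_days, half_life_days=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    # ASSUMPTION: w(t) halves every `half_life_days`; the source does not define w(t).
    w = 0.5 ** (t_days / half_life_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t

def hypothesis(p, g_hat):
    """Label the sign of the deviation, as in the HRV example."""
    d = deviation(p, g_hat)
    if d < 0:
        return "suppression"
    if d > 0:
        return "enhancement"
    return "at set point"

# The HRV example from the abstract: the same 55 ms reading, two different priors.
person_a = hypothesis(55, 80)   # prior predicts 80 ms
person_b = hypothesis(55, 30)   # prior predicts 30 ms

# Toy effect sizes and allele counts (hypothetical values).
g_hat = genomic_set_point(50.0, [10.0, -5.0], [2, 1])
```

With this assumed decay, the prior alone sets the baseline at day 0 and contributes half of it after one half-life.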

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

2606.13302v1 by Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems such as buoys and radar platforms offer accurate monitoring but can have high installation and maintenance costs and limited spatial coverage. Passive ocean monitoring from video has been achieved with deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework is proposed for direct estimation of nearshore wave peak periods from passive coastal video streams. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, including transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures achieved the best instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing showed that model attention concentrated on hydrodynamically active surf-zone regions and agreed well with physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

摘要：近岸的波浪參數對於海岸工程、海岸線保護、海洋危害評估以及氣候韌性的海岸管理至關重要。傳統的監測系統如浮標和雷達平台提供準確的監測，但可能會有高昂的安裝和維護費用以及有限的空間覆蓋範圍。利用深度學習實現的被動海洋監測已經取得進展，然而，許多方法在物理上不可解釋、不可行，且未經海洋學驗證。在本研究中，提出了一種物理引導的深度時空學習框架，用於從被動海岸視頻流中直接估算近岸波峰周期。該框架結合了基於自動時間變異的區域興趣檢測、多階段的模擬到實際轉移學習以及物理知識引導的正則化，以提高預測準確性和物理一致性。評估了多種時空架構，如基於Transformer的和循環卷積的架構，以及合成預訓練、銀標適應和專家微調。結果顯示，基於Transformer的架構在瞬時預測的準確性方面表現優於其他架構，而輕量級的循環卷積架構則實現了更高的時間穩定性和操作海洋學技能。消融研究還顯示了物理引導正則化在趨勢跟隨一致性和物理上不合理預測方面的好處。可解釋性審計還有助於將注意力集中在水動力活躍的衝浪區域，並顯示出與物理推導的波浪傳播行為良好的一致性。總的來說，所提出的框架顯示了物理引導的基於視頻的深度學習系統在長期海岸波浪監測中的潛力，這些系統具有成本效益和操作可行性。

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

2606.13211v1 by Omar Alshahrani, Muzammil Behzad

AI systems are being deployed across medical imaging faster than their failure modes are understood. At present, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for, among others, biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

摘要：AI 系統在醫學影像領域的部署速度超過了對其失效模式的理解。此時，最大的臨床關注點是幻覺：臨床上看似合理但事實上不正確的輸出，包括虛構的解剖結構、漏診、錯誤的側別以及生成報告中的虛構測量，這些都會直接影響，例如，活檢決策、分期和治療計劃。這篇結構化敘述綜合了同行評審的研究、基準數據集和 FDA 的監管指導，涵蓋五種影像模式，以產生對幻覺分類、病因、檢測和緩解的跨模式分析。具體而言，我們在這項研究中解決三個問題：(1) 如何統一現有的分類法？(2) 醫學專用的基礎模型如何比通用模型產生更少的幻覺？以及 (3) 哪些緩解策略是有效的並且與 FDA 生命週期監管相容？我們注意到，三個分類框架共同覆蓋了影像流程，而單一框架無法單獨做到這一點。我們還強調，通用基礎模型在針對幻覺的基準測試中表現優於醫學專用模型，這表明狹窄領域的微調可能會導致過擬合引起的虛構。同時，放射科醫生的監督仍然至關重要；例如，極高比例的 AI 生成標記在臨床使用前需要專家修正。基於物理的架構約束、思考鏈提示以及人機協作的安全措施各自針對不同的失效模式，並且在結合使用時效果良好。所有發現都映射到 FDA 的總產品生命週期和預定變更控制計劃框架，將幻覺管理視為生命週期的責任，而非部署前的檢查清單。

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

2606.13135v1 by Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.

摘要：目的。比較深度學習架構和皮膚腫瘤的皮膚鏡圖像分類方案，並評估其從開放國際數據集轉移到俄羅斯臨床實踐獨立數據集的泛化能力。方法。比較了四種架構（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）在三種方案中的表現：二元（惡性/良性）、單階段四類（良性、MEL、SCC、BCC），以及兩階段級聯（二元篩選，然後三類區分MEL/SCC/BCC）。所有模型均使用ImageNet預訓練權重和單一增強協議，基於聚合的開放ISIC檔案數據進行訓練，並在內部保留樣本和兩個臨床數據集（Melanoscope AI移動系統；塞琴諾夫大學）上進行評估。結果。在內部，二元階段達到ROC-AUC 0.952-0.966；在塞琴諾夫大學下降至0.797-0.893，靈敏度降至0.53-0.67，ECE從0.02上升至0.27-0.39，並低估了惡性腫瘤，量化了排名和校準中的泛化差距。配對測試確認了一個臨床數據上的架構間結果：ViT-B/16在二元階段的不足（p<0.05）；在區分階段，沒有架構具有明顯優勢。級聯方法在大多數架構中提高了宏觀F1分數，超過單階段四類分類，但對於ViT-B/16的提升顯著，因為它恢復了被分配到主導良性類別的惡性病變。在ISIC MILK10k上，直接的11類分類產生的平均類別靈敏度為0.525。結論。可調的篩選閾值提供了在標準單階段（argmax）分類中無法實現的靈敏度控制，並更好地重現臨床鑑別診斷邏輯。持續的泛化差距要求在部署之前進行外部臨床驗證和重新校準。
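
The two-stage cascade with a tunable triage threshold can be sketched as follows. The function name, the class ordering, and the example probabilities are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

MALIGNANT_CLASSES = ("MEL", "SCC", "BCC")

def cascade_predict(p_malignant: float, subtype_probs, threshold: float = 0.5) -> str:
    """Stage 1: binary triage at `threshold`. Stage 2: argmax over MEL/SCC/BCC."""
    if p_malignant < threshold:
        return "benign"
    return MALIGNANT_CLASSES[int(np.argmax(subtype_probs))]

# Hypothetical lesion: stage-1 malignancy probability 0.3, stage-2 output favours SCC.
subtype_probs = [0.2, 0.5, 0.3]
at_default = cascade_predict(0.3, subtype_probs, threshold=0.5)
at_sensitive = cascade_predict(0.3, subtype_probs, threshold=0.2)
```

Lowering the threshold sends more lesions to the second stage, which raises sensitivity at the cost of more false positives. A single-stage argmax classifier has no equivalent single knob.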

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

2606.12900v1 by Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

摘要：大型語言模型（LLMs）經常會產生事實上不正確或不忠實的內容，這對其安全使用構成了重大風險。在零來源約束下檢測這種幻覺尤其具有挑戰性，因為沒有模型內部或外部參考可用，檢測必須僅依賴文本查詢-回答對。在本文中，我們提出了人類類似標準探測幻覺檢測（HCPD），這是一種模擬人類評估者多面向推理的範式。其核心是一個人類類似標準探測（HCP）機制，在此機制中，LLM代理根據加權的可解釋標準自適應地分解其判斷，並將標準特定的分數匯總為最終的真實性度量。為了實現這種自適應能力，我們引入了一種基於獎勵的對齊方案，僅使用來自語義一致性的弱監督。在推理過程中，我們採用多重採樣聚合策略，以確保穩健的決策，同時保持完全的可解釋性。我們進一步提供理論分析以支持我們方法的可靠性。大量實驗表明，HCPD始終優於最先進的基準，提供了一種有效且可解釋的零來源幻覺檢測解決方案。代碼可在 https://github.com/TRISKEL10N/HCPD 獲得。
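
The aggregation of weighted criterion scores into one truthfulness measure, followed by multi-sample aggregation, might look roughly like the sketch below. The weighted-mean form, the mean over samples, the 0.5 decision threshold, and all names are assumptions; the abstract does not give the actual formulas.

```python
from statistics import mean

def truthfulness(criteria):
    """Weighted mean of criterion scores; `criteria` is a list of (weight, score) pairs."""
    total_weight = sum(w for w, _ in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    return sum(w * s for w, s in criteria) / total_weight

def aggregate_samples(samples, threshold=0.5):
    """Average truthfulness over several sampled judgments; flag if below threshold."""
    score = mean(truthfulness(c) for c in samples)
    return score, score < threshold  # True means "flag as hallucination"

# Two hypothetical sampled judgments, each with its own criteria and weights.
samples = [
    [(1.0, 0.2), (1.0, 0.4)],   # weighted mean 0.3
    [(1.0, 0.1), (3.0, 0.5)],   # weighted mean 0.4
]
score, flagged = aggregate_samples(samples)
```

Because each criterion keeps its own weight and score, the final decision can be traced back to the criteria that drove it.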

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

2606.12896v1 by Junfeng Guo, Heng Huang

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

摘要：雖然強化學習（RL）的實際應用越來越受歡迎，但RL系統的安全性卻值得更多的關注和探索。特別是，最近的研究顯示，RL代理容易受到後門攻擊的影響，受害者代理在標準條件下表現正常，但在特定觸發器被激活時會執行惡意行為。現有的RL後門防禦要麼需要訪問代理的內部參數，要麼僅在模型或軌跡層面運作，或僅限於特定的攻擊類型。為了確保RL代理的安全性，我們提出了\texttt{PolicyGuard}，這是一種\textit{測試時步驟級別}的後門防禦，利用高斯過程（GP）後驗方差並調整偽軌跡，以便對個別時間步進行不確定性計算。此外，我們還提供了理論基礎來解釋GP後驗方差的有效性。在七個RL遊戲中的大量實驗表明，PolicyGuard在大多數情況下達到了最先進的檢測性能，對於基於擾動的攻擊平均AUROC為0.856，對於對抗代理攻擊則為0.859。
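
The idea of using GP posterior variance as a per-step uncertainty signal can be shown with a generic sketch. This is the textbook GP variance formula with an RBF kernel, not PolicyGuard itself; how states are encoded, how pseudo trajectories are built, and the 0.5 threshold are all assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior_variance(train_x, query_x, noise=1e-4, length_scale=1.0):
    """Posterior variance of a zero-mean GP at each query point."""
    k_train = rbf_kernel(train_x, train_x, length_scale) + noise * np.eye(len(train_x))
    k_cross = rbf_kernel(train_x, query_x, length_scale)   # shape (n, m)
    solved = np.linalg.solve(k_train, k_cross)              # (K + noise*I)^-1 k_*
    prior_var = np.ones(len(query_x))                       # k(x, x) = 1 for this kernel
    return prior_var - (k_cross * solved).sum(axis=0)

# Hypothetical 2-D state features observed during clean play.
clean_states = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
queries = np.array([[0.05, 0.05],   # close to previously seen states
                    [5.0, 5.0]])    # far from anything seen
variances = gp_posterior_variance(clean_states, queries)
flags = variances > 0.5  # hypothetical per-step threshold
```

States near the clean data get a variance close to zero, while a state far from it keeps the prior variance and is flagged.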

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

2606.12824v1 by Daniel Soliman

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

摘要：AI 醫療影像治理正在正式化：2026 年 ACR-SIIM 實踐參數建議進行本地接受測試和持續的漂移監控，而 ACR Assess-AI 註冊處則利用 DICOM 元數據監控 AI 輸出以提供上下文。我們主張，在輸出指標之下，存在一個必要的、目前未被監控的層面：即進來的研究是否仍然在模型驗證時的獲取範圍內。使用 LUNA16 訓練的 MONAI RetinaNet 肺結節檢測器，我們測試獲取狀態是否作為一個結構化的、可測量的變量。在僅在重建核上有所不同的真實配對 CT（NLST B30f 與 B80f）上，僅核便改變了 AI 測量的直徑，並在固定的患者和獲取下將 5.2%（8/155）的結節的 Fleischner 大小類別翻轉，而檢測信心則保持不變（Wilcoxon p=0.22）。在受控的 LIDC-IDRI 擾動下，影響按軸分離：噪聲軸降低了檢測信心（p=5.9e-32，集中在小於 6 mm 的結節上），但未影響測量，而頻率/核軸則損壞了測量（p=8.6e-13），但未影響檢測。一個 4 特徵像素指紋恢復了重建身份（在真實 CT 上的患者級 AUC 約為 0.95，在 QIBA 幻影上為 0.995），而 ConvolutionKernel DICOM 標籤則無法提供有用信息（在重建中標籤相同）。因此，核軸在四個製造商之間傳輸（去除一個供應商的 AUC 為 0.94-0.98，與供應商內的上限相匹配）。獲取狀態因此映射到不同的 AI 失效模式，頻率內容映射到測量可靠性，噪聲映射到檢測敏感性，且無法從元數據中恢復。具備獲取意識的輸入端驗證是當前進入影像 AI 認證的接受測試和漂移監控要求中缺失的層面。

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

2606.12699v1 by Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

摘要：2型糖尿病（T2D）對全球健康構成日益嚴重的威脅，迫切需要有效的血糖評估以支持個性化和改善的糖尿病護理。可穿戴傳感器，如持續血糖監測儀（CGM）和健身追蹤器，為血糖評估提供了許多有價值的見解。然而，有效分析這些數據需要與重要的個體層面背景整合。現有的方法通常基於傳統的機器學習（ML），主要依賴歷史血糖測量，並忽略個性化信息，這限制了它們在多樣化糖尿病人群中的表現。最近在大型語言模型（LLMs）方面的進展已顯示出它們能夠整合多樣的數據模態，同時建模序列依賴性，這激勵我們探索它們在個性化血糖評估中的潛力。在本文中，我們提出了GlyLLM，一個基於LLM的框架，用於通過整合可穿戴傳感器數據和結構化元數據來建模基於CGM的血糖動態。GlyLLM可以利用預訓練LLMs的廣泛先驗知識，並在決策時實現傳感器-文本語義抽象。在AI-READI數據集上的兩個相關任務實驗表明，我們的模型在血糖預測中比傳統的ML方法平均提高了13.66\%的均方根誤差（RMSE），在糖尿病分類中提高了13.08\%的接收者操作特徵曲線下的面積（AUROC）。此外，我們的消融研究顯示，糖尿病調查和生物識別測試對於血糖評估比其他健康信息更為關鍵。我們的工作為利用LLMs的力量推進T2D護理中的個性化血糖評估邁出了有希望的一步。

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

2606.12346v1 by Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

摘要：Hematoxylin 和 eosin (H&E) 染色是組織病理學的基石，但 H&E 全片影像 (WSIs) 的可擴展、定量分析仍然是計算病理學中的一個主要挑戰。我們提出了 Atlas H&E-TME，一個基於 AI 的系統，建立在 Atlas 病理學基礎模型家族上，能夠預測多種癌症類型的組織質量、組織區域和細胞類型標籤，每張幻燈片提供超過 4,500 個細胞級別解析的定量讀數。驗證此類系統的一個主要挑戰是克服 H&E 僅有的真實標準中固有的形態學模糊性，以及基於免疫組織化學 (IHC) 等模式的更具信息性的參考的有限可擴展性。我們通過一個雙重驗證框架來解決這個問題，結合了生物學上扎實的深度與技術和形態學的廣度。在深度方面，我們提出了一個 IHC 資訊的多病理學家共識協議，顯著提高了與傳統 H&E 僅有標註相比的評估者間一致性。這提供了一個分子基礎的參考，與我們比較 Atlas H&E-TME 和僅使用 H&E 的病理學家。在廣度方面，我們在超過 200,000 個高信心的 H&E 僅有病理學家標註上對 Atlas H&E-TME 進行基準測試，這些標註來自 1,500 多個案例，涵蓋八種癌症類型及其最常見的轉移部位，亞型覆蓋每種癌症類型超過 90% 的臨床案例，來自 25 多個來源和 8 種以上的掃描儀模型。與 IHC 資訊的共識進行基準測試後，Atlas H&E-TME 的表現與病理學家的 H&E 僅有表現相匹配或超過，並在這個廣泛的形態學和技術範疇內持續且穩健地進行泛化。通過這樣做，Atlas H&E-TME 將 H&E 幻燈片——病理學中最普遍的數據——轉變為一個可擴展的、定量的窗口，觀察腫瘤及其微環境，為轉化和臨床研究中的下一代基於組織的生物標誌物奠定基礎。

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

2606.12252v1 by Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

摘要：訓練深度神經網絡以進行臨床時間序列分析在計算上要求甚高，然而許多醫療環境缺乏重複模型開發和部署所需的資源。這一挑戰在心電圖分類中特別明顯，因為大型數據集和長時間的訓練計劃使得效率變得非常重要。漸進式數據丟棄通過在樣本被學習後排除其對梯度更新的貢獻來降低訓練成本，但它依賴於模型信心，可能會保留由於噪聲或模糊而難以處理的樣本，而不是有用的信號。在這項工作中，我們介紹了ERTS，一種基於可解釋性的可靠性訓練信號，用於高效的心電圖分類。ERTS在訓練期間使用解釋質量來區分信息性和不可靠的不確定性。基於漸進式數據選擇，我們計算候選樣本的Grad-CAM注意力圖，並導出一個焦點分數，以衡量模型預測是否得到一致且局部化模式的支持。低焦點的樣本會被過濾掉，而那些具有意義的注意力的樣本則優先進行梯度更新。我們在三個心電圖數據集和多個主幹架構上評估了ERTS，顯示出宏觀F1分數的一致改善，同時有效的訓練成本降低。這些結果表明，解釋質量可以作為改善臨床時間序列學習中效率和可靠性的實用信號。代碼將會發布。

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

2606.12251v1 by Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

摘要：梯度基礎的對抗攻擊仍然是深度神經網絡（DNNs）面臨的主要威脅，因為它們利用梯度信息有效地優化對抗擾動。為了解決這個問題，我們研究增強學習（RL）訓練是否能通過使用策略梯度目標和ε-貪婪探索來破壞攻擊者使用的梯度結構，從而訓練圖像分類器。通過在CIFAR-10、CIFAR-100和ImageNet-100上進行多種架構的系統實驗，我們發現RL訓練的分類器顯著破壞了基於梯度的對抗優化。為了解釋這一點，我們進行了全面的機制分析，使用損失景觀可視化、靜態和動態梯度指標以及預測熵。我們的分析顯示，RL作為一種隱式正則化器，產生具有高度不穩定梯度方向和較小梯度幅度的模型。這種組合使得每一步PGD在方向上都不可靠且幅度有限，導致基於梯度的攻擊在實際迭代預算內失敗。我們進一步顯示，將RL與對抗訓練結合（RL-adv）提供了一種雙層防禦，運作於兩個互補層面：RL降低了攻擊者可用的梯度信息（梯度層防禦），而對抗訓練則加強了決策邊界（邊界層防禦）。RL-adv在評估的所有主要攻擊類型中實現了最高的魯棒性，包括基於梯度的（PGD、AutoAttack）、基於轉移的和基於查詢的攻擊，並且顯著超越了SL-adv。這些發現確定了RL引起的梯度破壞作為一種互補的魯棒性機制，並激勵未來對結合SL效率和RL梯度正則化特性的混合SL-RL訓練計劃進行研究。

Towards Responsibly Non-Compliant Machines

2606.12147v1 by Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly refusing to comply with user requests. We argue that machine non-compliance comes in many different forms, and we sketch the issues that should be pursued on the road to responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, and careful tracking of security risks and liability transfers.

摘要：我們考慮工程自主智能代理的問題，這些代理能夠負責任地不遵從用戶請求。我們認為機器的不遵從有許多不同的形式，並勾勒出我們在實現負責任的不遵從智能機器的過程中應該追求的問題。我們將負責任的不遵從建立在拒絕任務的理由、覆蓋不遵從的途徑，以及對安全風險和責任轉移的仔細追蹤上。

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

2606.12138v1 by Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.

摘要：稀疏自編碼器（SAEs）被廣泛用於解釋神經網絡表示，但它們的效用取決於學習到的特徵是否在訓練過程中可重現。我們通過\emph{特徵穩定性}來研究這個問題：對於每個SAE特徵，我們估計在獨立訓練的SAE中相似特徵重新出現的概率。這產生了一個可擴展的每特徵信號，將穩定特徵與不穩定特徵區分開來。在一項大規模的研究中，涵蓋了隨機種子、模型、層、字典大小和SAE變體，我們發現了一種明顯的功能不對稱性：穩定特徵攜帶大部分重建和預測相關信號，而不穩定特徵的邊際影響較弱，並且在激活統計和自動解釋中被低頻表面形式觸發所主導。在幾何上，不穩定特徵在個體上是不可重現的，但集中在可重現的低秩子空間中，這表明種子依賴性往往反映了共享激活空間區域內的基底模糊性，而不是純粹的噪音。一個受控的合成模型使這一機制變得明確，顯示低秩真實特徵可以在子空間層面上被恢復，同時在不同種子之間仍然無法識別為個別的SAE潛變量。最後，通過聚合獨特的跨種子特徵，我們構建了更穩定的SAEs，同時在這一設置中保留了解釋的變異性。綜合這些結果顯示，不穩定特徵不僅僅是失敗或嘈雜的潛變量：它們具有弱的個體功能影響，但反映了可重現的低維結構，而標準SAEs在不同種子之間以不同方式解決這些結構。
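
A per-feature stability estimate of the kind described above can be sketched by matching feature directions across independently trained dictionaries. Treating "similar" as a maximum cosine similarity above a fixed threshold, and the 0.9 cut-off, are assumptions; the paper's matching criterion may differ.

```python
import numpy as np

def max_cosine_matches(dict_a: np.ndarray, dict_b: np.ndarray) -> np.ndarray:
    """For each row (feature direction) of dict_a, the best cosine match in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def feature_stability(reference: np.ndarray, others, threshold: float = 0.9) -> np.ndarray:
    """Fraction of other seeds in which each reference feature has a close match."""
    hits = np.stack([max_cosine_matches(reference, o) >= threshold for o in others])
    return hits.mean(axis=0)

# Toy dictionaries with two features in a 2-D activation space.
reference = np.array([[1.0, 0.0], [0.0, 1.0]])
seed_1 = np.array([[1.0, 0.0], [0.0, 1.0]])     # recovers both directions
seed_2 = np.array([[2.0, 0.0], [-1.0, 0.0]])    # recovers only the first direction
stability = feature_stability(reference, [seed_1, seed_2])
```

In this toy case the first feature reappears in both seeds and the second in only one, so the estimate ranks the first as the more stable.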

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

2606.12006v1 by Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.

摘要：預測事件發生時間的結果，例如死亡率，是臨床決策中的一項基本任務，通常通過生存分析來解決。雖然傳統的統計方法和深度學習方法已被廣泛研究，但這些方法通常需要特定任務的訓練和足夠的標記數據。最近在表格基礎模型方面的進展提供了一種新的範式，通過學習結構化數據的一般性表示來進行處理。然而，這些模型在臨床環境中對於被審查的事件預測的適用性仍然未被充分探索，因為典型應用主要限於離散分類，而非生存分析任務。在本研究中，我們提出了一種輕量級的適應方法，通過在預訓練表示的基礎上直接訓練一個生存感知的頭部，將表格基礎模型應用於臨床生存分析。我們研究了代表性的架構，包括TabPFN、TabDPT和TabICL，並使用多任務邏輯回歸（MTLR）頭部進行調整，以建模右審查的事件結果。我們在一組多樣的公共生存基準和兩個大型ICU隊列（MIMIC-IV和eICU）上評估了這一方法。我們的結果顯示，這種轉移學習方法在與強基準相比時，達到了競爭或更優的性能。在MIMIC-IV上，TabDPT-FT-MTLR達到了0.856的C指數，相當於比最佳非FM基準（DeepSurv，0.844）提高了+1.4%，比最佳零樣本模型（0.802）提高了+6.7%。在eICU上，TabICL-FT-MTLR達到了0.797，分別帶來+1.7%（DeepSurv，0.784）和+6.4%（0.749）的增益。這些發現突顯了將預訓練的表格表示與生存感知目標相結合的重要性，並表明表格基礎模型為臨床生存預測提供了一種實用且有效的替代方案。
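
The C-index reported in the abstract above can be illustrated with a minimal sketch of Harrell's concordance index under right-censoring. The function name, the toy data, and the choice to ignore tied event times are assumptions made for illustration; the abstract does not say which C-index variant the authors use.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times  : observed time per subject (event or censoring time)
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a comparable pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): subject 1 is censored at t=4.
toy_times = [2, 4, 6, 8]
toy_events = [1, 0, 1, 1]
toy_risks = [0.9, 0.3, 0.1, 0.6]
print(concordance_index(toy_times, toy_events, toy_risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the quoted 0.856 and 0.797 should be read.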

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

2606.11930v2 by Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.

摘要：預測來自非同步視頻面試（AVI）的心理特徵是一個在AI輔助面試評估中具有挑戰性的問題，因為標記數據集有限，而每個回應包含高維度的視覺、聲音和語言信號。本文提出了我們對2026年ACM多媒體AVI挑戰的解決方案，該挑戰評估兩項任務：Track~1從與人格相關的面試回應中預測自我報告的HEXACO人格特徵，Track~2則從結構化的AVI回應中分類認知能力水平。我們將這個問題視為一個小樣本表示學習任務。我們不對大型預訓練模型進行微調，而是使用凍結的多模態編碼器，包括用於視覺特徵的CLIP、用於聲音特徵和文字稿的Whisper，以及用於文本表示的RoBERTa、E5和DeBERTaV3，然後再用低容量的下游模型。對於Track~1，我們的特徵特定回歸和後期融合系統達到了0.2696的平均驗證均方誤差（MSE），超過了官方基準的0.3334。消融結果顯示，從全局模型（0.3189）到每個特徵建模（0.2871），再到每個特徵的後期融合（0.2696），經歷了三步改進，對應於相對於官方基準的19.1% MSE減少。對於Track~2，一個緊湊的主題-屬性基準達到了0.5781的準確率，而我們的多模態集成達到了0.5313，均高於官方基準的0.4062。我們將這一結果解釋為在驗證拆分中可能存在的主題-屬性捷徑的證據，而不是從AVI內容中進行穩健的認知推斷。總體而言，我們的發現表明，基於AVI的心理評估受益於特徵特定的多模態建模，但認知能力預測需要對數據集捷徑進行仔細控制。
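
The 19.1% figure in the abstract above follows directly from the quoted MSE values. The short sketch below reproduces that arithmetic for each ablation step; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


OFFICIAL_BASELINE_MSE = 0.3334  # Track 1 official baseline, from the abstract

# Validation MSE at each ablation step, as quoted in the abstract.
ablation_mse = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ablation_mse.items():
    print(f"{name}: {relative_reduction(OFFICIAL_BASELINE_MSE, mse):.1f}% lower MSE")
```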

Beyond representational alignment with brain-guided language models for robust reasoning

2606.11893v1 by Mingqing Xiao, Kai Du, Zhouchen Lin

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

摘要：大型語言模型（LLMs）與人類高階認知背後的神經機制之間的對應關係仍然不足以被充分描述。考慮到人腦中的語言和推理似乎是可分離的，一個未解的問題是，LLMs 是否與推理相關區域的神經信號對齊，以及這些信號是否能改善它們。在此，我們專注於演繹推理，顯示 LLM 的內部表徵不僅部分與任務功能性磁共振成像（fMRI）活動對齊，還可以被這些信號直接增強。使用神經預測性指標，我們發現 LLM 在整體層面上解釋了推理相關區域中可解釋變異的相當一部分，而在特定推理類型中的預測性較低，這表明了對齊和分歧的存在。在此基礎上，我們提出了一個腦導向框架：我們沿著模型和大腦表徵的聯合結構所誘導的方向引導模型表徵，在推理時應用干預並在訓練期間進行微調。我們證明了任務引發的大腦信號可以直接增強 LLM 的推理，帶來與僅依賴語言的監督相互正交的增益，涵蓋 10 個 LLM（1.5B-72B），並在推理類型之間轉移，實現高達 13\% 的絕對準確率增益。我們的結果將 LLM 與大腦的對應關係從相關性推進到引導，建立了一條以大腦信號驅動的通道，朝向更穩健且與認知對齊的人工智慧。

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

2606.11830v1 by Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

摘要：背景。大型語言模型和人工智慧代理越來越多地用於支持生物醫學研究，但原生模型的輸出可能省略關鍵的分析步驟、誤用方法或過度陳述結論。我們評估了自主訪問醫學研究技能包是否與較高質量的AI生成轉錄組研究分析輸出相關，與沒有技能的原生AI相比。方法。我們使用非小細胞肺癌免疫療法生物標記任務進行了探索性的多模型人類評估。測試了六個模型骨幹。評估包括21個匿名輸出：9個原生AI輸出和12個通過AI代理實現的技能增強輸出，該代理由OpenClaw表示。四位非專家生物醫學評審和兩位盲評專家評估了每個輸出，每種類型的評審提供了兩個評分。主要結果是專家評定的整體質量。結果。技能增強輸出的專家整體質量方向性上高於原生AI輸出（平均5.50對5.11；差異=0.39；自助法95\% CI，-0.04至0.90；Welch p=0.156）。非專家評審的質量顯示相同的方向（平均4.72對4.47；差異=0.26；自助法95\% CI，-0.25至0.80；Welch p=0.373）。專家之間的協議有限（單次評分ICC=-0.15），模型特定的效應是描述性的和異質的。結論。在這個探索性樣本中，自主技能訪問顯示出方向性的質量信號，但該信號小於專家評分的噪音，不應被解釋為確認性證據。這些發現主要促使對技能增強AI代理進行更大規模的評估，並加強可靠性控制、平台重複性和生物有效性評估。

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

2606.11794v1 by Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.

摘要：神經退行性疾病，如阿茲海默症（AD），需要準確且可擴展的工具來評估疾病嚴重程度，但目前的臨床分期仍然耗時且容易變異。我們提出了一種增強注意力的多模態機器學習框架，結合序數回歸，用於自動化且可解釋的AD嚴重程度分期。該框架整合了T1加權MRI與人口統計和遺傳變數，並使用序數和非序數預測頭比較單模態和多模態架構。模型使用來自ADNI、AIBL和NIFD數據集的隊列分層拆分進行訓練和驗證。嚴格保留的測試集是使用所有訓練、驗證、預處理和超參數調整程序中排除的受試者構建的，並在整個過程中採用了受試者級別的拆分以防止數據洩漏。在單模態方法中，T1加權MRI模型的相鄰階段準確率（0.963）和與臨床分期的一致性（QWK 0.444）略高於表格模型（QWK 0.433）。整合影像、人口統計和遺傳信息提高了整體性能。多模態非序數基線達到了最低的預測誤差（MAE 0.340），而序數多模態模型則達到了最高的相鄰階段準確率（0.970）和與臨床分期的最強一致性（QWK 0.549）。這些發現表明，序數公式更好地捕捉了CDR量表的有序結構，並產生與臨床分期更一致的預測。使用Grad CAM++和SHAP的可解釋性分析顯示了在解剖學和臨床上合理的模型行為，支持透明的決策過程。總體而言，基於注意力的多模態學習結合序數回歸代表了一種穩健、可解釋且可擴展的自動化AD嚴重程度分期和AI輔助臨床決策支持的方法。
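
The two ordinal metrics quoted in the abstract above, quadratic weighted kappa (QWK) and adjacent-stage accuracy, can be sketched as follows. The stage encoding (integers 0..K-1) and the reading of "adjacent-stage accuracy" as "prediction within one stage of the label" are assumptions made for this illustration, not details confirmed by the abstract.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * O) / sum(w * E), with w_ij = (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    k = num_classes
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of labels and predictions.
            weighted_expected += w * hist_true[i] * hist_pred[j] / n
    # Note: undefined (division by zero) if all labels and predictions share one class.
    return 1.0 - weighted_observed / weighted_expected


def adjacent_accuracy(y_true, y_pred):
    """Fraction of predictions within one stage of the true stage (assumed definition)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


# Toy example with four hypothetical severity stages.
labels = [0, 1, 2, 3]
preds = [1, 1, 2, 2]
print(quadratic_weighted_kappa(labels, preds, num_classes=4))
print(adjacent_accuracy(labels, preds))
```

The toy example shows why the two metrics can diverge: every prediction is within one stage, yet QWK is well below 1 because the off-by-one errors are still penalised.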

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

2606.17073v1 by Bastien Dussard, Guillaume Sarthou

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.

摘要：雖然常識知識對虛擬代理可能足夠，但與人類互動的具身機器人則需要對其環境及自身物理體現的有根據且語義豐富的表徵。在認知機器人學中，本體論有效地整合這些異質知識，以實現可解釋的推理，即使在持續的知識更新過程中也是如此。然而，其手動構建仍然是一個瓶頸。我們提出了一種初步方法，通過將統一機器人描述格式（URDF）模型轉換為填充本體，自動生成機器人語義抽象。儘管URDF文件提供了結構和運動學描述，但其標識符通常需要常識解釋以恢復有意義的語義，而這正是大型語言模型（LLMs）擅長的任務。我們的流程利用LLMs通過用現有本體的概念提示它們來推斷語義關係，確保最終分類與正式模型保持一致。為了提高可靠性，該流程結合了多個LLM查詢的多數投票，以及語法和模式層級的驗證，以確保生成的輸出符合預期的表徵格式和本體約束。我們在多個機器人描述上評估該方法並討論生成的抽象。初步結果表明，所提出的方法可以有效地彌合低階機器人描述與人機互動所需的結構化、有根據的知識表徵之間的差距。
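
The reliability step described in the abstract above (majority voting across several LLM queries plus schema-level validation) might look roughly like the sketch below. The ontology class names, the agreement threshold, and the function name are hypothetical; the actual LLM calls are replaced by a plain list of answer strings.

```python
from collections import Counter

# Hypothetical set of concepts taken from an existing ontology.
ALLOWED_CLASSES = {"Gripper", "Arm", "Wheel", "Camera"}


def majority_vote(answers, allowed, min_agreement=0.5):
    """Pick the most frequent schema-valid answer, or None if agreement is too low.

    answers       : raw strings returned by repeated LLM queries for one URDF link
    allowed       : set of class names permitted by the ontology
    min_agreement : minimum share of all queries that must back the winner
    """
    # Schema-level validation: discard anything that is not a known class.
    valid = [a.strip() for a in answers if a.strip() in allowed]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    # Invalid answers still count in the denominator, so noisy runs are rejected.
    return label if count / len(answers) > min_agreement else None


# Five simulated responses for a link named, say, "left_finger_link".
print(majority_vote(["Gripper", "Gripper", "Hand", "Arm", "Gripper"], ALLOWED_CLASSES))
```

Keeping invalid answers in the denominator is one possible design choice: it makes the vote conservative when the model frequently leaves the ontology's vocabulary.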

MedCTA: A Benchmark for Clinical Tool Agents

2606.11702v1 by Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

摘要：為了做出臨床基礎的決策，醫療 AI 代理預期能超越簡單的識別，具備工具檢索、證據獲取和整合的能力。現有的基準主要評估孤立的感知或單輪的問題回答，因此對於計劃失敗、工具招募和推行可靠性提供了有限的可見性。我們介紹了 MedCTA，一個用於評估醫療工具代理的基準，基於臨床醫生驗證的、隱含步驟的任務，這些任務根植於現實的多模態臨床輸入，包括放射學影像、病理切片和報告。MedCTA 包含 107 個真實世界的臨床任務，具有臨床醫生驗證的可執行軌跡，涵蓋 5 個已部署的工具，並支持對工具選擇、論據有效性、執行穩定性、軌跡保真度和結果質量的過程感知評估。我們對 18 個開源和閉源的多模態模型進行基準測試，發現即使是最前沿的系統在多步臨床工具使用中仍然脆弱：自主推行受到協議失敗、過早停止和不正確工具招募的主導，而黃金標準的工具路由雖然帶來了巨大的但仍然不完整的收益。這些結果表明，強大的基幹感知並不會轉化為臨床環境中可靠的代理行為。MedCTA 提供了一個嚴謹的測試平台，用於審核、診斷和推進可信的醫療 AI 代理。數據集和評估套件可在 https://ivul-kaust.github.io/MedCTA/ 獲得。

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

2606.12476v2 by Igor Itkin

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.

摘要：標記級別的幻覺檢測器作為分類器進行評估，通過所有標記的AUC來衡量，然而流式監控則根據其反應時間來判斷：在幻覺開始和警報之間通過的標記數量。我們將幻覺開始檢測形式化為一個最快變化檢測問題。一個基於潛在真實/幻覺狀態的一階馬可夫模型，在RAGTruth上進行驗證，將任務置於經典的變化點理論之內，並產生Lorden的檢測延遲下限：在假警報率為0.01時約為1.3個標記。我們接著展示了一個因果循環標記器作為具有學習增量的CUSUM運作。在它捕捉到的開始中，它在11-13個標記內檢測到，而線性每標記基準則為31，儘管在這個假警報預算下，每個檢測器捕捉到的開始都不到三分之一，而回憶誠實的延遲為56-66個標記：低假警報的開始檢測是困難的。一個受控的分解將速度優勢主要歸因於更好的每標記得分，而不是時間累積。Donsker-Varadhan類型的信息率最優定理解釋了剩餘的量級差距：學習得分僅實現了特徵所承載的散度的1/4.5，這一缺口無法通過重新校準消除，其餘則是有限視野效應。分類指標掩蓋了這一延遲結構；序列分析使其可測量。
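
The CUSUM recursion at the centre of the abstract above can be sketched in a few lines. Here the per-token increment is supplied as a plain list standing in for a log-likelihood ratio or a learned score; the values and threshold are toy numbers chosen for illustration, not figures from the paper.

```python
def cusum_alarm(increments, threshold):
    """Return the index of the first token at which the CUSUM statistic
    S_t = max(0, S_{t-1} + x_t) reaches `threshold`, or None if it never does."""
    s = 0.0
    for t, x in enumerate(increments):
        s = max(0.0, s + x)  # resetting at zero forgets evidence before the change
        if s >= threshold:
            return t
    return None


# Toy stream: 5 faithful tokens (negative drift), then hallucinated tokens (positive drift).
onset = 5
stream = [-0.5] * onset + [1.0] * 10
alarm = cusum_alarm(stream, threshold=3.0)
print("alarm at token", alarm, "delay", alarm - onset)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer detection delay, which is the trade-off the delay bound in the abstract quantifies.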

Can AI Agents Synthesize Scientific Conclusions?

2606.11337v1 by Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.

摘要：科學 AI 代理越來越多地檢索證據、跨來源推理並綜合用於重要決策的結論。然後，它們在健康等高風險領域的能力仍然不明確。我們介紹了 SciConBench，這是一個大規模的實時基準，包含 9.11K 問題和專家撰寫的系統評價結論，用於評估開放領域的科學結論綜合。該基準依賴於經專家驗證的自動評估管道，將結論分解為原子事實，並通過事實精確度和召回率來衡量正確性和全面性。為了減少數據洩漏，我們進一步引入了 SciConHarness，這是一個清潔室評估工具，為代理提供受控的網絡互動，以確保有效的測量。評估 8 個前沿模型和深度研究代理時，我們發現事實質量仍然較低：在清潔室設置下，最佳代理的事實 F1 僅達 0.337。我們的清潔室設置相對於不受限制的評估持續降低性能，這表明洩漏會膨脹模型的真實綜合能力的估計。最後，我們審計了面向消費者的代理（例如，Google AI 概覽，OpenEvidence），發現它們經常生成不完整且有時矛盾的結論，即使當真實答案可用時也是如此。總體而言，我們的結果顯示，可靠的科學結論綜合仍然是一個未解決的挑戰，而清潔室評估對於評估開放領域的 AI 代理至關重要。
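
The factual precision, recall, and F1 mentioned in the abstract above can be written out as a small sketch. It assumes the atomic-fact decomposition and matching have already been done (in the paper this is handled by an automated pipeline), so the function only takes counts; the function name and example counts are hypothetical.

```python
def factual_prf(num_generated, num_generated_supported, num_reference, num_reference_covered):
    """Factual precision, recall, and F1 from atomic-fact counts.

    num_generated           : atomic facts in the agent's conclusion
    num_generated_supported : of those, how many the expert conclusion supports
    num_reference           : atomic facts in the expert conclusion
    num_reference_covered   : of those, how many the agent's conclusion covers
    """
    precision = num_generated_supported / num_generated if num_generated else 0.0
    recall = num_reference_covered / num_reference if num_reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single question.
p, r, f1 = factual_prf(num_generated=10, num_generated_supported=4,
                       num_reference=8, num_reference_covered=2)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Precision tracks correctness of what the agent asserted, while recall tracks how much of the expert conclusion it covered.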

Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

2606.11116v1 by Pooja Prajod

As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a "transparency dilemma", reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency: they proposed disclosure designs centered on user agency, namely detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.

摘要：隨著新聞編輯部整合生成式 AI，記者面臨一個披露挑戰：如何以維持讀者信任的方式傳達 AI 的參與。當前的做法提供了兩種方法：簡短的一行標籤或詳細的披露，具體說明人類監督、編輯責任和錯誤報告機制。這兩者都未能實現記者通過透明度建立信任的目標。一項對 34 位新聞讀者進行的現有控制實驗顯示，詳細的披露引發了 \textit{透明度困境}，降低了信任而非增加信任，並且有引入黑暗模式的風險，讓讀者在錯誤的透明感中滑過。單行披露避免了這種效果，但可能會造成信息缺口，促使讀者花費認知精力尋找披露所指示但未解釋的 AI 參與跡象。然而，讀者並不拒絕透明度，他們提出了以用戶主體性為中心的披露設計：按需詳細互動、比例 AI 比例可視化、媒體層級信號以及明確的「無 AI」標籤。我認為，從業者認為負責任的披露與用戶實際需求之間的這一脫節是 HCI 社群的一個設計問題。

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

2606.11106v1 by Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.

摘要：全球訓練有素的超聲醫生短缺限制了低收入和中等收入國家的產前超聲篩查，這些國家中有超過一半的孕婦沒有接受專業超聲檢查。當前的深度學習方法分別處理檢測、分割或分類，每個方法都需要一個單獨的模型和專家指定的標籤進行推斷。我們提出了FADA，一個基於Qwen3.5-VL的統一視覺-語言模型，通過單一的解釋優先管道執行臨床解釋、分類、檢測和分割，而不需要外部標籤。FADA通過離線預計算特徵緩存從四個特定領域的基礎模型（FetalCLIP、UltraSAM、USF-MAE、UltraFedFM）提煉知識。選擇性蒸餾僅將特徵對齊應用於標註任務，而解釋則依賴於標準微調，這在大多數評估指標上始終優於完全蒸餾。推薦變體FADA-SKD在分割中達到0.8820的平均Dice，在檢測中達到0.7671的mAP@0.50，以及100%的結構化解釋合規性。對237張圖像的專家超聲醫生驗證確認了在自主和人類參與模式下的臨床可接受輸出，其中73.5%的解釋在臨床醫生指導下得分完美。該系統可以在單個消費者GPU上進行訓練，並且可在沒有雲連接的情況下部署。我們通過在一部普通智能手機（高通Snapdragon 7 Gen 1，12 GB RAM）上運行壓縮的0.8B模型來驗證邊緣部署，使用llama.cpp進行GGUF量化，並在完全離線的情況下約60秒內完成完整的5階段管道。這為將AI輔助的胎兒評估與可攜式超聲設備整合建立了一條實用的途徑，直接解決了資源有限環境中的診斷接入差距。代碼、模型和數據可在https://github.com/mahmoodphd/FADA獲得。

Superficial Beliefs in LLM Decision-Making

2606.11016v1 by Gabriel Freedman, Francesca Toni

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

摘要：我們詢問大型語言模型（LLMs）在選擇兩個選項時是否僅僅模仿理由，或它們的選擇是否反映出一個系統性的決策結構。使用合成的二元決策環境，在這些環境中模型在由分級屬性定義的配置之間進行選擇，我們比較模型所說的最重要屬性與在適合先前決策的行為模型下最能解釋其選擇的屬性。行為模型對保留選擇的預測效果良好，顯示模型行為與可見屬性之間存在系統性的關聯，而不是隨機的。然而，直接的自我報告和一個獨立的基於分數的評估者僅部分恢復了行為上推斷的驅動因素。最終的情況既不是任意行為，也不是完全闡述的信念——輸出結構足夠支持預測，但明確的理由僅不完全跟踪恢復的驅動因素。這種質性模式在提示順序和取樣擾動、替代行為模型、針對性遮蔽分析以及結構變化的決策環境中持續存在。我們將此解釋為大型語言模型決策中“表面信念”的證據：模型的行為似乎受到屬性的概率性局部優先級的指導，而對驅動其決策的屬性僅有有限的語言訪問。

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

2606.11007v1 by Junchang Zheng, Junfeng Tan, Jialiang Lin

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

摘要：OpenClaw 已迅速崛起為一個變革性的人工智慧 (AI) 代理框架，其自主執行複雜的多步驟任務的能力吸引了越來越多樣化的用戶群體。然而，這一能力伴隨著重大風險。雖然現有研究在描述這些威脅方面取得了重要進展，但這些工作主要針對技術精湛的受眾。對於非技術用戶來說，這些研究仍然在很大程度上無法接觸。這一人群現在佔據了社區中越來越大且未被充分服務的部分，但正是這些用戶最迫切需要實用且簡單明瞭的指導。為此，我們通過一系列相互聯繫的努力來填補這一空白，旨在降低非技術 OpenClaw 用戶的風險門檻。首先，我們識別並分類了 OpenClaw 用戶在日常使用中可能遇到的七個核心風險，並用通俗易懂的語言解釋每一個，以便非技術用戶能夠輕鬆理解這些威脅的性質和潛在後果。其次，對於每個識別出的風險，我們提煉出一組相應的防禦策略，將其轉化為清晰且可操作的步驟，便於遵循。第三，為了使保護工作更為簡單，我們提供了一個 OpenClaw 技能，該技能自動化關鍵的安全配置，使用戶能夠以最小的手動干預來保護他們的系統。通過這項工作，我們展示了保護智能代理風險不必是安全專家的專屬領域，非技術用戶也能通過簡單、實用的行動有效參與降低這些風險。

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

2606.10942v1 by Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti, Carlos Natalino

As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.

摘要：隨著人工智慧和機器學習（AI/ML）模型成為網絡運營的不可或缺的一部分，它們缺乏透明度對運營商信任構成了重大障礙。現有的可解釋人工智慧（XAI）技術往往無法為非專業人士彌補這一差距，產生的技術輸出難以轉化為可行的見解。本文提出了一個專門設計來解決這一不足的框架。它利用了一個中等規模的大型語言模型（LLM），並超越了SHapley加法解釋（SHAP）特徵影響值的標準使用。該框架使用一個結構化的提示，並結合了互動特徵數據，以生成人類可理解的自然語言解釋。為了驗證我們的框架，我們對一個光學傳輸質量（QoT）估算的使用案例進行了實證評估，並邀請了人類評估者。我們收集了來自專家的獨立性能評估，顯示出高水平的評估者間一致性。與僅使用SHAP特徵影響值的先進基準相比，我們的方法在解釋的有用性和範圍上分別提高了12.2%和6.2%，同時達到了97.5%的正確率。

What Do Deepfake Speech Detectors Actually Hear?

2606.10912v1 by Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

摘要：深偽語音檢測器通常只輸出一個單一的分數，而不解釋為什麼音頻樣本被標記，證據位於信號的何處，或驅動決策的線索是什麼。我們提出了一個音頻原生的可解釋性管道，使用集成梯度對時間對齊的自我監督表示進行處理，以隨時間本地化決策證據。我們將所提出的方法應用於三個基於WavLM的檢測器（AASIST、CA-MHFA、SLS）在ASVspoof 5上，並手動標註最高歸因區域，以提供最重要線索的語義意義。儘管性能相似，這些檢測器依賴於不同的線索：AASIST強調非語音/環境線索，CA-MHFA專注於局部音素伪影，而SLS則依賴於單詞邊界和頻譜完整性。我們超越了推測性推理，通過對主要檢測器線索進行因果遮蔽來驗證我們的發現。觀察到的性能下降進一步支持了解釋的檢測器語義。
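
Integrated Gradients, the attribution method named in the abstract above, can be illustrated on a toy model. The sketch uses a three-feature logistic scorer with a hand-written gradient in place of a WavLM-based detector, and a midpoint Riemann sum for the path integral; the weights, input, and all-zero baseline are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy detector score: logistic function of a linear combination."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(x, weights, bias):
    """Analytic gradient of the toy score with respect to the input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """IG_i = (x_i - x'_i) * average gradient_i along the straight path x' -> x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = gradient(point, weights, bias)
        totals = [t + g for t, g in zip(totals, grads)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


weights, bias = [2.0, -1.0, 0.5], 0.1   # hypothetical model parameters
x, baseline = [1.0, 0.5, -2.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)

# Completeness check: attributions should sum to score(x) - score(baseline).
print(sum(attributions), model(x, weights, bias) - model(baseline, weights, bias))
```

In the audio setting each input dimension would correspond to a time-aligned frame, so large attributions point at the moments in the signal that drive the decision.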

Accelerating NeurASP with vectorization and caching

2606.10787v1 by Alexander Philipp Rader, Alessandra Russo

Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP's enhanced learning function.

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

2606.10595v1 by Huong Nguyen, Mickaël Bettinelli, Amirhossein Ghaffari, Alexandre Benoit, Hong-Tri Nguyen, Susanna Pirttikangas, Lauri Lovén

Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm preserves privacy by letting multiple clients collaboratively train a shared-task model without exposing their local data. While being a key component in any learning system, data is also a primary source of vulnerabilities and challenges, and a major determinant of stable, well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, without delving into the diverse aspects of data or considering problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we decompose non-IID data into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design their systems with predictable convergence and stability.

Towards Critical Branching Mechanism in Recurrent Neural Networks

2606.10384v1 by Feixiang Ren, Ling Feng

Criticality has been proposed as a key organizing principle in biological neural systems, yet its origin and relevance in artificial neural networks remain unclear. We analyze hidden-state dynamics in trained long short-term memory (LSTM) networks and show that small networks near their optimal training epochs (steps) exhibit scale-free avalanche statistics and branching parameters close to unity, indicative of near-critical dynamics, while larger models remain subcritical. To explain the coexistence of subcritical branching with robust $1/f^{\beta}$ noise, we introduce a mixture branching process framework that links heterogeneous branching dynamics to long-range temporal correlations. These results identify critical-like behavior in LSTMs as an emergent, capacity-dependent dynamical regime.
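
For readers unfamiliar with the two quantities, the following sketch computes a conventional branching ratio and avalanche sizes from a sequence of per-step activity counts. How hidden-state activity is thresholded into counts, and which estimator the paper uses, are not stated in the abstract. The `activity` series and both function names are illustrative assumptions.

```python
def branching_ratio(activity):
    # Conventional estimator: average number of active units at t+1 per active unit at t.
    ratios = [nxt / cur for cur, nxt in zip(activity, activity[1:]) if cur > 0]
    return sum(ratios) / len(ratios)

def avalanche_sizes(activity):
    # An avalanche is a maximal run of non-zero activity; its size is the summed activity.
    sizes, current = [], 0
    for a in activity:
        if a > 0:
            current += a
        elif current > 0:
            sizes.append(current)
            current = 0
    if current > 0:
        sizes.append(current)
    return sizes

activity = [0, 1, 2, 2, 1, 0, 0, 3, 3, 0, 1, 0]  # hypothetical active-unit counts per step
sigma = branching_ratio(activity)                # values below 1 indicate subcritical dynamics
sizes = avalanche_sizes(activity)
```

A branching ratio near 1, together with a heavy-tailed distribution of `sizes`, is the usual signature of near-critical dynamics.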

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

2606.10279v1 by Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

2606.10159v1 by Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

2606.14766v1 by Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

2606.09746v1 by Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs that process video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer; approximation methods are used only in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
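
The effect of restricting the adversary to a subset of frames can be seen with plain interval arithmetic on a single affine unit. The sketch below is not STBP. The weights, the input values, the frame layout, and both helper names are hypothetical. For an affine map over a box-shaped input set, the output range has an exact closed form, and shrinking the box to the frames that may actually be perturbed tightens it.

```python
def linear_bounds(weights, bias, lower, upper):
    # Exact output range of an affine map over a box: positive weights take the
    # matching bound, negative weights take the opposite one.
    lo = bias + sum(w * (l if w >= 0 else u) for w, l, u in zip(weights, lower, upper))
    hi = bias + sum(w * (u if w >= 0 else l) for w, l, u in zip(weights, lower, upper))
    return lo, hi

def perturbation_box(x, eps, perturbed_frames, frame_size):
    # Only inputs that belong to the chosen frames may move by +/- eps.
    lower, upper = [], []
    for i, v in enumerate(x):
        e = eps if i // frame_size in perturbed_frames else 0.0
        lower.append(v - e)
        upper.append(v + e)
    return lower, upper

x = [0.2, 0.4, 0.1, 0.9, 0.5, 0.3]    # 3 frames of 2 values each (hypothetical)
w = [1.0, -2.0, 0.5, 0.5, -1.0, 3.0]  # weights of one output unit (hypothetical)

all_frames = linear_bounds(w, 0.0, *perturbation_box(x, 0.1, {0, 1, 2}, 2))
one_frame = linear_bounds(w, 0.0, *perturbation_box(x, 0.1, {1}, 2))
```

Here `one_frame` is a strictly narrower interval than `all_frames`, which is the intuition behind modelling the perturbation as a spatio-temporal constraint instead of a norm ball over every frame.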

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

2606.09672v1 by Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

2606.09671v1 by Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

2606.09568v1 by Tom Beyer, Svea Wisy, Sven Tomforde

The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.

Capacity, Not Format: Rethinking Structured Reasoning Failures

2606.09410v1 by Hengxin Fan

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation, in which the model reasons freely before formatting, recovers most of the lost accuracy (3-run mean: 80-87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
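
The rounding remark in the abstract can be checked with a few lines of exact arithmetic. The denominator 133 and the difference of 7 problems are stated in the abstract. The raw counts 128 and 121 are an assumption, chosen here because they are consistent with the reported 96.2% and 91.0%.

```python
from fractions import Fraction

n_problems = 133        # denominator stated in the abstract
correct_free = 128      # assumed count, consistent with the reported 96.2%
correct_json = 121      # assumed count, consistent with the reported 91.0%

acc_free = Fraction(correct_free, n_problems)
acc_json = Fraction(correct_json, n_problems)
drop_pp = float((acc_free - acc_json) * 100)   # exact difference in percentage points

print(round(float(acc_free) * 100, 1), round(float(acc_json) * 100, 1), round(drop_pp, 2))
```

Subtracting the two rounded percentages gives 5.2, while the exact difference of 7/133 rounds to 5.3, which is why the abstract quotes the larger figure.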

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

2606.14758v1 by Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted "cat"). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection
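
The core geometric step, orthogonalizing a query vector against distractor concepts, can be sketched as a projection onto the orthogonal complement of the distractor span. This is a generic Gram-Schmidt projection and not the authors' OSP implementation, which is available in the linked repository. The three-dimensional vectors and both function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(query, distractors, tol=1e-12):
    # Build an orthonormal basis of the distractor span (Gram-Schmidt),
    # then subtract the query's component along each basis vector.
    basis = []
    for d in distractors:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:                      # skip linearly dependent distractors
            basis.append([ri / norm for ri in r])
    residual = list(query)
    for b in basis:
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return residual

query = [1.0, 1.0, 0.0]                             # e.g. embedding of the prompt
distractors = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]    # embeddings of competing concepts
residual = orthogonalize(query, distractors)
```

The returned `residual` has zero inner product with every distractor, so any similarity score computed from it ignores directions shared with those concepts.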

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

2606.09030v1 by Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

2606.09012v1 by Hanyang Li, Jianhao Ma, Ying Cui

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss river inside a wider valley: a normal neighborhood of the river forms a nearly flat basin, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.
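
The straight-through update described above, with the gradient evaluated at the quantized weight and the step applied to the latent weight, can be sketched in one dimension. The quadratic loss, the grid step, the learning rate, and the starting point are all hypothetical. The toy does not reproduce the paper's river-valley model or its PTQ failure construction.

```python
import math

def quantize(w, step=1.0):
    # Round-to-nearest onto a uniform grid (ties round up).
    return step * math.floor(w / step + 0.5)

def loss(w):
    return (w - 0.4) ** 2          # hypothetical 1-D loss with its minimum at 0.4

def loss_grad(w):
    return 2.0 * (w - 0.4)

latent = 2.6                       # latent full-precision weight
lr = 0.1
deployed = [quantize(latent)]      # quantized weight that would actually be deployed
for _ in range(50):
    g = loss_grad(quantize(latent))    # gradient evaluated at the deployed quantized weight
    latent -= lr * g                   # straight-through: the step is applied to the latent weight
    deployed.append(quantize(latent))
```

In this toy run the deployed weight starts at the grid point 3 and ends at one of the two grid points adjacent to the minimum (0 or 1). The latent weight keeps hovering around the rounding threshold between them.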


SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts

FoundCause: Causal Discovery with Latent Confounders from Observational Data

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

SoK: AI-Augmented Binary Reversing

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

Nothing from Something: Can a Language Model Discover 0?

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Rift: A Conflict Signature for Deception in Language Models

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval