Medical explainable AI

Publish Date	Title	Authors	arXiv ID	Code
2026-04-03	PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction	Daniel C. MacRae et.al.	2604.03203v1	null
2026-04-03	Verbalizing LLMs' assumptions to explain and control sycophancy	Myra Cheng et.al.	2604.03058v1	null
2026-04-03	Analysis of Optimality of Large Language Models on Planning Problems	Bernd Bohnet et.al.	2604.02910v1	null
2026-04-03	One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging	Baban Gain et.al.	2604.02881v1	null
2026-04-02	Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits	Adam Bayley et.al.	2604.02527v1	null
2026-04-02	An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis	Md. Sajeebul Islam Sk. et.al.	2604.02502v1	null
2026-04-02	Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview	Shramana Dey et.al.	2604.02448v1	null
2026-04-02	VISTA: Visualization of Token Attribution via Efficient Analysis	Syed Ahmed et.al.	2604.02217v1	null
2026-04-02	Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges	Srivaths Ranganathan et.al.	2604.02211v1	null
2026-04-02	Tracking the emergence of linguistic structure in self-supervised models learning from speech	Marianne de Heer Kloots et.al.	2604.02043v1	null
2026-04-02	Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia	Saja Al-Dabet et.al.	2604.01962v1	null
2026-04-02	Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification	Géraud Faye et.al.	2604.01936v1	null
2026-04-02	Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy	Ruijie Yang et.al.	2604.01705v1	null
2026-04-02	Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology	Tianhao Shi et.al.	2604.01690v1	null
2026-04-02	Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture	Florian Odi Stummer et.al.	2604.01661v1	null
2026-04-02	Do Large Language Models Mentalize When They Teach?	Sevan K. Harootonian et.al.	2604.01594v1	null
2026-04-02	PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance	Ayan Das et.al.	2604.01532v1	null
2026-04-01	Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving	Devakh Rashie et.al.	2604.01483v1	null
2026-04-01	When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems	Khalid Adnan Alsayed et.al.	2604.01449v2	null
2026-04-01	Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering	Jingyue Li et.al.	2604.01437v1	null
2026-04-01	Semantic Modeling for World-Centered Architectures	Andrei Mantsivoda et.al.	2604.01359v1	null
2026-04-01	Safety, Security, and Cognitive Risks in World Models	Manoj Parmar et.al.	2604.01346v1	null
2026-04-01	The Recipe Matters More Than the Kitchen:Mathematical Foundations of the AI Weather Prediction Pipeline	Piyush Garg et.al.	2604.01215v1	null
2026-04-01	The Overlooked Repetitive Lengthening Form in Sentiment Analysis	Lei Wang et.al.	2604.01268v1	null
2026-04-01	PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor	Yutao Yang et.al.	2604.00931v2	null
2026-04-01	Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning	Swapnil Parekh et.al.	2604.00770v1	null
2026-04-01	Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks	Yunwen Lei et.al.	2604.00505v1	null
2026-04-01	A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation	Yabin Zhang et.al.	2604.00493v1	null
2026-04-01	Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions	Yuchen Yang et.al.	2604.00397v1	null
2026-03-31	Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry	Syed Eqbal Alam et.al.	2604.00319v1	null
2026-03-31	AI-Mediated Explainable Regulation for Justice	Thomas Hofweber et.al.	2604.00237v1	null
2026-03-31	Explainable AI for Blind and Low-Vision Users: Navigating Trust, Modality, and Interpretability in the Agentic Era	Abu Noman Md Sakib et.al.	2604.00187v1	null
2026-03-31	Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System	Xiaoshan Huang et.al.	2603.29950v1	null
2026-03-31	Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence	Georgii Mikriukov et.al.	2603.29915v1	null
2026-03-31	Reasoning-Driven Synthetic Data Generation and Evaluation	Tim R. Davidson et.al.	2603.29791v1	null
2026-03-31	CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing	Chathurangi Shyalika et.al.	2603.29755v1	null
2026-03-31	Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding	Joakim Edin et.al.	2603.29709v1	null
2026-03-31	Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor	Christopher Koch et.al.	2603.29681v1	null
2026-03-31	AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding	Moiz Sadiq Awan et.al.	2603.29366v1	null
2026-03-31	Rigorous Explanations for Tree Ensembles	Yacine Izza et.al.	2603.29361v1	null
2026-03-31	Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model	Siyuan Du et.al.	2603.29176v1	null
2026-03-31	Knowledge database development by large language models for countermeasures against viruses and marine toxins	Hung N. Do et.al.	2603.29149v1	null
2026-03-31	Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health	Yuqing Xiao et.al.	2603.29114v1	null
2026-03-30	A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank	Iness Halimi et.al.	2603.29041v1	null
2026-03-30	Towards a Medical AI Scientist	Hongtao Wu et.al.	2603.28589v1	null
2026-03-30	Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework	Ya Zhou et.al.	2603.28532v1	null
2026-03-30	RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time	Anurag Ghosh et.al.	2603.28522v2	null
2026-03-30	The Unreasonable Effectiveness of Scaling Laws in AI	Chien-Ping Lu et.al.	2603.28507v1	null
2026-03-30	CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains	Wenhan Wang et.al.	2603.28474v1	null
2026-03-30	The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation	Doan Nam Long Vu et.al.	2603.28387v1	null
2026-03-30	Mapping data literacy trajectories in K-12 education	Robert Whyte et.al.	2603.28317v1	null
2026-03-30	A Survey on AI for 6G: Challenges and Opportunities	Constantina Chatzieleftheriou et.al.	2604.02370v1	null
2026-03-30	Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey	Bhavuk Jain et.al.	2603.27918v1	null
2026-03-29	ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks	Samin Mahdizadeh Sani et.al.	2603.27862v1	null
2026-03-29	What-If Explanations Over Time: Counterfactuals for Time Series Classification	Udo Schlegel et.al.	2603.27792v1	null
2026-03-29	TianJi:An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science	Kaikai Zhang et.al.	2603.27738v1	null
2026-03-29	Multi-Agent Dialectical Refinement for Enhanced Argument Classification	Jakub Bąba et.al.	2603.27451v1	null
2026-03-28	Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach	Maziar Kianimoghadam Jouneghani et.al.	2603.27356v1	null
2026-03-28	Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models	Mehedi Hasan Tusar et.al.	2603.27325v1	null
2026-03-28	MediHive: A Decentralized Agent Collective for Medical Reasoning	Xiaoyang Wang et.al.	2603.27150v1	null
2026-03-28	Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning	Hossein Salemi et.al.	2603.27057v1	null
2026-03-27	PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management	Eugenio Rodrigo Zimmer Neves et.al.	2603.26324v1	null
2026-03-27	Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation	Xue Liu et.al.	2604.02368v1	null
2026-03-27	Sparse Auto-Encoders and Holism about Large Language Models	Jumbly Grindrod et.al.	2603.26207v1	null
2026-03-27	Concerning Uncertainty -- A Systematic Survey of Uncertainty-Aware XAI	Helena Löfström et.al.	2603.26838v1	null
2026-03-27	SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis	Zhangtianyi Chen et.al.	2603.26122v1	null
2026-03-27	DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction	Magnus H. Strømme et.al.	2603.26114v1	null
2026-03-27	A Regression Framework for Understanding Prompt Component Impact on LLM Performance	Andrew Lauziere et.al.	2603.26830v1	null
2026-03-27	FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants	Mahesh Bhosale et.al.	2603.26008v1	null
2026-03-26	Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics	Peter Balogh et.al.	2603.25975v1	null
2026-03-26	Methods for Knowledge Graph Construction from Text Collections: Development and Applications	Vanni Zavarella et.al.	2603.25862v1	null
2026-03-26	A Compression Perspective on Simplicity Bias	Tom Marty et.al.	2603.25839v1	null
2026-03-26	Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI	Anna Kozlova et.al.	2603.25821v1	null
2026-03-26	DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial	Zhenchen Zhu et.al.	2603.25607v1	null
2026-03-26	From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild	Zhi Zeng et.al.	2603.25423v1	null
2026-03-26	Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models	Eyal Hadad et.al.	2603.25403v2	null
2026-03-26	4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles	Yunus E. Zeytuncu et.al.	2603.25356v1	null
2026-03-26	Evaluating Language Models for Harmful Manipulation	Canfer Akbulut et.al.	2603.25326v3	null
2026-03-26	DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers	Shu Wan et.al.	2603.25293v1	null
2026-03-26	Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding	Gregor Baer et.al.	2603.25251v1	null
2026-03-26	Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence	Vehid Geruslu et.al.	2603.25146v1	null
2026-03-26	An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks	Syed Rayhan Masud et.al.	2603.25070v1	null
2026-03-26	Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients	Michael Hardy et.al.	2603.24999v2	null
2026-03-26	Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators	Ray-Yuan Chung et.al.	2603.24986v1	null
2026-03-26	Self-Corrected Image Generation with Explainable Latent Rewards	Yinyi Luo et.al.	2603.24965v1	null
2026-03-26	Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math	Dingjie Song et.al.	2603.24961v1	null
2026-03-26	Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings	Gesina Schwalbe et.al.	2603.26798v1	null
2026-03-26	Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence	Vasu Srinivasan et.al.	2603.24898v1	null
2026-03-25	More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes	Venkatesh Sivaraman et.al.	2603.24877v1	null
2026-03-25	A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study	Yongda Fan et.al.	2603.24828v1	null
2026-03-25	PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI	Hayder Saad Abdulbaqi et.al.	2603.26794v1	null
2026-03-25	Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis	Abu Noman Md Sakib et.al.	2603.24801v1	null
2026-03-25	A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English	Dana Serditova et.al.	2603.24549v1	null
2026-03-25	No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions	Emily Schiller et.al.	2603.24524v1	null
2026-03-25	Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA	John Ray B. Martinez et.al.	2603.24481v1	null
2026-03-25	Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice	Domenique Zipperling et.al.	2603.24448v1	null
2026-03-25	Enes Causal Discovery	Alexis Kafantaris et.al.	2603.24436v3	null
2026-03-25	From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring	Nizam Kadir et.al.	2603.23990v1	null
2026-03-25	Generative AI User Experience: Developing Human--AI Epistemic Partnership	Xiaoming Zhai et.al.	2603.23863v1	null
2026-03-24	Causal AI For AMS Circuit Design: Interpretable Parameter Effects Analysis	Mohyeu Hussain et.al.	2603.24618v1	null

Abstracts

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

2604.03203v1 by Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal, Suzanne P. M. de Vette, Hendrike Neh, Baoqiang Ma, Peter M. A. van Ooijen, Lisanne V. van Dijk

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as few as two lines of code.

Verbalizing LLMs' assumptions to explain and control sycophancy

2604.03058v1 by Myra Cheng, Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky, Diyi Yang

LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.

Analysis of Optimality of Large Language Models on Planning Problems

2604.02910v1 by Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.
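To make "plan optimality" in Blocksworld concrete, here is a minimal sketch of a breadth-first search that returns the shortest number of moves between two tower configurations. It is not the paper's evaluation code. The single "move the top block" action, the fully specified goal, and the names `canonical`, `successors`, and `optimal_plan_length` are all assumptions made for illustration; the benchmark itself may use a different action set (for example separate pick-up and put-down steps).

```python
from collections import deque

def canonical(towers):
    """Drop empty towers and sort, so that the order of towers is irrelevant."""
    return tuple(sorted(t for t in towers if t))

def successors(state):
    """Yield every state reachable by moving exactly one top block."""
    towers = list(state)
    for i, src in enumerate(towers):
        block = src[-1]          # only the top block of a tower can move
        remaining = src[:-1]
        rest = towers[:i] + towers[i + 1:]
        # Put the block on the table (pointless if it already sits alone).
        if remaining:
            yield canonical(rest + [remaining, (block,)])
        # Put the block on top of any other tower.
        for j, dst in enumerate(rest):
            others = rest[:j] + rest[j + 1:]
            yield canonical(others + [remaining, dst + (block,)])

def optimal_plan_length(start, goal):
    """Breadth-first search: the first time the goal is dequeued, depth is minimal."""
    start, goal = canonical(start), canonical(goal)
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == goal:
            return depth
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None  # goal unreachable (cannot happen if both use the same blocks)
```

Each tower is a tuple listed bottom to top, so `optimal_plan_length([("A", "B", "C")], [("C", "B", "A")])` asks for the cheapest way to invert a three-block tower. Exhaustive search like this grows quickly with depth and width, which is the scaling wall the abstract attributes to classical search.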

One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

2604.02881v1 by Baban Gain, Asif Ekbal, Trilok Nath Singh

Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning language models on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.
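Two ingredients of this abstract can be sketched in a few lines of NumPy: a merge in weight space and a layer-wise similarity score. The sketch assumes uniform parameter averaging as the merge and the linear variant of centered kernel alignment (CKA) as the similarity. The abstract does not say which merging strategies or which CKA kernel were used, so both choices, and the function names, are illustrative assumptions.

```python
import numpy as np

def average_weights(state_dicts):
    """Uniform weight-space merge: element-wise mean of matching parameters."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_examples, n_features).

    Values near 1 mean the two layers represent the examples similarly;
    the score is invariant to rotations and isotropic scaling of the features.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

In an analysis of this kind, `x` and `y` would hold the activations of the same layer in two fine-tuned models on the same inputs. A score that drops in the upper layers would match the representational divergence the abstract describes.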

Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

2604.02527v1 by Adam Bayley, Xiaodan Zhu, Raquel Aoki, Yanshuai Cao, Kevin H. Wilson

The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit's regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
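The corruption experiment can be illustrated with a deliberately simplified sketch. It assumes a non-contextual Beta-Bernoulli bandit with Thompson sampling rather than the contextual bandits studied in the paper, and it models only label-flipping noise. The helper names `flip_labels`, `warm_start_prior`, and `thompson_step` are hypothetical.

```python
import numpy as np

def flip_labels(labels, rate, rng):
    """Flip each binary preference label independently with probability `rate`."""
    labels = np.asarray(labels)
    mask = rng.random(labels.shape) < rate
    return np.where(mask, 1 - labels, labels)

def warm_start_prior(arms, labels, n_arms):
    """Turn synthetic (arm, label) pairs into Beta pseudo-counts per arm.

    Starting from Beta(1, 1) corresponds to a cold start when no data is given.
    """
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    for arm, label in zip(arms, labels):
        alpha[arm] += label       # synthetic "user liked it"
        beta[arm] += 1 - label    # synthetic "user did not like it"
    return alpha, beta

def thompson_step(alpha, beta, rng):
    """Sample one plausible reward rate per arm and play the best-looking arm."""
    return int(np.argmax(rng.beta(alpha, beta)))
```

Passing the output of `flip_labels` into `warm_start_prior` at increasing rates shows why heavy corruption can be worse than no prior at all: once most labels are flipped, the pseudo-counts point the bandit toward the wrong arm, and real feedback first has to undo them.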

An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

2604.02502v1 by Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge, with diagnosis heavily dependent on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while simultaneously preserving spatial accuracy, primarily due to global pooling mechanisms that discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal components. First, we propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function integrates control theory principles to dynamically modify training penalties, specifically addressing difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework demonstrates strong performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.
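For orientation, the standard (non-adaptive) Tversky loss that the proposed loss builds on can be sketched as below. This is not the paper's Adaptive PID-Tversky Loss: the PID-driven adjustment of the penalties is not described in the abstract and is not reproduced here. The function name, the fixed `alpha` and `beta`, and the smoothing constant `eps` are assumptions.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for a soft or binary segmentation mask.

    alpha weights false positives and beta weights false negatives;
    alpha = beta = 0.5 recovers the Dice loss, while beta > alpha
    penalizes under-segmentation (missed foreground) more heavily.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)           # true positives
    fp = np.sum(pred * (1.0 - target))   # false positives
    fn = np.sum((1.0 - pred) * target)   # false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

With a prediction that misses two of three foreground pixels, raising `beta` from 0.5 to 0.7 raises the loss from 0.5 to roughly 0.58, which is the lever an adaptive scheme would turn for under-segmented minority classes.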

Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview

2604.02448v1 by Shramana Dey, Zahir Khan, T. A. PramodKumar, B. Uma Shankar, Ashis K. Dhara, Ramachandran Rajalakshmi, Rajiv Raman, Sushmita Mitra

Diabetic Retinopathy (DR) is a serious microvascular complication of diabetes, and one of the leading causes of vision loss worldwide. Although automated detection and grading with Deep Learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories often remain geographically narrow, contain limited samples, and exhibit inconsistent annotations or variable image quality, thereby restricting their clinical reliability. This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of DR. The study evaluates their usability across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study to illustrate broader challenges in dataset curation and usage. The review consolidates current knowledge while highlighting persistent gaps such as the lack of standardized lesion-level annotations and longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.

VISTA: Visualization of Token Attribution via Efficient Analysis

2604.02217v1 by Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N

Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit
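The leave-one-out idea behind the three matrices can be sketched as follows. This is a loose illustration rather than the toolkit's implementation: the toy `embed` function (a sum of per-token vectors from a lookup table), the exact formulas for the angular, magnitude, and per-dimension deviations, and all names are assumptions. The sketch also stops short of a composite score, since the abstract does not say how the three components are combined.

```python
import numpy as np

def embed(tokens, table):
    """Toy sentence embedding: the sum of per-token vectors (an assumption)."""
    return np.sum([table[t] for t in tokens], axis=0)

def token_importance(tokens, table):
    """Remove each token in turn and measure how the embedding moves.

    Expects a list of at least two tokens whose ablated embeddings are non-zero.
    """
    full = embed(tokens, table)
    scores = []
    for i, token in enumerate(tokens):
        ablated = embed(tokens[:i] + tokens[i + 1:], table)
        cos = full @ ablated / (np.linalg.norm(full) * np.linalg.norm(ablated))
        scores.append({
            "token": token,
            # shift in semantic direction, in radians
            "angular": float(np.arccos(np.clip(cos, -1.0, 1.0))),
            # change in vector length
            "magnitude": float(abs(np.linalg.norm(full) - np.linalg.norm(ablated))),
            # absolute change along each embedding dimension
            "per_dim": np.abs(full - ablated),
        })
    return scores
```

Because it needs only forward passes of an embedding function, a perturbation scheme like this involves no backpropagation, in line with the model-agnostic, low-memory framing of the abstract. It does cost one extra forward pass per token.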

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

2604.02211v1 by Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

摘要：視頻推薦系統是人工智慧中最受歡迎和影響力最大的應用之一，塑造了數十億用戶的內容消費並影響文化。傳統的單一模型推薦系統，優化靜態參與指標，越來越難以滿足現代平台的動態需求。為此，多代理架構正在重新定義視頻推薦系統如何為用戶和數據集提供服務、學習和適應。這些基於代理的系統協調專門的代理，負責視頻理解、推理、記憶和反饋，以提供精確且可解釋的推薦。在這項調查中，我們追溯了多代理視頻推薦系統（MAVRS）的演變。我們結合了多代理推薦系統、基礎模型和對話式人工智慧的理念，最終形成了大型語言模型（LLM）驅動的MAVRS的新興領域。我們提出了一個合作模式的分類法，並分析了在不同視頻領域中的協調機制，這些領域從短片到教育平台不等。我們討論了代表性的框架，包括早期的多代理強化學習（MARL）系統如MMRF，以及最近的LLM驅動架構如MACRec和Agent4Rec，以說明這些模式。我們還概述了在可擴展性、多模態理解、激勵對齊方面的挑戰，並確定了研究方向，如混合強化學習-LLM系統、終身個性化和自我改善的推薦系統。

Tracking the emergence of linguistic structure in self-supervised models learning from speech

2604.02043v1 by Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

摘要：自我監督的語音模型學習有效的口語語言表示，這已被證明反映了語言結構的各個方面。那麼，這種結構在模型訓練中何時出現？我們研究了六個在荷蘭語口語上訓練的 Wav2Vec2 和 HuBERT 模型的不同層級和中間檢查點中各種語言結構的編碼。我們發現，不同層級的語言結構顯示出顯著不同的層級模式和學習軌跡，這在某種程度上可以通過它們與聲音信號的抽象程度以及從輸入中整合信息的時間尺度的差異來解釋。此外，我們發現預訓練目標的定義層級對語言結構的層級組織和學習軌跡有強烈影響，更高階的預測任務（即反覆精煉的偽標籤）會引起更大的平行性。

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

摘要：異常頭部運動（AHMs）在廣泛的神經疾病中表現出來；然而，缺乏一個整合運動學測量、臨床嚴重程度評分和患者人口統計的多條件資源，構成了開發基於人工智慧的診斷工具的持續障礙。為了解決這一問題，本研究介紹了NeuroPose-AHM，這是一個基於知識的神經誘發AHMs數據集，通過應用於1,430篇經過同行評審的出版物的多LLM提取框架構建而成。該數據集包含2,756個患者群體級別的記錄，涵蓋57種神經疾病，來源於846篇與AHM相關的論文。跨LLM可靠性分析確認了穩健的提取性能，研究級別的分類達到強一致性（kappa = 0.822）。為了展示該數據集的分析效用，將四任務框架應用於頸部肌張力障礙（CD），這是由病理性頭部運動最直接定義的疾病。首先，任務1執行多標籤AHM類型分類（F1 = 0.856）。任務2構建頭頸嚴重程度指數（HNSI），這是一個統一的指標，將異質的臨床評分標準進行標準化。然後在任務3中評估該指數的臨床相關性，其中HNSI與現實世界的CD患者數據進行驗證，對應的重度比例（6.7%）為指數在高嚴重程度範圍內的校準提供了初步的合理性指示。最後，任務4在運動類型概率和HNSI分數之間進行橋接分析，產生了顯著的相關性（p小於0.001）。這些結果展示了NeuroPose-AHM作為一個結構化的、基於知識的神經AHM研究資源的分析效用。NeuroPose-AHM數據集在Zenodo上公開可用（https://doi.org/10.5281/zenodo.19386862）。
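
The inter-LLM agreement figure quoted above (kappa = 0.822) is a chance-corrected agreement statistic. The abstract does not say which kappa variant was used, so the sketch below assumes Cohen's kappa for two raters; the label lists are invented purely for illustration and are not data from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance if both raters labelled independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical study-level relevance labels from two extraction LLMs.
llm_a = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
llm_b = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant", "irrelevant"]
kappa = cohen_kappa(llm_a, llm_b)
print(round(kappa, 3))
```

With more than two extraction models, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.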

Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification

2604.01936v1 by Géraud Faye, Benjamin Icard, Morgane Casanova, Guillaume Gadek, Guillaume Gravier, Wassila Ouerdane, Céline Hudelot, Sylvain Gatepaille, Paul Égré

Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features.
Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness

摘要：在新聞失序中，宣傳新聞特別隱蔽，因為它們往往將導向性信息與看似可靠的事實報導混合在一起。要檢測宣傳，基於語言模型（如BERT）的現有方法是有前景的，但由於數據收集中的偏見，這些方法往往會過度擬合其訓練數據集。為了增強分類的穩健性並改善對新來源的泛化，我們提出了一種神經符號方法，將非上下文文本嵌入（fastText）與符號概念特徵（如類型、主題和說服技術）結合起來。結果顯示，相較於等效的僅文本方法有改善，消融研究以及可解釋性分析確認了新增特徵的好處。
關鍵詞：信息失序、假新聞、宣傳、分類、主題建模、混合方法、神經符號模型、消融、穩健性

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

2604.01705v1 by Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

摘要：自動語音識別（ASR）是人機互動中一個關鍵介面，尤其是在胃腸內視鏡檢查中，但其在現實臨床環境中的可靠性受到特定領域術語和複雜聲學條件的限制。在此，我們介紹EndoASR，一個為內視鏡工作流程實時部署而設計的領域適應ASR系統。我們基於合成內視鏡報告開發了一種兩階段適應策略，針對特定領域的語言建模和噪音穩健性。在對六位內視鏡醫生的回顧性評估中，EndoASR顯著提高了轉錄準確性和臨床可用性，將字符錯誤率（CER）從20.52%降低至14.14%，並將醫療術語準確性（Med ACC）從54.30%提高至87.59%。在一項跨越五個獨立內視鏡中心的前瞻性多中心研究中，EndoASR在異質的現實條件下顯示出一致的泛化能力。與基線Paraformer模型相比，CER從16.20%降低至14.97%，而Med ACC從61.63%提高至84.16%，確認了其在實際部署情境中的穩健性。值得注意的是，EndoASR實現了0.005的實時因子（RTF），顯著快於Whisper-large-v3（RTF 0.055），同時保持220M參數的緊湊模型大小，實現高效的邊緣部署。此外，與大型語言模型的整合顯示，改善的ASR質量直接增強了下游結構化信息提取和臨床醫生與AI的互動。這些結果表明，領域適應的ASR可以作為胃腸內視鏡中人機協作的可靠介面，其一致的性能在多中心現實臨床環境中得到了驗證。
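
The two headline metrics in this abstract follow standard definitions: character error rate (CER) is the character-level edit distance divided by the reference length, and real-time factor (RTF) is processing time divided by audio duration. The sketch below is a generic implementation under those definitions, not the EndoASR evaluation code; the example strings and timings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical transcript pair and timing.
cer = character_error_rate("polyp in sigmoid colon", "polyp in sigmoid colan")
rtf = real_time_factor(0.05, 10.0)
print(f"CER={cer:.4f} RTF={rtf:.3f}")
```

On this reading, an RTF of 0.005 corresponds to roughly 0.05 seconds of compute for 10 seconds of audio.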

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

2604.01690v1 by Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song, Yongdong Zhang, Fuli Feng

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidates the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identify a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovers the ability of the algorithmic content distribution mechanism to moderate these competing interests regarding AIGC. These findings argue for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.

摘要：人工智慧生成內容（AIGC）的快速增長正在根本重塑線上內容生態系統，迫切需要對其行為和分配影響進行嚴格的檢視。本研究利用來自一家領先中國視頻分享平台的數千萬用戶的綜合縱向數據集，闡明了AIGC與人類生成內容（HGC）之間的獨特創作和消費行為。我們識別出一種普遍的規模優於偏好的動態，即AIGC創作者通過高產量的生產實現了與HGC創作者相當的總體參與度，儘管消費者對HGC的偏好顯著。更深入的分析揭示了算法內容分發機制在調節這些關於AIGC的競爭利益方面的能力。這些發現提倡實施對AIGC敏感的分發算法和精確的治理框架，以確保線上內容平台的長期健康。

Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture

2604.01661v1 by Florian Odi Stummer

Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.

摘要：臨床 AI 系統通常在受到文檔工作流程、計費激勵和術語碎片化結構性扭曲的健康數據上進行訓練。先前的研究已經描述了這種扭曲的機制：文檔執行的三力模型、AI 可能放大編碼工件的具象反饋循環，以及允許語義漂移累積的術語治理失敗。然而，將這些洞見轉化為可實施的軟體架構仍然是一個未解決的問題。本文提出了七種基於本體的設計模式，使用四人幫模式語言來構建對本體扭曲具有韌性的臨床 AI 管道。這些模式針對數據攝取驗證（本體檢查點）、低頻信號保留（休眠感知管道）、持續漂移監控（漂移哨兵）、平行表示維護（雙本體層）、反饋循環中斷（具象電路斷路器）、術語演變管理（術語版本閘）和可插拔的監管合規性（監管合規適配器）。每個模式都包含問題、力量、解決方案、後果、已知用途和相關模式的規範。我們在一個初級護理 AI 系統的參考架構中展示了它們的組合，並提供了一個通過糖尿病風險預測場景追蹤所有七種模式的步驟。本文不報告實證驗證；它提供了一個基於理論分析的設計詞彙，待未來在生產系統中進行評估。三種模式在現有系統中有部分先例；其餘四種尚未被正式描述。限制包括缺乏運行時基準和僅限於德國及歐盟的監管背景。
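
The abstract names a Drift Sentinel pattern for continuous drift monitoring but does not specify its mechanics. As one possible reading, the sketch below compares the frequency distribution of diagnostic codes in a recent window against a baseline and raises a flag when the total variation distance exceeds a threshold. The class design, the distance measure, the 0.2 threshold, and the example codes are all assumptions for illustration.

```python
from collections import Counter

def code_distribution(codes):
    """Relative frequency of each code in a list of coded records."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

class DriftSentinel:
    """Hypothetical monitor: flags a window whose code mix departs from baseline."""

    def __init__(self, baseline_codes, threshold=0.2):
        self.baseline = code_distribution(baseline_codes)
        self.threshold = threshold  # assumed value, to be tuned per deployment

    def check(self, window_codes):
        distance = total_variation(self.baseline, code_distribution(window_codes))
        return distance, distance > self.threshold

# Placeholder code lists; the shift mimics a change in coding practice.
baseline = ["E11.9"] * 70 + ["E11.65"] * 20 + ["R73.03"] * 10
recent = ["E11.9"] * 40 + ["E11.65"] * 50 + ["R73.03"] * 10
sentinel = DriftSentinel(baseline)
distance, drifted = sentinel.check(recent)
print(f"distance={distance:.2f} drifted={drifted}")
```

A flag from such a monitor says only that the coded data has shifted; deciding whether the cause is clinical change or documentation change would still require human review.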

Do Large Language Models Mentalize When They Teach?

2604.01594v1 by Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky

How do LLMs decide what to teach next: by reasoning about a learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.

摘要：如何決定 LLM 接下來要教什麼：是通過推理學習者的知識，還是使用更簡單的經驗法則？我們在一個受控任務中測試這一點，該任務之前用於研究人類教學策略。在每次試驗中，教師 LLM 看到一個假設學習者在一個獎勵標註的有向圖中的軌跡，並必須揭示一條邊，以便學習者在重新規劃時能選擇更好的路徑。我們運行一系列 LLM 作為模擬教師，並用與人類相同的認知模型來擬合他們的逐次選擇：一個貝葉斯最優教師，推斷學習者缺失的轉換（逆向規劃）、較弱的貝葉斯變體、啟發式基準（例如，基於獎勵的）和非心理化的效用模型。在與呈現給人類受試者的刺激相匹配的基準實驗中，大多數 LLM 表現良好，策略在試驗中變化不大，且它們的圖對圖表現與人類相似。模型比較（BIC）顯示，貝葉斯最優教學最能解釋大多數模型的選擇。當給予一個支架干預時，模型遵循輔助推理或獎勵為重點的提示，但這些支架並不可靠地改善在與啟發式不一致的測試圖上的後續教學，有時甚至會降低表現。總體而言，認知模型擬合提供了對 LLM 輔導政策的洞察，並顯示提示遵從並不保證更好的教學決策。
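
The model comparison mentioned above relies on the Bayesian Information Criterion, BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of observations, and L the maximised likelihood; lower values indicate a better trade-off between fit and complexity. The sketch below applies that formula to three candidate teaching models. The log-likelihoods, parameter counts, and trial count are invented for illustration and are not results from the paper.

```python
import math

def bic(log_likelihood, n_params, n_observations):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_observations) - 2.0 * log_likelihood

# Hypothetical fits of three cognitive models to one LLM's trial-by-trial choices.
fits = {
    "bayes_optimal": {"log_likelihood": -120.0, "n_params": 1},
    "reward_heuristic": {"log_likelihood": -135.0, "n_params": 1},
    "non_mentalizing_utility": {"log_likelihood": -118.0, "n_params": 4},
}
n_trials = 100

scores = {name: bic(f["log_likelihood"], f["n_params"], n_trials)
          for name, f in fits.items()}
best = min(scores, key=scores.get)
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} BIC={value:.1f}")
print("best:", best)
```

In this made-up case the four-parameter model has the highest likelihood but loses once the complexity penalty is applied, which is the behaviour BIC is designed to produce.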

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

2604.01532v1 by Ayan Das, Dhaval Patel

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.

摘要：大型語言模型（LLM）代理人越來越多地被部署於複雜的工具協調任務中，然而現有的基準無法捕捉到工業領域的嚴格需求，在這些領域中，不正確的決策會帶來重大的安全和財務後果。為了解決這一關鍵缺口，我們推出了PHMForge，這是第一個專門設計用於評估LLM代理人在預測與健康管理（PHM）任務上的綜合基準，通過與特定領域的MCP伺服器進行現實互動。我們的基準涵蓋了75個專家策劃的場景，跨越7個工業資產類別（渦扇發動機、軸承、電動馬達、齒輪箱、航空發動機），涵蓋5個核心任務類別：剩餘使用壽命（RUL）預測、故障分類、發動機健康分析、成本效益分析和安全/政策評估。為了實現嚴格的評估，我們在兩個MCP伺服器上構建了65個專門工具，並實施了基於執行的評估者，使用與任務相稱的指標：回歸的MAE/RMSE、分類的F1分數以及健康評估的類別匹配。通過對領先框架（ReAct、Cursor Agent、Claude Code）與前沿LLM（Claude Sonnet 4.0、GPT-4o、Granite-3.0-8B）的廣泛評估，我們發現即使是表現最佳的配置也僅達到68%的任務完成率，在工具協調（23%的不正確排序）、多資產推理（下降14.9個百分點）和跨設備泛化（在保留數據集上為42.7%）方面存在系統性失敗。我們開源了完整的基準，包括場景規範、真實模板、工具實現和評估腳本，以促進代理工業AI的研究。
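
The task-commensurate metrics listed in this abstract (MAE and RMSE for regression, F1-score for classification) have standard definitions, sketched below in plain Python. This is a generic illustration and not the benchmark's evaluator code; the remaining-useful-life values and fault labels are invented.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def f1_score(y_true, y_pred, positive):
    """Binary F1 for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical agent outputs: RUL in operating cycles, and fault labels.
rul_true = [120, 80, 45]
rul_pred = [110, 85, 45]
fault_true = ["fault", "ok", "fault", "ok"]
fault_pred = ["fault", "fault", "fault", "ok"]

print("MAE :", mae(rul_true, rul_pred))
print("RMSE:", round(rmse(rul_true, rul_pred), 3))
print("F1  :", round(f1_score(fault_true, fault_pred, positive="fault"), 3))
```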

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

2604.01483v1 by Devakh Rashie, Veda Rashi

The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large language models (LLMs) are probabilistic, non-deterministic systems operating in domains that demand absolute, mathematically verifiable compliance guarantees. Existing guardrail solutions -- including NVIDIA NeMo Guardrails and Guardrails AI -- rely on probabilistic classifiers and syntactic validators that are fundamentally inadequate for enforcing complex multi-variable regulatory constraints mandated by the SEC, FINRA, and OCC. This paper presents the Lean-Agent Protocol, a formal-verification-based AI guardrail platform that leverages the Aristotle neural-symbolic model developed by Harmonic AI to auto-formalize institutional policies into Lean 4 code. Every proposed agentic action is treated as a mathematical conjecture: execution is permitted if and only if the Lean 4 kernel proves that the action satisfies pre-compiled regulatory axioms. This architecture provides cryptographic-level compliance certainty at microsecond latency, directly satisfying SEC Rule 15c3-5, OCC Bulletin 2011-12, FINRA Rule 3110, and CFPB explainability mandates. A three-phase implementation roadmap from shadow verification through enterprise-scale deployment is provided.

摘要：自主、代理型人工智慧在金融服務領域的快速演變引發了一場存在性的架構危機：大型語言模型（LLMs）是概率性、非決定性的系統，運作於需要絕對、數學上可驗證的合規保證的領域。現有的防護解決方案——包括NVIDIA NeMo Guardrails和Guardrails AI——依賴於概率分類器和語法驗證器，這些工具在執行由SEC、FINRA和OCC所要求的複雜多變量監管約束方面根本不夠充分。本文提出了Lean-Agent Protocol，一個基於形式驗證的AI防護平台，利用Harmonic AI開發的亞里士多德神經符號模型，將機構政策自動形式化為Lean 4代碼。每一個提議的代理行動都被視為數學猜想：只有當Lean 4內核證明該行動滿足預編譯的監管公理時，才允許執行。這種架構在微秒延遲內提供加密級別的合規確定性，直接滿足SEC第15c3-5條規則、OCC公告2011-12、FINRA第3110條規則和CFPB可解釋性要求。提供了一個從影子驗證到企業級部署的三階段實施路線圖。
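
The core idea in this abstract, that an action executes only if a proof of compliance type-checks, can be illustrated with a small Lean 4 sketch. This is not the Lean-Agent Protocol itself: `TradeOrder`, `Compliant`, `execute`, and the single credit-limit rule are hypothetical stand-ins, and a real regulatory policy would involve far richer constraints.

```lean
-- Hypothetical order type with a pre-trade credit limit.
structure TradeOrder where
  notional : Nat
  limit    : Nat

-- The "regulatory axiom" in this toy setting: notional must not exceed the limit.
def Compliant (o : TradeOrder) : Prop := o.notional ≤ o.limit

-- Make the rule decidable so the kernel can check concrete orders automatically.
instance (o : TradeOrder) : Decidable (Compliant o) :=
  inferInstanceAs (Decidable (o.notional ≤ o.limit))

-- Execution demands a proof of compliance as an argument.
def execute (o : TradeOrder) (_h : Compliant o) : String :=
  s!"executed order with notional {o.notional}"

-- A compliant order: the proof obligation is discharged by `decide`.
#eval execute ⟨500, 1000⟩ (by decide)

-- A non-compliant order cannot be executed, because no proof exists:
-- #eval execute ⟨5000, 1000⟩ (by decide)   -- fails to elaborate
```

The design point is that the check happens at type-checking time: there is no runtime branch that could be skipped, since a call to `execute` without a valid proof term is rejected before it can run.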

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

2604.01449v2 by Khalid Adnan Alsayed

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

摘要：人工智慧（AI）系統越來越多地融入醫療保健和藥房工作流程，支持藥物建議、劑量確定和藥物相互作用檢測等任務。雖然這些系統在標準評估指標下通常表現良好，但它們在現實世界決策中的可靠性仍然不足以理解。在藥物管理等高風險領域，即使是一個錯誤的建議也可能導致嚴重的患者傷害。本文通過關注系統故障及其潛在臨床後果來檢視AI輔助藥物系統的可靠性。這項工作不僅僅通過總體指標來評估性能，而是將注意力轉向錯誤是如何發生的，以及當AI系統產生錯誤輸出時會發生什麼。通過一系列控制的模擬場景，涉及藥物相互作用和劑量決策，我們分析了不同類型的系統故障，包括漏掉的相互作用、不正確的風險標示和不當的劑量建議。研究結果強調，AI在藥物相關情境中的錯誤可能導致不良藥物反應、無效治療或延遲護理，特別是在系統使用時缺乏足夠的人類監督。此外，本文討論了過度依賴AI建議的風險以及決策過程中透明度有限所帶來的挑戰。這項工作為醫療保健中AI評估提供了一個以可靠性為重點的視角，強調理解故障行為和現實影響的重要性。它突顯了在安全關鍵領域如藥房實踐中，將傳統性能指標與風險意識評估方法相結合的必要性。

Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering

2604.01437v1 by Jingyue Li, André Storhaug

With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.

摘要：隨著代理式人工智慧（Agentic AI）的進步，研究人員越來越多地利用自主代理來解決軟體工程（SE）中的挑戰。然而，支撐這些代理的大型語言模型（LLMs）通常作為黑箱運作，使得難以證明代理式人工智慧方法相對於基準的優越性。此外，評估設計描述中缺失的信息經常使得結果的再現變得不可行。為了綜合目前在軟體工程中對代理式人工智慧的評估實踐，本研究分析了18篇相關論文，這些論文已被ICSE 2026、ICSE 2025、FSE 2025、ASE 2025和ISSTA 2025接受或發表。該分析識別了在當前研究和潛在未來研究中評估代理式人工智慧的主要方法及其局限性。為了解決這些不足之處，本立場文件提出了一套指導方針和建議，旨在促進對軟體工程中代理式人工智慧的可再現、可解釋和有效的評估。特別是，我們建議代理式人工智慧研究人員公開他們的思考-行動-結果（TAR）軌跡和大型語言模型互動數據，或這些文獻的總結版本。這樣做將使後續研究能夠更有效地分析不同代理式人工智慧方法的優缺點。為了展示這種比較的可行性，我們提出了一個概念驗證案例研究，說明TAR軌跡如何支持跨方法的系統分析。
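
The recommendation to publish Thought-Action-Result (TAR) trajectories implies some shared, machine-readable format. The abstract does not define one, so the sketch below assumes a simple record with three text fields serialised as JSON Lines; the `TARStep` class, the field names, and the example steps are hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TARStep:
    """One step of an agent run: what it reasoned, what it did, what came back."""
    thought: str
    action: str
    result: str

def to_jsonl(steps):
    """Serialise a trajectory as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in steps)

def from_jsonl(text):
    """Parse a JSON Lines trajectory back into TARStep records."""
    return [TARStep(**json.loads(line)) for line in text.splitlines() if line.strip()]

# Hypothetical two-step trajectory from a bug-fixing agent.
trajectory = [
    TARStep("The failing test points at the parser.", "open parser.py", "file loaded"),
    TARStep("The off-by-one is in the loop bound.", "edit parser.py", "tests pass"),
]
serialized = to_jsonl(trajectory)
restored = from_jsonl(serialized)
print(serialized)
```

A line-oriented format like this keeps each step independently parseable, which makes it straightforward to compare step counts or action types across agent approaches.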

Semantic Modeling for World-Centered Architectures

2604.01359v1 by Andrei Mantsivoda, Darya Gavrilina

We introduce world-centered multi-agent systems (WMAS) as an alternative to traditional agent-centered architectures, arguing that structured domains such as enterprises and institutional systems require a shared, explicit world representation to ensure semantic consistency, explainability, and long-term stability. We classify worlds along dimensions including ontological explicitness, normativity, etc. In WMAS, learning and coordination operate over a shared world model rather than isolated agent-local representations, enabling global consistency and verifiable system behavior. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.

摘要：我們介紹以世界為中心的多代理系統（WMAS），作為傳統以代理為中心架構的替代方案，並主張結構化的領域如企業和機構系統需要共享的、明確的世界表徵，以確保語義一致性、可解釋性和長期穩定性。我們根據本體明確性、規範性等維度對世界進行分類。在WMAS中，學習和協調是在共享的世界模型上進行，而不是孤立的代理本地表徵，這使得全球一致性和可驗證的系統行為成為可能。我們提出語義模型作為表示這些世界的數學形式。最後，我們展示了Ontobox平台，作為WMAS的實現。

Safety, Security, and Cognitive Risks in World Models

2604.01346v1 by Manoj Parmar

World models -- learned internal simulators of environment dynamics -- are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.

摘要：世界模型——學習的環境動態內部模擬器——正迅速成為機器人、自主車輛和自主人工智慧中自主決策的基礎。然而，這種預測能力引入了一組獨特的安全性、保安性和認知風險。對手可以腐敗訓練數據、毒害潛在表示，並利用累積的展開錯誤來造成在安全關鍵部署中的災難性失敗。配備世界模型的代理更容易出現目標誤泛化、欺騙性對齊和獎勵駭客，正因為它們能夠模擬自身行動的後果。權威的世界模型預測進一步助長了自動化偏見和不當校準的人類信任，操作員缺乏審計工具。
本文調查了世界模型的現狀；引入了軌跡持續性和表示風險的正式定義；提出了五種攻擊者能力分類法；並發展了一個統一的威脅模型，將MITRE ATLAS和OWASP LLM前10名擴展到世界模型堆棧。我們提供了一個關於軌跡持續性對抗攻擊的實證概念驗證（GRU-RSSM: A_1 = 2.26倍增強，對抗微調下減少59.5%；隨機RSSM代理: A_1 = 0.65倍；DreamerV3檢查點: 確認非零行動漂移）。我們通過四個部署場景說明了風險，並提出了跨學科的緩解措施，涵蓋對抗加固、對齊工程、NIST AI RMF和EU AI法案治理，以及人因設計。我們認為，世界模型必須被視為安全關鍵基礎設施，需遵循與飛行控制軟件或醫療設備相同的嚴謹性。

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

2604.01215v1 by Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.

摘要：AI 天氣預測迅速進步，但尚未有統一的數學框架解釋預測技能的決定因素。現有理論主要針對特定的架構選擇，而非整個學習流程，2023-2026 年的操作證據顯示，訓練方法、損失函數設計和數據多樣性至少與架構選擇一樣重要。本文做出兩個交錯的貢獻。在理論上，我們構建了一個根植於球面近似理論、動態系統理論、信息理論和統計學習理論的框架，處理完整的學習流程（架構、損失函數、訓練策略、數據分佈），而不僅僅是架構。我們建立了一個學習流程誤差分解，顯示在當前規模下，估計誤差（依賴於損失和數據）主導了近似誤差（依賴於架構）。我們發展了一個損失函數頻譜理論，形式化了在球面諧波坐標中由均方誤差引起的頻譜模糊，並推導出分佈外外推界限，證明數據驅動模型系統性地低估了創紀錄的極值，且偏差隨著紀錄超越的線性增長而增長。在實證上，我們通過在十個架構多樣的 AI 天氣模型中進行推斷，使用 NVIDIA Earth2Studio 和 ERA5 初始條件，驗證這些預測，評估跨越所有季節的 30 個初始化日期的六個指標。結果確認了對於均方誤差訓練模型在高波數下的普遍頻譜能量損失，顯示出上升的誤差共識比，表明大多數預測誤差在架構之間是共享的，並在極端事件期間出現線性負偏差。一個整體模型評估分數提供了統一的多維評估，而一個指導性框架使得在訓練前對提議的流程進行數學評估成為可能。
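
The MSE-induced spectral blurring mentioned in this abstract has a simple intuition: the forecast that minimises mean squared error is the mean over plausible outcomes, and averaging waves whose position is uncertain damps short wavelengths more than long ones. The one-dimensional toy below illustrates that effect only; it is not the paper's spherical-harmonic theory, and the uniform position uncertainty and all parameter values are assumptions.

```python
import math

def mean_forecast_amplitude(wavenumber, position_spread, n_members=201):
    """Amplitude of the ensemble mean of cos(k * (x + shift)).

    Shifts are spread evenly over [-position_spread, position_spread]. By
    symmetry the mean field is A * cos(k * x), so A is read off at x = 0.
    """
    shifts = [position_spread * (2 * i / (n_members - 1) - 1)
              for i in range(n_members)]
    return sum(math.cos(wavenumber * s) for s in shifts) / n_members

spread = 0.3  # assumed positional uncertainty, in the same units as x
amplitudes = {k: mean_forecast_amplitude(k, spread) for k in (1, 2, 4, 8)}
for k, amp in amplitudes.items():
    print(f"wavenumber {k}: retained amplitude {amp:.3f}")
```

Each individual ensemble member has amplitude 1 at every wavenumber; only the MSE-optimal mean loses energy, and it loses more at high wavenumbers, which matches the qualitative pattern the abstract reports for MSE-trained models.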

The Overlooked Repetitive Lengthening Form in Sentiment Analysis

2604.01268v1 by Lei Wang, Eduard Dragut

Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF

摘要：個體在進行線上溝通時，經常以非正式的風格表達個人意見（例如，迷因和表情符號）。雖然帶有非正式交流的語言模型（LMs）已被廣泛討論，但一種獨特且強調的風格——重複延長形式（RLF）卻多年來被忽視。在本文中，我們探討兩個研究問題的答案：1）RLF對情感分析（SA）是否重要？2）LMs能理解RLF嗎？受到以往語言學研究的啟發，我們策劃了Lengthening，這是第一個專注於RLF的850k樣本的多領域數據集，旨在用於SA。此外，我們介紹了Explainable Instruction Tuning（ExpInstruct），這是一個兩階段的指令調整框架，旨在提高LLMs在RLF上的性能和可解釋性。我們進一步提出了一種新穎的統一方法來量化LMs對非正式表達的理解。我們顯示RLF句子是富有表現力的表達，並且可以作為文件級情感的標誌。此外，RLF對線上內容分析具有潛在價值。我們的結果顯示，經過微調的預訓練語言模型（PLMs）在性能上可以超越零樣本GPT-4，但在解釋上則不然。最後，我們顯示ExpInstruct可以改善開源LLMs，使其在性能和可解釋性上與零樣本GPT-4相匹配，儘管樣本有限。代碼和樣本數據可在https://github.com/Tom-Owl/OverlookedRLF獲得。
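
A first step in any study of the Repetitive Lengthening Form is spotting candidate words such as "sooooo". The abstract does not give the authors' extraction rule, so the sketch below assumes a simple heuristic: a word counts as lengthened if it contains the same letter three or more times in a row. The regular expression, the threshold of three, and the function name are assumptions; lengthened punctuation such as "!!!" is deliberately out of scope here.

```python
import re

# A letter (not a digit or underscore) followed by at least two repeats of itself.
LENGTHENING = re.compile(r"([^\W\d_])\1{2,}")

def find_lengthened_words(text):
    """Return the words in text that contain a run of three or more identical letters."""
    return [word for word in re.findall(r"\w+", text) if LENGTHENING.search(word)]

sample = "This movie is sooooo goooood, 1000 times better than the book!!!"
hits = find_lengthened_words(sample)
print(hits)
```

Restricting the pattern to letters keeps numbers like "1000" from being flagged, and a threshold of three avoids matching ordinary double letters as in "book" or "better".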

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

2604.00931v2 by Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He

Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

摘要：現有的AI心理諮詢師方法主要依賴於使用靜態對話數據集的監督微調。然而，這與人類專家形成對比，人類專家通過臨床實踐和積累的經驗不斷提升自己的專業能力。為了彌補這一差距，我們提出了一個以經驗驅動的終身學習代理（\texttt{PsychAgent}）用於心理諮詢。首先，我們建立了一個針對長期多次會話互動的記憶增強規劃引擎，這確保了通過持久記憶和戰略規劃實現治療的連續性。其次，為了支持自我演化，我們設計了一個技能演化引擎，從歷史諮詢軌跡中提取基於實踐的新技能。最後，我們引入了一個強化內化引擎，通過拒絕微調將演化的技能整合到模型中，旨在提高在各種情境下的表現。比較分析顯示，我們的方法在所有報告的評估維度上都達到了比強大的通用LLM（例如，GPT-5.4、Gemini-3）和特定領域基準更高的分數。這些結果表明，終身學習可以提高多次會話諮詢回應的一致性和整體質量。

Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

2604.00770v1 by Swapnil Parekh

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.

摘要：一種新一代的語言模型完全在連續隱藏狀態中進行推理，產生不出任何標記，也不留下任何審計痕跡。我們顯示這種沉默創造了一個根本新的攻擊面。ThoughtSteer 在輸入層擾動單一嵌入向量；模型自身的多次推理將這個擾動放大成一條被劫持的潛在軌跡，可靠地產生攻擊者所選擇的答案，同時對每個標記級別的防禦保持結構上不可見。在兩種架構（Coconut 和 SimCoT）、三個推理基準以及從 124M 到 3B 參數的模型規模中，ThoughtSteer 實現了 >=99% 的攻擊成功率，並且在接近基線的清晰準確率下，無需重新訓練即可轉移到保留的基準（94-100%），迴避了所有五個評估的主動防禦，並在 25 個時期的清晰微調中存活下來。我們將這些結果追溯到一個統一的機制：潛在空間中的神經崩潰將觸發的表示拉向一個緊密的幾何吸引子，解釋了為什麼防禦失敗以及為什麼任何有效的後門必須留下線性可分的簽名（探測 AUC >= 0.999）。然而，一個引人注目的悖論出現了：即使模型輸出錯誤的答案，單個潛在向量仍然編碼正確的答案。對抗信息不在任何單一向量中，而是在集體軌跡中，建立了後門擾動作為連續推理機械解釋的新視角。代碼和檢查點可用。

Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

2604.00505v1 by Yunwen Lei, Yufeng Xie

Overparameterized neural networks often exhibit benign overfitting: they generalize well even though the number of parameters exceeds the number of training examples. A promising direction for explaining benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm itself. However, existing initialization-dependent complexity analyses cannot fully exploit the power of initialization, since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width; such bounds are therefore not effective for overparameterized models. In this paper, we develop the first fully initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization and are derived by introducing a new peeling technique to handle the challenge posed by the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

摘要：過度參數化的神經網絡通常顯示出良性的過擬合特性，因為儘管參數數量超過訓練樣本數量，但仍能實現優秀的泛化行為。解釋良性過擬合的一個有前景的方向是將泛化與從初始化的距離的範數相關聯，這是基於實證觀察，該距離通常顯著小於範數本身。然而，現有的依賴初始化的複雜性分析無法充分利用初始化的力量，因為相關的界限依賴於初始化矩陣的譜範數，該範數可能隨著寬度的平方根函數而增長，因此對於過度參數化的模型並不有效。在本文中，我們為具有一般Lipschitz激活函數的淺層神經網絡開發了第一個\emph{完全}依賴初始化的複雜性界限，該界限在寬度上享有對數依賴。我們的界限依賴於從初始化的距離的路徑範數，這是通過引入一種新的剝離技術來處理與初始化相關的約束所衍生的。我們還開發了一個緊的下界，直到一個常數因子。最後，我們進行了實證比較，並顯示我們的泛化分析對於過度參數化的網絡暗示了非空的界限。
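
The remark that the spectral norm of the initialization matrix can grow like the square root of the width can be checked numerically. The sketch below is illustrative only: it assumes i.i.d. standard Gaussian entries (the abstract does not state the initialization scheme), and the helper names `gaussian_matrix` and `spectral_norm` are hypothetical. The norm is estimated by power iteration, so the returned value is an approximation from below.

```python
import math
import random


def gaussian_matrix(rows, cols):
    """Matrix of i.i.d. N(0, 1) entries; rows play the role of the width."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def spectral_norm(matrix, iters=30):
    """Estimate the largest singular value by power iteration on A^T A."""
    cols = len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-norm starting vector
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]  # u = A v
        w = [sum(row[j] * u_i for row, u_i in zip(matrix, u))       # w = A^T u
             for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        v = [x / norm_w for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    return math.sqrt(sum(x * x for x in u))  # ||A v|| with ||v|| = 1


if __name__ == "__main__":
    random.seed(0)
    for width in (256, 1024):
        print(width, round(spectral_norm(gaussian_matrix(width, 8)), 2))
```

Under this assumption, quadrupling the width should roughly double the estimated norm, which is why a bound that scales with this quantity becomes loose for wide networks.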

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

2604.00493v1 by Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

摘要：胸部X光檢查（CXRs）是全球最常執行的影像檢查之一，但不斷增加的影像量增加了放射科醫師的工作負擔和診斷錯誤的風險。儘管人工智慧（AI）系統在CXR解讀方面顯示出潛力，但大多數僅生成最終預測，而未明確說明視覺證據如何轉化為放射學發現和診斷預測。我們提出了CheXOne，一種具備推理能力的視覺-語言模型，用於CXR解讀。CheXOne共同生成診斷預測和明確的、臨床基礎的推理痕跡，將視覺證據、放射學發現和這些預測連結起來。該模型在從30個公共數據集中策劃的1470萬個指令和推理樣本上進行訓練，涵蓋36個CXR解讀任務，使用一種結合指令調整和強化學習的兩階段框架，以提高推理質量。我們在零樣本設置下評估CheXOne，涵蓋視覺問題回答、報告生成、視覺定位和推理評估，共涉及17個評估設置。CheXOne在現有的醫學和通用領域基礎模型中表現優於其他模型，並在獨立公共基準上取得了良好的表現。一項臨床讀者研究顯示，CheXOne撰寫的報告在55%的案例中與住院醫師撰寫的報告相當或更好，同時有效地解決臨床指徵，並提高報告撰寫和CXR解讀的效率。進一步的分析涉及放射科醫師，顯示生成的推理痕跡具有高臨床事實性，並為最終預測提供因果支持，為性能提升提供了合理的解釋。這些結果表明，明確的推理可以改善模型性能、可解釋性和在AI輔助CXR解讀中的臨床實用性。

Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

2604.00397v1 by Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang

Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a single institution often perform poorly at other sites because of differences in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that allows BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated using domain classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. Compared to the baseline nnU-Net, the combined method raised the mean F1 by 11.1% (0.700 to 0.778) and the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without requiring target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.

摘要：背景：深度學習在自動化腦轉移（BM）分割方面顯示出顯著潛力；然而，在單一機構訓練的模型在不同地點的表現往往不佳，這是由於掃描儀硬體、影像協議和患者人口統計的差異。本研究的目標是創建一個領域適應框架，使得BM分割能夠在多個機構之間使用。
方法：我們提出了一個VAE-MMD預處理管道，該管道將變分自編碼器（VAE）與最大均值差異（MMD）損失結合，並在nnU-Net分割的基礎上融入跳躍連接和自注意力機制。該方法在來自四個公共數據庫的740名患者上進行了測試：斯坦福大學、加州大學舊金山分校、馬德里大學和PKG，通過領域分類器的準確性、靈敏度、精確度、F1/F2分數、表面Dice（sDice）和第95百分位Hausdorff距離（HD95）進行評估。
結果：VAE-MMD將領域分類器的準確性從0.91降低到0.50，表明成功實現了跨機構的特徵對齊。重建的體積達到了大於36 dB的PSNR，保持了解剖準確性。該綜合方法使得平均F1提高了11.1%（從0.700到0.778），平均sDice提高了7.93%（從0.7121到0.7686），並且在四個中心之間相比基線nnU-Net將平均HD95降低了65.5%（從11.33到3.91 mm）。
結論：VAE-MMD有效減少了跨機構數據的異質性，並增強了BM分割在體積、檢測和邊界級別指標上的泛化能力，而無需目標領域標籤，從而克服了臨床實施AI輔助分割的一個重大障礙。
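
To make the MMD term more tangible, the following is a minimal sketch of the standard biased estimate of squared maximum mean discrepancy between two sets of feature vectors. It is not the paper's pipeline: the RBF kernel, the `bandwidth` parameter, and the function names are assumptions chosen for illustration, and the VAE, skip connections, and nnU-Net components are omitted entirely.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    site_a = [[0.0, 0.0], [0.1, 0.0]]   # hypothetical features from one site
    site_b = [[5.0, 5.0], [5.1, 5.0]]   # hypothetical features from another site
    print(mmd_squared(site_a, site_b))  # large when distributions differ
    print(mmd_squared(site_a, site_a))  # zero for identical samples
```

Used as a training loss, a term of this kind penalizes latent features whose distribution differs between institutions, which is consistent with the reported drop in domain classifier accuracy to chance level.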

Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

2604.00319v1 by Syed Eqbal Alam, Zhan Shu

We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send the results to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-time scale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of order $\mathcal{O}(m)$ for $m$ modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.

摘要：我們為多演員、多評論者的聯邦多智能體系統開發了協作控制的算法。每個AI智能體和評論者都可以訪問傳統機器學習或生成AI基礎模型。AI智能體和評論者與中央伺服器合作，以完成多模態任務，例如在網絡遙測系統中的故障檢測、嚴重性和原因分析、文本到圖像生成、視頻生成、從醫療影像和病歷進行的醫療診斷等。AI智能體完成其任務並將其發送給AI評論者進行評估。評論者然後向智能體發送反饋，以改善其回應。他們協同工作，最小化系統的整體成本，且不進行智能體或評論者之間的通信。AI智能體和評論者將其成本函數或成本函數的導數保持私密。使用多時間尺度隨機近似技術，我們提供了AI智能體和評論者的時間平均活動狀態的收斂保證。通信開銷對系統的影響較小，約為$\mathcal{O}(m)$，其中$m$為模態數，且與AI智能體和評論者的數量無關。最後，我們展示了一個在網絡遙測中進行故障檢測、嚴重性和原因分析的例子，以及徹底的評估以檢查算法的有效性。

AI-Mediated Explainable Regulation for Justice

2604.00237v1 by Thomas Hofweber, Andreas Sudmann, Evangelos Pournaras

The present practice of deciding on regulation faces numerous problems that make adopted regulations static, unexplained, unduly influenced by powerful interest groups, and stained with a perception of illegitimacy. These well-known problems with the regulatory process can lead to injustice and have substantial negative effects on society and democracy. We discuss a new approach that utilizes distributed artificial intelligence (AI) to make a regulatory recommendation that is explainable and adaptable by design. We outline the main components of a system that can implement this approach and show how it would resolve the problems with the present regulatory system. This approach models and reasons about stakeholder preferences with separate preference models, and it aggregates these preferences in a value-sensitive way. Such recommendations can be updated in response to changes in facts or in values and are inherently explainable. We suggest how stakeholders can make their preferences known to the system and how they can verify whether they were properly considered in the regulatory decision. The resulting system promises to support regulatory justice, legitimacy, and compliance.

摘要：目前的監管決策實踐面臨許多問題，使得所採納的規範變得靜態、無法解釋、過度受到強大利益團體的影響，並且帶有不合法的印象。這些監管過程中眾所周知的問題可能導致不公正，並對社會和民主產生重大負面影響。我們討論了一種新的方法，利用分散式人工智慧（AI）來提出可解釋和可調整的監管建議。我們概述了可以實施這種方法的系統的主要組成部分，並展示它如何解決當前監管系統的問題。這種方法使用獨立的偏好模型來建模和推理利益相關者的偏好，同時以價值敏感的方式聚合這些偏好。這樣的建議可以根據事實或價值的變化進行更新，並且本質上是可解釋的。我們建議利益相關者如何向系統表達他們的偏好，以及他們如何驗證這些偏好是否在監管決策中得到了適當考慮。最終的系統承諾支持監管公正、合法性和合規性。

2604.00187v1 by Abu Noman Md Sakib, Protik Dey, Zijie Zhang, Taslima Akter

Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.

摘要：可解釋的人工智慧（XAI）對於確保信任和問責至關重要，但其發展仍主要以視覺為主。對於盲人和低視力（BLV）使用者而言，缺乏可及的解釋造成了獨立使用人工智慧驅動的輔助技術的根本障礙。隨著人工智慧系統從單一查詢工具轉變為自主代理，執行多步驟行動並在擴展的任務範疇內做出重要決策，這一問題愈加嚴重，其中一個未被檢測的錯誤可能在任何反饋可用之前無法逆轉地擴散。本文通過對用戶訪談和當代研究的綜合分析，探討了BLV社群獨特的XAI需求。通過檢視環境感知和決策支持的使用模式，我們識別出一個顯著的模態差距。實證證據表明，雖然BLV使用者非常重視對話式解釋，但他們經常對人工智慧的失敗感到“自責”。本文最後提出了一個針對代理系統中可及的可解釋人工智慧的研究議程，倡導多模態介面、關注責任的解釋設計和參與式開發。

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

2603.29950v1 by Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.

摘要：有效的合作需要團隊通過社會共享學習調節（SSRL）來管理複雜的認知和情感狀態。生理同步（即生理信號的縱向對齊）可以指示這些狀態，但單獨解釋起來較為困難。我們研究了四個醫療二人組在使用智能輔導系統診斷虛擬病人案例時的生理和對話動態。對話中的語義變化與瞬時生理同步峰值相關聯。我們還對發言片段進行了SSRL編碼，並使用句子嵌入推導了餘弦相似度。結果顯示，激活先前知識的語義相似度顯著低於較簡單的任務執行。高生理同步與較低的語義相似度相關，這表明這樣的時刻涉及探索性和多樣化的語言使用。質性分析將這些同步峰值三角測量為“關鍵時刻”：成功的團隊在共享發現時同步，而不成功的團隊在共享不確定性時達到峰值。本研究通過展示如何將生物信號與對話融合，以理解問題解決中的關鍵時刻，推進了以人為中心的人工智慧。
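
The semantic-similarity measure mentioned above can be sketched as cosine similarity between embedding vectors. This is an illustrative sketch only: the abstract does not say which segments are compared or which embedding model is used, so comparing consecutive segments is an assumption, and the toy vectors below stand in for real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def consecutive_similarities(embeddings):
    """Similarity between each segment embedding and the next one."""
    return [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]  # placeholder vectors
    print(consecutive_similarities(toy_embeddings))
```

A drop in the similarity between adjacent segments is one simple way to flag a semantic shift in the dialogue.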

Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence

2603.29915v1 by Georgii Mikriukov, Grégoire Montavon, Marina M. -C. Höhne

Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: "improving worst-case explanations" (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and "recalling high-quality explanations" (deferring explanation generation for uncertain samples under a constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.

摘要：後設解釋方法被廣泛用於解釋黑箱預測，但其生成通常計算成本高昂且可靠性無法保證。我們提出將認知不確定性作為解釋可靠性的低成本代理：高認知不確定性識別出決策邊界定義不明確的區域，以及解釋變得不穩定和不忠實的地方。這一見解使得兩個互補的使用案例成為可能：改善最壞情況下的解釋（根據預期解釋可靠性將樣本路由到便宜或昂貴的XAI方法），以及召回高質量解釋（在預算受限的情況下，對不確定樣本延遲生成解釋）。在四個表格數據集、五種不同架構和四種XAI方法中，我們觀察到認知不確定性與解釋穩定性之間存在強烈的負相關。進一步分析顯示，認知不確定性不僅區分穩定和不穩定的解釋，還區分忠實和不忠實的解釋。對圖像分類的實驗確認了我們的發現超越了表格數據的普遍性。
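
The routing use case can be sketched as a simple gate. Everything specific here is an assumption made for illustration rather than the paper's procedure: epistemic uncertainty is approximated by the variance of an ensemble's predicted probabilities, uncertain samples are sent to the expensive explainer, and the `threshold` value and function names are hypothetical.

```python
import statistics


def epistemic_uncertainty(member_probs):
    """Variance of the positive-class probability across ensemble members."""
    return statistics.pvariance(member_probs)


def route_samples(ensemble_probs, threshold):
    """Assign each sample to a 'cheap' or 'expensive' explanation method.

    ensemble_probs: one list of per-member probabilities per sample.
    Assumption: high disagreement means the cheap explanation is likely
    unreliable, so the sample is routed to the expensive method.
    """
    routes = []
    for member_probs in ensemble_probs:
        uncertain = epistemic_uncertainty(member_probs) > threshold
        routes.append("expensive" if uncertain else "cheap")
    return routes


if __name__ == "__main__":
    probs = [[0.90, 0.91, 0.89],   # members agree: low uncertainty
             [0.20, 0.80, 0.50]]   # members disagree: high uncertainty
    print(route_samples(probs, threshold=0.01))
```

In a budget-constrained setting, the threshold could instead be set from a quantile of the uncertainty scores so that only a fixed fraction of samples reaches the expensive method.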

Reasoning-Driven Synthetic Data Generation and Evaluation

2603.29791v1 by Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

摘要：儘管許多有興趣的 AI 應用需要專門的多模態模型，但訓練這些模型所需的相關數據本質上是稀缺或無法獲得的。用人類標註者填補這些空白的成本過高、容易出錯且耗時，導致模型構建者越來越考慮將合成數據作為可擴展的替代方案。然而，現有的合成數據生成方法通常依賴手動提示、進化算法或來自目標分佈的大量種子數據——這限制了它們的可擴展性、可解釋性和控制能力。在本文中，我們介紹了 Simula：一種新穎的基於推理的數據生成和評估框架。它採用無種子、主動的方式來大規模生成合成數據集，使用戶能夠通過可解釋和可控的過程定義所需的數據集特徵，從而實現細緻的資源分配。我們展示了我們的方法在各種數據集上的有效性，並嚴格測試了內在和下游特性。我們的工作 (1) 提供了合成數據機制設計的指導方針，(2) 提供了在大規模生成和評估合成數據的洞見，並 (3) 為在數據稀缺或隱私問題至關重要的領域開發和部署 AI 開辟了新的機會。

CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing

2603.29755v1 by Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth

Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real-time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per-criterion success rates reached 98.75% for planning and tool use, 97.3% for self-reflection, and 99.2% for collaboration. Runtime experiments report end-to-end latency of 50-60s per diagnostic workflow with near-linear scalability (R^2=0.97), confirming real-time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse's modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing.

摘要：現代製造環境需要即時、可信且可解釋的根本原因洞察，以維持生產力和品質。傳統的分析流程通常將異常檢測、因果推斷和根本原因分析視為孤立的階段，限制了可擴展性和可解釋性。在這項工作中，我們提出了CausalPulse，一種行業級的多代理協同工具，能自動化智能製造中的因果診斷。它通過基於標準化代理協議的神經符號架構，統一了異常檢測、因果發現和推理。CausalPulse正在羅伯特·博世的製造工廠中部署，與現有的監控工作流程無縫整合，並支持生產規模的即時運作。對公共（未來工廠）和專有（平面傳感器元件）數據集的評估顯示出高可靠性，整體成功率達到98.0%和98.73%。按標準的成功率分別為規劃和工具使用達到98.75%，自我反思為97.3%，協作為99.2%。運行時實驗報告每個診斷工作流程的端到端延遲為50-60秒，並具有近線性可擴展性（R^2=0.97），確認了即時準備性。與現有的工業協同工具的比較突顯了在模組化、可擴展性和部署成熟度方面的明顯優勢。這些結果展示了CausalPulse的模組化、人機協作設計如何實現可靠、可解釋且適合生產的自動化，為下一代製造提供支持。

Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

2603.29709v1 by Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe

Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

摘要：醫療編碼將自由文本的臨床文檔轉換為標準化代碼，這些代碼來自包含數萬條條目的分類系統，並每年更新一次。它對於計費、臨床研究和質量報告至關重要，但仍然主要依賴手動操作，速度慢且容易出錯。現有的自動化方法從標記數據中學習預測固定的代碼集，因此無法在不重新訓練不同數據的情況下適應新代碼或不同的編碼系統。它們也不提供對其預測的解釋，限制了在安全關鍵環境中的信任。我們介紹了醫療編碼的交響樂系統，這一系統以專業人類編碼員的方式處理任務：通過直接訪問編碼指南來推理臨床敘事。這一設計使得交響樂能夠在任何編碼系統中運作，並提供將每個預測代碼與支持其的文本聯繫起來的跨度級證據。我們在兩個公共基準和三個涵蓋美國和英國住院、門診、急診和專科環境的真實世界數據集上進行評估。交響樂在所有環境中都達到了最先進的結果，確立了其作為自動化臨床編碼的靈活、可部署的基礎。

Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor

2603.29681v1 by Christopher Koch

The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.

摘要：一般聲稱生成式 AI 只是放大了鄧寇克效應的說法過於粗糙，無法捕捉現有的證據。相反，最明確的研究結果表明，大型語言模型 (LLM) 的使用可以改善可觀察的輸出和短期任務表現，同時降低元認知的準確性，並在技能組之間平坦化經典的能力-信心梯度。本文綜合了來自人機互動、學習研究和模型評估的證據，並提出了 AI 介導的元認知解耦的工作模型：產出、基礎理解、校準準確性和自我評估能力之間的差距擴大。這四個變量的解釋比簡單的均勻陡峭的鄧寇克曲線更好地解釋了過度自信、過度和不足依賴、拐杖效應以及弱轉移。本文最後討論了對工具設計、評估和知識工作的影響。

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

2603.29366v1 by Moiz Sadiq Awan, Maryam Raza

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

摘要：事前授權仍然是美國醫療保健中最繁重的行政流程之一，每年耗費數十億美元和數千小時的醫生時間。雖然大型語言模型在臨床文本任務中顯示出潛力，但它們產生提交準備好的事前授權信的能力僅受到有限的關注，現有的工作僅限於單一案例的示範，而非結構化的多場景評估。我們評估了三種商業可用的LLM（GPT-4o、Claude Sonnet 4.5和Gemini 2.5 Pro），涵蓋了45個醫生驗證的合成場景，涉及風濕病學、精神病學、腫瘤學、心臟病學和骨科。這三個模型生成的信件具有強大的臨床內容：準確的診斷、結構良好的醫療必要性論證以及全面的步驟治療文檔。然而，對現實世界行政要求的次要分析揭示了一致的差距，這些差距僅僅依靠臨床評分無法捕捉，包括缺失的計費代碼、缺少的授權期限請求和不充分的後續計劃。這些發現重新框定了問題：臨床部署的挑戰不在於LLM是否能撰寫臨床上合格的信件，而在於圍繞它們構建的系統是否能提供支付者工作流程所需的行政精確性。

Rigorous Explanations for Tree Ensembles

2603.29361v1 by Yacine Izza, Alexey Ignatiev, Xuanxiang Huang, Peter J. Stuckey, Joao Marques-Silva

Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is to automatically identify explanations for the predictions made. Evidently, explanations can build trust only if they are rigorous, that is, if they truly reflect properties of the underlying predictor they explain. This paper investigates the computation of rigorously defined, logically sound explanations for the concrete case of two well-known examples of tree ensembles, namely random forests and boosted trees.

摘要：樹集成（TEs）在許多實際應用中發揮著重要作用。它們代表了機器學習方法中最通用且準確的類別之一。雖然它們的表達通常相當簡潔，但其運作對於人類決策者來說仍然難以理解。建立對樹集成運作信任的一個解決方案是自動識別所做預測的解釋。顯然，只有當這些解釋是嚴謹的，即真正反映了它們所解釋的基礎預測器的特性時，我們才能通過解釋來獲得信任。本文研究了對兩個知名樹集成示例，即隨機森林和提升樹，進行嚴謹定義且邏輯上合理的解釋的計算。

Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

2603.29176v1 by Siyuan Du, Siyi Li, Shuwei Bai, Ang Li, Haolin Li, Mingqing Xiao, Yang Pan, Dongsheng Li, Weidi Xie, Yanfeng Wang, Ya Zhang, Chencheng Zhang, Jiangchao Yao

Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, adding non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods that are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

Knowledge database development by large language models for countermeasures against viruses and marine toxins

2603.29149v1 by Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran

Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, the ChatGPT LLM is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

2603.29114v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias

Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.

A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

2603.29041v1 by Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus

Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.

Towards a Medical AI Scientist

2603.28589v1 by Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical autonomous research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

2603.28532v1 by Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely either on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

2603.28522v2 by Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga

We present LAD, a real-time language-action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.

The Unreasonable Effectiveness of Scaling Laws in AI

2603.28507v1 by Chien-Ping Lu

Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.
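
The abstract's central question, how many efficiency doublings are needed to keep scaling productive, follows from the power-law form by simple arithmetic. The sketch below assumes a loss of the form L(C) = a * C**(-alpha) with no irreducible term; the constants `a` and `alpha` are illustrative placeholders, not values from the paper.

```python
import math


def loss(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Power-law training loss L(C) = a * C**(-alpha) (illustrative constants)."""
    return a * compute ** (-alpha)


def doublings_needed(loss_ratio: float, alpha: float = 0.05) -> float:
    """Doublings of effective compute needed to scale the loss by loss_ratio.

    From L(C2) / L(C1) = (C2 / C1)**(-alpha), the required compute multiplier
    is loss_ratio**(-1 / alpha); its base-2 logarithm is the doubling count.
    """
    if not 0.0 < loss_ratio < 1.0:
        raise ValueError("loss_ratio must lie strictly between 0 and 1")
    return -math.log2(loss_ratio) / alpha


if __name__ == "__main__":
    for ratio in (0.9, 0.8, 0.5):
        print(f"loss x{ratio}: {doublings_needed(ratio):.1f} doublings")
```

With the assumed exponent of 0.05, halving the loss takes 20 doublings of effective compute, about a millionfold increase. That requirement can be met by spending more real resources, by converting resources into logical compute more efficiently, or by both, which is the efficiency game the abstract describes.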

CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

2603.28474v1 by Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

2603.28387v1 by Doan Nam Long Vu, Simone Balloccu

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Mapping data literacy trajectories in K-12 education

2603.28317v1 by Robert Whyte, Manni Cheung, Katharine Childs, Jane Waite, Sue Sentance

Data literacy skills are fundamental in computer science education. However, understanding how data-driven systems work represents a paradigm shift from traditional rule-based programming. We conducted a systematic literature review of 84 studies to understand K-12 learners' engagement with data across disciplines and contexts. We propose the data paradigms framework that categorises learning activities along two dimensions: (i) logic (knowledge-based or data-driven systems), and (ii) explainability (transparent or opaque models). We further apply the notion of learning trajectories to visualize the pathways learners follow across these distinct paradigms. We detail four distinct trajectories as a provocation for researchers and educators to reflect on how the notion of data literacy varies depending on the learning context. We suggest these trajectories could be useful to those concerned with the design of data literacy learning environments within and beyond CS education.

A Survey on AI for 6G: Challenges and Opportunities

2604.02370v1 by Constantina Chatzieleftheriou, Eirini Liotou

As wireless communication evolves, each generation of networks brings new technologies that change how we connect and interact. Artificial Intelligence (AI) is becoming crucial in shaping the future of sixth-generation (6G) networks. By combining AI and Machine Learning (ML), 6G aims to offer high data rates, low latency, and extensive connectivity for applications including smart cities, autonomous systems, holographic telepresence, and the tactile internet. This paper provides a detailed overview of the role of AI in supporting 6G networks. It focuses on key technologies like deep learning, reinforcement learning, federated learning, and explainable AI. It also looks at how AI integrates with essential network functions and discusses challenges related to scalability, security, and energy efficiency, along with new solutions. Additionally, this work highlights perspectives that connect AI-driven analytics to 6G service domains like Ultra-Reliable Low-Latency Communication (URLLC), Enhanced Mobile Broadband (eMBB), Massive Machine-Type Communication (mMTC), and Integrated Sensing and Communication (ISAC). It addresses concerns about standardization, ethics, and sustainability. By summarizing recent research trends and identifying future directions, this survey offers a valuable reference for researchers and practitioners at the intersection of AI and next-generation wireless communication.

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

2603.27918v1 by Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur

Multimodal large language models (MLLMs) integrate information from multiple modalities such as text, images, audio, and video, enabling complex capabilities such as visual question answering and audio translation. While powerful, this increased expressiveness introduces new and amplified vulnerabilities to adversarial manipulation. This survey provides a comprehensive and systematic analysis of adversarial threats to MLLMs, moving beyond enumerating attack techniques to explain the underlying causes of model susceptibility. We introduce a taxonomy that organizes adversarial attacks according to attacker objectives, unifying diverse attack surfaces across modalities and deployment settings. We also present a vulnerability-centric analysis that links integrity attacks, safety and jailbreak failures, control and instruction hijacking, and training-time poisoning to shared architectural and representational weaknesses in multimodal systems. Together, this framework provides an explanatory foundation for understanding adversarial behavior in MLLMs and informs the development of more robust and secure multimodal language systems.

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

2603.27862v1 by Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.

What-If Explanations Over Time: Counterfactuals for Time Series Classification

2603.27792v1 by Udo Schlegel, Thomas Seidl

Counterfactual explanations emerge as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model's prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we provide an open-source library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library's contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.
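
To make the instance-based family mentioned above concrete, here is a minimal sketch of a nearest-unlike-neighbor counterfactual for a 1-nearest-neighbor classifier. It is not the CFTS library's API, which the abstract does not describe; every function name and the toy data are hypothetical, and the two metrics shown are simple stand-ins for the proximity and sparsity dimensions.

```python
import math
from typing import List, Sequence, Tuple

Series = Sequence[float]
Labeled = Tuple[Series, int]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series (proximity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(query: Series, train: List[Labeled]) -> int:
    """Label of the training series closest to the query."""
    return min(train, key=lambda item: euclidean(query, item[0]))[1]


def nearest_unlike_neighbor(query: Series, train: List[Labeled]) -> List[float]:
    """Closest training series whose label differs from the query's prediction.

    Returning a real training instance keeps the counterfactual plausible,
    at the cost of usually changing many time steps.
    """
    predicted = predict_1nn(query, train)
    candidates = [series for series, label in train if label != predicted]
    if not candidates:
        raise ValueError("training set contains only one class")
    return list(min(candidates, key=lambda series: euclidean(query, series)))


def fraction_unchanged(query: Series, counterfactual: Series, tol: float = 1e-9) -> float:
    """Share of time steps left untouched (one simple sparsity measure)."""
    same = sum(1 for x, y in zip(query, counterfactual) if abs(x - y) <= tol)
    return same / len(query)


if __name__ == "__main__":
    # Toy training set: class 0 is flat, class 1 rises at the end.
    train = [([0, 0, 0, 0], 0), ([0, 1, 0, 0], 0), ([0, 0, 5, 5], 1), ([1, 0, 4, 5], 1)]
    query = [0, 0, 1, 0]
    cf = nearest_unlike_neighbor(query, train)
    print("prediction:", predict_1nn(query, train), "counterfactual:", cf)
    print("valid:", predict_1nn(cf, train) != predict_1nn(query, train))
    print("proximity:", round(euclidean(query, cf), 3))
    print("unchanged:", fraction_unchanged(query, cf))
```

On this toy data the counterfactual is valid but alters three of four time steps, which illustrates the trade-off between plausibility and sparsity that the survey uses to compare methods.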

TianJi: An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science

2603.27738v1 by Kaikai Zhang, Xiang Wang, Haoluo Zhao, Nan Chen, Mengyang Yu, Jing-Jia Luo, Tao Song, Fan Meng

Artificial intelligence (AI) has achieved breakthroughs comparable to traditional numerical models in data-driven weather forecasting, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics-oriented mechanism research still heavily relies on domain knowledge and cumbersome engineering operations of human scientists, becoming a bottleneck restricting the efficiency of Earth system science exploration. Here, we propose TianJi - the first "AI meteorologist" system capable of autonomously driving complex numerical models to verify physical mechanisms. Powered by a large language model-driven multi-agent architecture, TianJi can autonomously conduct literature research and generate scientific hypotheses. We further decouple scientific research into cognitive planning and engineering execution: the meta-planner interprets hypotheses and devises experimental roadmaps, while a cohort of specialized worker agents collaboratively complete data preparation, model configuration, and multi-dimensional result analysis. In two classic atmospheric dynamic scenarios (squall-line cold pools and typhoon track deflections), TianJi accomplishes expert-level end-to-end experimental operations with zero human intervention, compressing the research cycle to a few hours. It also delivers detailed result analyses and autonomously judges and explains the validity of the hypotheses from outputs. TianJi reveals that the role of AI in Earth system science is transitioning from a "black-box predictor" to an "interpretable scientific collaborator", offering a new paradigm for high-throughput exploration of scientific mechanisms.

摘要：人工智慧（AI）在數據驅動的天氣預測中取得了可與傳統數值模型相媲美的突破，然而它本質上仍然是統計擬合，並且在揭示大氣的物理因果機制方面面臨困難。以物理為導向的機制研究仍然在很大程度上依賴於領域知識和繁瑣的人類科學家的工程操作，這成為限制地球系統科學探索效率的瓶頸。在此，我們提出了天機——第一個能夠自主驅動複雜數值模型以驗證物理機制的「AI氣象學家」系統。天機由一個大型語言模型驅動的多代理架構提供支持，能夠自主進行文獻研究並生成科學假設。我們進一步將科學研究解耦為認知規劃和工程執行：元規劃者解釋假設並設計實驗路線圖，而一組專門的工作代理協作完成數據準備、模型配置和多維結果分析。在兩個經典的大氣動力學場景（突風線冷池和颱風路徑偏折）中，天機實現了專家級的端到端實驗操作，無需人類干預，將研究周期壓縮至幾小時。它還提供詳細的結果分析，自主判斷並解釋輸出結果中假設的有效性。天機揭示了AI在地球系統科學中的角色正在從「黑箱預測器」轉變為「可解釋的科學合作者」，為高通量科學機制探索提供了一種新範式。

2603.27451v1 by Jakub Bąba, Jarosław A. Chudziak

Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.

摘要：論證挖掘（AM）是自動寫作評估的基礎技術，然而傳統的監督式方法在很大程度上依賴於昂貴的特定領域微調。雖然大型語言模型（LLMs）提供了一種無需訓練的替代方案，但它們常常在結構性歧義上掙扎，無法區分相似的組件，如主張和前提。此外，單一代理的自我修正機制常常受到阿諛奉承的影響，模型強化了自身的初始錯誤，而不是對其進行批判性評估。我們介紹了MAD-ACC（多代理辯論以進行論證組件分類），這是一個利用辯證精煉來解決分類不確定性的框架。MAD-ACC採用支持者-反對者-裁判模型，代理人為模糊文本的衝突解釋辯護，揭示單一代理模型所忽略的邏輯細微差別。在UKP學生論文語料庫上的評估顯示，MAD-ACC達到了85.7%的宏觀F1分數，顯著超越了單一代理推理基準，且不需要特定領域的訓練。此外，與“黑箱”分類器不同，MAD-ACC的辯證方法通過生成可讀的辯論記錄來提供透明且可解釋的替代方案，解釋決策背後的推理。

Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach

2603.27356v1 by Maziar Kianimoghadam Jouneghani

Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.

摘要：識別資訊混亂是困難的，因為對操控的判斷依賴於文化和語言背景。然而，目前的大型語言模型（LLMs）往往表現為單一文化、以英語為中心的「黑箱」，產生流暢的推理卻忽略了本地化的框架。來自多語言資訊混亂（InDor）語料庫的初步證據表明，現有模型在不同社群中解釋操控新聞的一致性上存在困難。為了解決這一差距，本研究提出了一個混合智慧循環（Hybrid Intelligence Loop），這是一個人類在循環中的（HITL）框架，將模型評估建立在母語註釋者撰寫的推理上。這種方法超越了靜態目標語言的少量提示，通過將英語任務指令與從過濾的InDor註釋中動態檢索的目標語言示例配對，實現了上下文學習（ICL）。在初步試點中，示例庫從這些過濾的註釋中生成，並用於比較波斯語和意大利語新聞的靜態與自適應提示。該研究評估了範圍和嚴重性預測、生成推理的質量和文化適宜性，以及不同評估者組之間的模型對齊，為文化根植的可解釋人工智慧提供了一個測試平台。

Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models

2603.27325v1 by Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu

Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.

摘要：準確的傷口分類和邊界分割對於指導慢性和急性傷口管理中的臨床決策至關重要。然而，大多數現有的人工智慧模型都有限，專注於狹窄的傷口類型或僅執行單一任務（分割或分類），這降低了它們的臨床適用性。本研究提出了一個基於YOLOv11的深度學習模型，能同時執行五種臨床相關傷口類型的傷口邊界分割（WBS）和傷口分類（WC）：燒傷（BI）、壓力傷（PI）、糖尿病足潰瘍（DFU）、血管潰瘍（VU）和手術傷口（SW）。為了訓練這兩項任務的模型，創建了一個包含2,963張註釋圖像的傷口類型平衡數據集，並通過分層五折交叉驗證確保了穩健和無偏的評估。在原始未增強數據集上訓練的模型在各折中表現一致，儘管BI檢測的準確性相對較低。因此，通過旋轉、翻轉以及亮度、飽和度和曝光的變化來增強數據集，以幫助模型學習更通用和不變的特徵。這種增強顯著改善了模型的性能，特別是在檢測視覺上微妙的BI案例方面。在測試的變體中，YOLOv11x以0.9341（WBS）和0.8736（WC）的F1分數達到了最高性能，而輕量級的YOLOv11n在較低的計算成本下提供了可比的準確性，使其適合資源有限的部署。通過混淆矩陣和視覺檢測輸出支持，結果確認了模型在複雜背景和高類內變異性下的穩健性，展示了基於YOLOv11的架構在臨床和遠程護理環境中進行準確實時傷口分析的潛力。

MediHive: A Decentralized Agent Collective for Medical Reasoning

2603.27150v1 by Xiaoyang Wang, Christopher C. Yang

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

摘要：大型語言模型（LLMs）已經徹底改變了醫療推理任務，但單一代理系統在處理需要強大不確定性和衝突證據的複雜跨學科問題時，往往表現不佳。利用LLMs的多代理系統（MAS）能夠實現協作智能，但現有的集中式架構在資源有限的環境中面臨可擴展性瓶頸、單點故障和角色混淆的問題。去中心化的多代理系統（D-MAS）通過點對點互動承諾增強自主性和韌性，但其在高風險醫療領域的應用仍然未得到充分探索。我們介紹了MediHive，一個新穎的去中心化多代理框架，用於醫療問題回答，該框架整合了共享記憶池和迭代融合機制。MediHive 部署了基於LLM的代理，這些代理能夠自主自我分配專業角色，進行初步分析，通過條件證據辯論檢測分歧，並在多輪中本地融合同伴見解以達成共識。實證結果表明，MediHive在MedQA和PubMedQA數據集上的表現優於單一LLM和集中基準，分別達到84.3%和78.4%的準確率。我們的工作推進了可擴展、容錯的D-MAS在醫療AI中的應用，解決了集中設計的關鍵限制，同時在推理密集型任務中展示了卓越的性能。

2603.27057v1 by Hossein Salemi, Jitin Krishnan, Hemant Purohit

Attribution theory explains how individuals interpret and attribute others' behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user's goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks (intent detection and theme detection on social media in the disaster domain) when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs (Llama3, Mistral, and Gemma) toward social attribution, and show the effectiveness of our mitigation strategies.

摘要：歸因理論解釋了個體如何在社會背景中解釋和歸因他人的行為，通過運用個人（性格）和非個人（情境）因果關係。大型語言模型（LLMs）在以人類生成的語料庫進行訓練時，可能會在社會情境中隱含地模仿這一社會歸因過程。然而，LLMs在推理中利用這些因果歸因的程度仍然未被深入探討。儘管使用推理範式，如思維鏈（CoT），在各種任務中顯示出有希望的結果，但在推理中忽視社會歸因可能導致LLMs在社會情境中產生偏見的回應。在本研究中，我們調查了將用戶目標作為知識來推斷性格因果關係，以及將消息背景用於推斷情境因果關係對LLM表現的影響。為此，我們引入了一種可擴展的方法，通過基於社交媒體消息的背景和目標，使用社會歸因知識來豐富LLMs的指令提示，從而減輕這些偏見。這種方法在行為分析應用的零樣本分類任務中改善了模型性能，同時減少了LLM在推理中的社會歸因偏見。我們實證展示了我們的方法在兩個任務——災難領域社交媒體上的意圖檢測和主題檢測——中的好處，考慮到災難類型的變異性和社交媒體的多語言性。我們的實驗突顯了三個開源LLMs：Llama3、Mistral和Gemma，對社會歸因的偏見，並展示了我們的緩解策略的有效性。

PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

2603.26324v1 by Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli

Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS--Lector--PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil's regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.

摘要：大多數現有的藥學人工智慧方法將三個在認識論上明顯不同的操作合併為一個單一的技術層：文件保存、語義解釋和上下文呈現。這種混淆是導致反覆出現的脆弱性的根本原因，包括來源丟失、解釋不明、警報疲勞和問責制侵蝕。本文提出PATOS--Lector--PRISMA (PLP) 基礎設施作為負責任的藥學知識管理的規範性信息架構。PATOS以明確的版本控制和來源保存監管文件；Lector實施人機協作的閱讀，產生與主要來源相連的類型化斷言；PRISMA通過RPDA框架（監管、處方、配藥、管理）提供上下文呈現，將相同的信息核心折射成不同的專業視角。該架構引入了證據包作為一個正式的可問責斷言單位（版本化、可追溯、認識論界定且經過策展驗證），其斷言以言外之意的強度為特徵。一個實例追踪了單硫酸二氫鈉在所有三個層面上的應用，使用真實系統數據。在巴西的監管背景下開發和驗證，該架構基於一個操作實施，包含超過16,000份官方文件和38個策展的證據包，涵蓋五種參考藥物。該提案被證明是對操作決策支持系統的補充，提供了當前系統所缺乏的基礎設施條件：文件錨定、解釋透明度和機構問責制。

XpertBench: Expert Level Tasks with Rubrics-Based Evaluation

2604.02368v1 by Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.

摘要：隨著大型語言模型（LLMs）在傳統基準測試中表現出平穩的性能，一個關鍵挑戰依然存在：評估它們在複雜的、開放式任務中的能力，這些任務特徵是真正專家級的認知。現有的框架存在著狹窄的領域覆蓋、依賴於通用任務或自我評估偏見的問題。為了填補這一空白，我們提出了XpertBench，一個高保真基準，旨在評估LLMs在真實專業領域的表現。XpertBench包含1,346個精心策劃的任務，涵蓋80個類別，涉及金融、醫療保健、法律服務、教育以及雙軌研究（STEM和人文科學）。這些任務源自於超過1,000份來自領域專家的提交，包括來自頂尖機構的研究人員和擁有豐富臨床或工業經驗的從業者，確保了卓越的生態有效性。每個任務使用詳細的評分標準，通常有15-40個加權檢查點來評估專業的嚴謹性。為了促進可擴展且與人類對齊的評估，我們引入了ShotJudge，一種新穎的評估範式，利用經過專家少量示例校準的LLM評審，以減少自我獎勵偏見。我們對最先進的LLMs的實證評估顯示出明顯的性能上限：即使是領先模型的最高成功率也僅為約66%，平均分數約為55%。模型還表現出領域特定的差異，在定量推理與語言綜合方面顯示出不重疊的優勢。這些發現強調了當前AI系統中存在的顯著“專家差距”，並確立了XpertBench作為從通用助手過渡到專業合作夥伴的關鍵工具。

Sparse Auto-Encoders and Holism about Large Language Models

2603.26207v1 by Jumbly Grindrod

Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

摘要：大型語言模型（LLM）技術是否暗示了一種元語義圖景，即單詞和複雜表達如何獲得其意義的圖景？一種謙遜的方式探討了似乎內建於LLM如何捕捉語言表達意義的假設，以此來考慮它們的合理性（Grindrod, 2026a, 2026b）。之前有人主張，LLM在採用一種分佈語義形式時，對意義採取了一種整體主義的觀點（Grindrod, 2023; Grindrod et al., forthcoming）。然而，最近在機械可解釋性方面的研究對這些論點提出了挑戰。具體而言，在LLM使用的高維空間中發現大量可解釋的潛在特徵，可能對整體解釋提出挑戰。在本文中，我將介紹認為LLM體現了一種整體主義的原始理由（第1節），然後介紹通過稀疏自編碼器生成的特徵的最新研究，並解釋這些特徵的發現如何暗示了一種替代的意義分解圖景（第2節）。接著，我將通過更詳細地考慮這些特徵的性質來回應這一挑戰（第3節）。最後，我將回到Grindrod等人所辯護的整體圖景，並主張只要這些特徵是可數的，該圖景仍然成立（第4節）。

Concerning Uncertainty -- A Systematic Survey of Uncertainty-Aware XAI

2603.26838v1 by Helena Löfström, Tuwe Löfström, Anders Hjort, Fatima Rabia Yapicioglu

This paper surveys uncertainty-aware explainable artificial intelligence (UAXAI), examining how uncertainty is incorporated into explanatory pipelines and how such methods are evaluated. Across the literature, three recurring approaches to uncertainty quantification emerge (Bayesian, Monte Carlo, and Conformal methods), alongside distinct strategies for integrating uncertainty into explanations: assessing trustworthiness, constraining models or explanations, and explicitly communicating uncertainty. Evaluation practices remain fragmented and largely model centered, with limited attention to users and inconsistent reporting of reliability properties (e.g., calibration, coverage, explanation stability). Recent work leans towards calibration, distribution free techniques and recognizes explainer variability as a central concern. We argue that progress in UAXAI requires unified evaluation principles that link uncertainty propagation, robustness, and human decision-making, and highlight counterfactual and calibration approaches as promising avenues for aligning interpretability with reliability.

摘要：這篇論文調查了具不確定性感知的可解釋人工智慧（UAXAI），檢視不確定性如何融入解釋流程以及這些方法如何被評估。在文獻中，出現了三種重複的不確定性量化方法（貝葉斯、蒙地卡羅和符合方法），以及將不確定性整合到解釋中的不同策略：評估可信度、限制模型或解釋，以及明確傳達不確定性。評估實踐仍然是片段化的，並且主要集中在模型上，對用戶的關注有限，且可靠性特徵的報告不一致（例如，校準、覆蓋率、解釋穩定性）。最近的研究傾向於校準、無分佈技術，並將解釋者的變異性視為一個核心問題。我們認為，UAXAI的進展需要統一的評估原則，將不確定性傳播、穩健性和人類決策聯繫起來，並強調反事實和校準方法作為將可解釋性與可靠性對齊的有希望的途徑。
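
Of the three uncertainty-quantification families named in the abstract above, conformal methods are the easiest to show in a few lines. The following is a minimal sketch of split conformal prediction for regression, not code from the survey: the calibration values and the function names `conformal_quantile` and `prediction_interval` are assumptions made for illustration. Under the usual exchangeability assumption, intervals built this way cover the true value with probability at least 1 - alpha.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample quantile used by split conformal prediction.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, quantile):
    """Symmetric interval around a point prediction."""
    return point_prediction - quantile, point_prediction + quantile


# Hypothetical held-out calibration set: true values and model predictions.
calibration_truth = [11, 8, 13, 6, 15, 4, 17, 2, 19, 0]
calibration_preds = [10] * len(calibration_truth)

# Nonconformity score: absolute residual on each calibration point.
scores = [abs(y - p) for y, p in zip(calibration_truth, calibration_preds)]

q = conformal_quantile(scores, alpha=0.2)
low, high = prediction_interval(10, q)
print(f"quantile = {q}, interval for a new prediction of 10 = [{low}, {high}]")
```

The width of the interval is the uncertainty signal that an explanation could then report or use as a constraint; how to communicate it to users is the open evaluation question the survey raises.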

SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

2603.26122v1 by Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases, which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.

摘要：雖然近期在大型語言模型方面的進展顯著推進了皮膚科診斷，但單一的 LLM 在細粒度、大規模多類別診斷任務和罕見皮膚疾病診斷方面經常面臨困難，這主要是由於訓練數據的稀疏性，同時也缺乏臨床推理所需的可解釋性和可追溯性。儘管多代理系統可以提供更透明和可解釋的診斷，但現有框架主要集中在視覺問答和對話任務上，並且對靜態知識庫的高度依賴限制了其在複雜現實臨床環境中的適應性。在此，我們提出了 SkinGPT-X，一個多模態協作多代理系統，專為皮膚科診斷而設，並整合了自我演化的皮膚科記憶機制。通過模擬皮膚科醫生的診斷工作流程並實現持續的記憶演變，SkinGPT-X 提供了透明且值得信賴的診斷，以管理複雜和罕見的皮膚科病例。為了驗證 SkinGPT-X 的穩健性，我們設計了一個三級比較實驗。首先，我們將 SkinGPT-X 與四個最先進的 LLM 在四個公共數據集上進行基準測試，顯示其在 DDI31 上的準確率提高了 +9.6%，在 Dermnet 上的加權 F1 分數提高了 +13%。其次，我們構建了一個涵蓋 498 種不同皮膚科類別的大型多類別數據集，以評估其細粒度分類能力。最後，我們整理了罕見皮膚疾病數據集，這是首個針對臨床罕見皮膚疾病稀缺問題的基準，包含 564 份臨床樣本，涵蓋八種罕見皮膚病。在這個數據集上，SkinGPT-X 實現了 +9.8% 的準確率提高，+7.1% 的加權 F1 提高，以及 +10% 的 Cohen's Kappa 提高。

DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction

2603.26114v1 by Magnus H. Strømme, Alex G. C. de Sá, David B. Ascher

Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the challenge of modelling the interplay between molecular structure and cellular context. In cancer research, this is acute due to tumour heterogeneity and genomic variability, which hinder the identification of effective therapies. Conventional approaches often fail to capture non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and the quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer demonstrated superior performance, achieving an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model achieved Pearson's correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for prioritising drug candidates. Furthermore, DPD-Cancer provides explainability by leveraging the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: https://biosig.lab.uq.edu.au/dpd_cancer/.

摘要：準確的藥物反應預測是計算生物化學中的一個關鍵瓶頸，受到分子結構與細胞環境之間相互作用建模挑戰的限制。在癌症研究中，這一問題尤為嚴重，因為腫瘤的異質性和基因組變異性妨礙了有效療法的識別。傳統方法往往無法捕捉化學特徵與生物學結果之間的非線性關係，尤其是在多樣的細胞系中。為了解決這一問題，我們推出了DPD-Cancer，這是一種基於圖注意力Transformer（Graph Attention Transformer, GAT）框架的深度學習方法。它旨在進行小分子抗癌活性分類以及細胞系特異性反應的定量預測，特別是生長抑制濃度（pGI50）。在與最先進的方法（pdCSM-cancer、ACLPred 和 MLASM）的基準測試中，DPD-Cancer展現了卓越的性能，在嚴格劃分的NCI60數據上達到高達0.87的ROC曲線下面積（AUC），在ACLPred/MLASM數據集上達到高達0.98。針對10種癌症類型和73個細胞系的pGI50預測，該模型在獨立測試集上達到了高達0.72的皮爾森相關係數。這些發現確認了基於注意力的機制在提取有意義的分子表示方面提供了顯著優勢，確立了DPD-Cancer作為優先選擇藥物候選者的競爭工具。此外，DPD-Cancer通過利用注意力機制來識別和可視化特定的分子子結構，提供了解釋性，為先導優化提供可行的見解。DPD-Cancer作為網頁伺服器免費提供，網址為：https://biosig.lab.uq.edu.au/dpd_cancer/。

A Regression Framework for Understanding Prompt Component Impact on LLM Performance

2603.26830v1 by Andrew Lauziere, Jonathan Daugherty, Taisa Kushner

As large language models (LLMs) continue to improve and see further integration into software systems, so does the need to understand the conditions in which they will perform. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends previous explainable artificial intelligence (XAI) methods specifically to inspect LLMs by fitting regression models relating portions of the prompt to LLM evaluation. We apply our method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, leverage the prompt to perform a simple arithmetic problem. Regression models of individual prompt portions explain 72% and 77% of variation in model performances, respectively. We find misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, though positive examples do not find significant variability in the impact of positive and negative instructions - these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios to gain granular insight into how the prompt influences an LLM to solve a task.

摘要：隨著大型語言模型（LLMs）不斷改進並進一步融入軟體系統，了解它們在何種條件下表現的需求也隨之增加。我們貢獻了一個統計框架，以理解特定提示特徵對LLM性能的影響。這種方法擴展了以前的可解釋人工智慧（XAI）方法，專門用於檢查LLMs，通過擬合回歸模型將提示的部分與LLM評估相關聯。我們應用我們的方法來比較兩個開源模型，Mistral-7B和GPT-OSS-20B，如何利用提示來解決一個簡單的算術問題。個別提示部分的回歸模型分別解釋了模型性能變異的72%和77%。我們發現，以不正確的示例查詢-回答對的形式出現的錯誤信息，妨礙了這兩個模型解決算術查詢，儘管正面示例在正面和負面指令的影響上並未顯示出顯著變異——這些提示對模型性能有矛盾的影響。該框架作為關鍵情境中決策者的工具，提供了有關提示如何影響LLM解決任務的細緻見解。
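
A minimal sketch of the regression idea described in the abstract above, not the authors' implementation. Each prompt is encoded by binary indicators for the components it contains, and an ordinary least-squares fit relates those indicators to an evaluation score. The component names, the synthetic accuracy values, and the helpers `solve_linear_system` and `fit_ols` are all assumptions made for illustration.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


def fit_ols(features, scores):
    """Return (coefficients, r_squared); coefficients[0] is the intercept."""
    rows = [[1.0] + [float(v) for v in f] for f in features]
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean = sum(scores) / len(scores)
    ss_res = sum((s - f) ** 2 for s, f in zip(scores, fitted))
    ss_tot = sum((s - mean) ** 2 for s in scores)
    return coef, 1.0 - ss_res / ss_tot


# Hypothetical prompt components, one binary indicator each.
COMPONENTS = ("correct_example", "incorrect_example", "instruction")
prompts = list(product([0, 1], repeat=len(COMPONENTS)))

# Synthetic accuracy per prompt; the interaction term keeps R^2 below 1.
accuracy = [0.5 + 0.2 * c - 0.4 * i + 0.1 * c * i for c, i, _ in prompts]

coefficients, r_squared = fit_ols(prompts, accuracy)
for name, value in zip(("intercept",) + COMPONENTS, coefficients):
    print(f"{name:>18}: {value:.3f}")
print(f"R^2 = {r_squared:.3f}")
```

In this toy setting the coefficient on `incorrect_example` comes out negative, and R^2 plays the role of the share of performance variation explained by the prompt components, which is the quantity the abstract reports as 72% and 77%.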

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

2603.26008v1 by Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

摘要：雖然在圖像條件生成方面具有強大能力，多模態大型語言模型（MLLMs）在不同人口群體之間的表現卻可能不均衡，凸顯了公平風險。在安全至關重要的臨床環境中，這種差異可能導致不平等的診斷敘事，並侵蝕對AI輔助決策的信任。儘管公平性在僅限於視覺和僅限於語言的模型中已被廣泛研究，但其對MLLMs的影響仍然大多未被探討。為了解決這些偏見，我們引入了FairLLaVA，一種參數高效的微調方法，能在不妥協整體性能的情況下減輕視覺指令調整中的群體差異。通過最小化目標屬性之間的互信息，FairLLaVA使模型的表示變得與人口統計無關。該方法可以作為輕量級插件納入，保持低秩適配器微調的效率，並提供一種與架構無關的公平視覺指令跟隨方法。在大規模胸部放射報告生成和皮膚鏡視覺問題回答基準上的廣泛實驗表明，FairLLaVA持續減少群體間的差異，同時提高各種醫學影像模式下的公平性縮放臨床表現和自然語言生成質量。代碼可在 https://github.com/bhosalems/FairLLaVA 獲取。

Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

2603.25975v1 by Peter Balogh

We show that they do. Schank's conceptual dependency theory proposed that all events decompose into primitive operations -- ATRANS, PTRANS, MTRANS, and others -- hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank's: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators ("mail" = ATRANS + PTRANS) and novel emotional state operators absent from Schank's taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank's hand-coded primitives while explaining 100% of events vs. Schank's 81%. On ATOMIC, results are more dramatic: Schank's primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank's core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed -- with mental/emotional operators dominating in naturalistic data.

摘要：我們展示了它們確實存在。Schank 的概念依賴理論提出所有事件都可分解為原始操作——ATRANS、PTRANS、MTRANS 等——這些操作是基於語言直覺手動編碼的。是否可以僅通過壓縮壓力自動發現相同的原始操作？
我們將 DreamCoder 的醒眠庫學習調整為事件狀態轉換。給定事件作為前後世界狀態對，我們的系統找到解釋每個事件的操作組合（醒），然後提取作為新操作的重複模式，這些模式在最小描述長度下進行優化（眠）。從四個通用原始操作開始，它發現直接映射到 Schank 的操作：MOVE_PROP_has = ATRANS、CHANGE_location = PTRANS、SET_knows = MTRANS、SET_consumed = INGEST，還有複合操作（“mail” = ATRANS + PTRANS）和 Schank 的分類中缺失的新情感狀態操作。
我們在合成事件和來自 ATOMIC 知識圖譜的現實世界常識數據上進行驗證。在合成數據上，發現的操作在解釋 100% 事件的同時，實現了與 Schank 的手動編碼原始操作在 4% 內的貝葉斯 MDL，而 Schank 只解釋了 81%。在 ATOMIC 上，結果更為戲劇性：Schank 的原始操作僅解釋了 10% 的自然事件，而發現的庫解釋了 100%。主導操作不是物理行動原始操作，而是心理和情感狀態變化——CHANGE_wants（20%）、CHANGE_feels（18%）、CHANGE_is（18%）——這些在 Schank 的原始分類中都不存在。
這些結果提供了第一個實證證據，表明事件原始操作可以從壓縮壓力中推導出來，Schank 的核心原始操作在信息理論上是有正當理由的，並且完整的清單在提出的基礎上實質上更為豐富——在自然數據中，心理/情感操作占主導地位。
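
The representation described in the abstract above (events as before/after world-state pairs, explained by compositions of operators) can be sketched as follows. This illustrates only the data representation and the notion of an operator composition "explaining" an event. It does not implement the wake-sleep library learning or the MDL scoring. The operator names follow the abstract, but their signatures, the state layout, and the `explains` helper are hypothetical.

```python
import copy


def change_location(state, entity, place):
    """CHANGE_location (PTRANS in the abstract): move an entity to a place."""
    new_state = copy.deepcopy(state)
    new_state["location"][entity] = place
    return new_state


def move_prop_has(state, item, giver, receiver):
    """MOVE_PROP_has (ATRANS in the abstract): transfer possession of an item."""
    new_state = copy.deepcopy(state)
    new_state["has"][giver].discard(item)
    new_state["has"][receiver].add(item)
    return new_state


def mail(state, item, sender, recipient, destination):
    """Compound operator: ATRANS followed by PTRANS."""
    transferred = move_prop_has(state, item, sender, recipient)
    return change_location(transferred, item, destination)


def explains(program, before, after):
    """A program explains an event if it maps the before-state to the after-state."""
    return program(before) == after


# One hypothetical event: Alice mails a letter to Bob.
before = {
    "location": {"letter": "alice_home"},
    "has": {"alice": {"letter"}, "bob": set()},
}
after = {
    "location": {"letter": "bob_home"},
    "has": {"alice": set(), "bob": {"letter"}},
}


def mail_program(state):
    return mail(state, "letter", "alice", "bob", "bob_home")


def ptrans_only(state):
    return change_location(state, "letter", "bob_home")


print("mail explains event:", explains(mail_program, before, after))
print("PTRANS alone explains event:", explains(ptrans_only, before, after))
```

A library-learning system in the spirit of the abstract would search over such compositions and keep the recurring ones as new operators. Here the single compound operator is written by hand.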

Methods for Knowledge Graph Construction from Text Collections: Development and Applications

2603.25862v1 by Vanni Zavarella

Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

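Summary: Nearly every sector of society is seeing rapid growth in unstructured text, from news and social media interactions to open-access scholarly communication, digital health records, and online drug reviews. Extracting rich semantic knowledge from this data calls for scalable, flexible automatic methods that adapt across text genres and schema specifications, coupled with Semantic Web techniques to build Knowledge Graphs that are semantically transparent, explainable by design, and interoperable. The thesis applies Natural Language Processing, Machine Learning, and Generative AI methods to automatic Knowledge Graph construction from large text corpora in three use cases: analysis of the Digital Transformation discourse in global news and social media; mapping and trend analysis of recent research in Architecture, Engineering, Construction and Operations from a large corpus of publications; and generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. Its contributions are benchmark evaluation results, customized algorithms, and data resources in the form of Knowledge Graphs, together with analyses built on top of them.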

A Compression Perspective on Simplicity Bias

2603.25839v1 by Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar

Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.

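Summary: Deep neural networks exhibit a simplicity bias, a well-documented tendency to prefer simple functions over complex ones. This work examines the phenomenon through the Minimum Description Length principle, formalizing supervised learning as optimal two-part lossless compression. The theory explains how simplicity bias governs feature selection through a trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). It predicts that, as training data grows, learners move from simple spurious shortcuts to complex features only when the reduction in data encoding cost justifies the added model complexity. This yields regimes where more data promotes robustness by ruling out trivial shortcuts, and regimes where limiting data acts as complexity-based regularization that prevents learning unreliable complex environmental cues. Experiments on a semi-synthetic benchmark show that neural network feature selection follows the same trajectory of solutions as optimal two-part compressors.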
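
The trade-off described in this abstract can be illustrated with a toy two-part code. The sketch below is not the authors' formalization: the two hypotheses, their model costs in bits, and their error rates are made-up values, and the data cost is approximated by the binary entropy of the error rate per sample. It only shows how the preferred hypothesis can switch from a cheap shortcut to a costlier feature once enough data is available.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Bits per label needed to correct predictions that are wrong with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def two_part_code_length(model_bits: float, error_rate: float, n_samples: int) -> float:
    """Total description length: cost of the hypothesis plus cost of the data given it."""
    return model_bits + n_samples * binary_entropy_bits(error_rate)


# Hypothetical hypotheses (illustrative numbers, not taken from the paper).
HYPOTHESES = {
    "simple_shortcut": {"model_bits": 10.0, "error_rate": 0.20},
    "complex_feature": {"model_bits": 200.0, "error_rate": 0.02},
}


def preferred_hypothesis(n_samples: int) -> str:
    """Return the hypothesis with the shortest two-part code for n_samples labels."""
    return min(
        HYPOTHESES,
        key=lambda name: two_part_code_length(n_samples=n_samples, **HYPOTHESES[name]),
    )


if __name__ == "__main__":
    for n in (50, 200, 400, 1000):
        print(n, preferred_hypothesis(n))
```

With these assumed numbers the shortcut wins for small samples and the complex feature wins for large ones, because the per-sample saving eventually outweighs the extra 190 bits of model cost.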

Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

2603.25821v1 by Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

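Summary: Doctorina MedBench is a comprehensive evaluation framework for agent-based medical AI built on simulated, realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on standardized test questions, it models a multi-step clinical dialogue in which a physician or an AI system must collect the medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. Performance is scored with the D.O.T.S. metric, which has four components (Diagnosis, Observations/Investigations, Treatment, and Step Count) and so captures both clinical correctness and dialogue efficiency. The system also includes a multi-level testing and quality monitoring architecture designed to detect model degradation during development and deployment, and it supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. Because the metrics are universal, the framework can be used to assess medical AI systems, to evaluate physicians, and to support the development of clinical reasoning skills. The results suggest that simulated clinical dialogue may give a more realistic assessment of clinical competence than examination-style benchmarks.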

DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

2603.25607v1 by Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu

The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall κ: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

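Summary: The widespread adoption of CT has notably increased the number of detected lung nodules, yet current deep learning methods for classifying benign and malignant nodules often fail to integrate global and local features comprehensively, and most have not been validated in clinical trials. DeepFAN is a transformer-based model trained on over 10K pathology-confirmed nodules and evaluated in a multi-reader, multi-case clinical trial of its efficacy in assisting junior radiologists. It achieved an AUC of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset of 400 cases from three independent medical institutions. Explainability analysis indicated higher contributions from global than from local features. The average performance of twelve readers improved significantly, by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.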

From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

2603.25423v1 by Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo

The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse. Meanwhile, most detection models lack fine-grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large-scale benchmark of over 10,000 real-world micro-videos covering diverse misinformation types and sources, each annotated with expert-defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi-inspired multi-agent reasoning framework that integrates multimodal understanding with external evidence for attribution-grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI-generated patterns, and detect out-of-context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro-video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.

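Summary: The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type and overlook real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse, while most detection models lack fine-grained attribution, which limits interpretability and practical utility. WildFakeBench is a large-scale benchmark of over 10,000 real-world micro-videos covering diverse misinformation types and sources, each annotated with expert-defined attribution labels. FakeAgent is a Delphi-inspired multi-agent reasoning framework that integrates multimodal understanding with external evidence for attribution-grounded analysis: it jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI-generated patterns, and detect out-of-context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, and that WildFakeBench is a realistic and challenging testbed for explainable micro-video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.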

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

2603.25403v2 by Eyal Hadad, Mordechai Guri

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

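Summary: On-device Vision-Language Models (VLMs) promise data privacy through local execution, but the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes an image into a variable number of patches based on its aspect ratio, which creates workload-dependent inputs. The paper demonstrates a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker exploits significant execution-time variations, observed through standard unprivileged OS metrics, to reliably fingerprint the input's geometry. In Tier 2, profiling Last-Level Cache (LLC) contention resolves semantic ambiguity within identical geometries, distinguishing visually dense content (e.g., medical X-rays) from sparse content (e.g., text documents). Evaluations on state-of-the-art models such as LLaVA-NeXT and Qwen2-VL show that combining these signals enables reliable inference of privacy-sensitive contexts. The paper also analyzes the security engineering trade-offs of mitigation, reveals substantial performance overhead for constant-work padding, and proposes practical design recommendations for secure Edge AI deployments.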
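
To make the source of the leak concrete, the following sketch shows how an aspect-ratio-dependent tiling rule turns input geometry into a variable patch count, and therefore a variable workload. It is a simplified stand-in, not the actual AnyRes procedure or any model's preprocessing code: the candidate grids and the "tiles plus one global view" rule are assumptions chosen for illustration.

```python
import math

# Hypothetical candidate tilings as (columns, rows); real models use their own lists.
CANDIDATE_GRIDS = [(2, 2), (2, 1), (1, 2), (3, 1), (1, 3)]


def select_grid(width: int, height: int) -> tuple:
    """Pick the candidate grid whose aspect ratio is closest (in log space) to the image's."""
    aspect = width / height
    return min(
        CANDIDATE_GRIDS,
        key=lambda grid: abs(math.log(aspect) - math.log(grid[0] / grid[1])),
    )


def patch_count(width: int, height: int) -> int:
    """Number of encoder inputs: one tile per grid cell plus one downscaled global view."""
    cols, rows = select_grid(width, height)
    return cols * rows + 1


if __name__ == "__main__":
    for size in [(1000, 1000), (2000, 1000), (3000, 1000), (1000, 3000)]:
        print(size, select_grid(*size), patch_count(*size))
```

Under these assumptions a square image costs five encoder passes while a 2:1 image costs three, so anything correlated with work done (such as execution time) reveals the geometry class of the input.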

4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles

2603.25356v1 by Yunus E. Zeytuncu

Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. We investigate the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. We formalize the problem and develop an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. Using this solver, we construct a dataset of over 3.4 million instances and define difficulty via the minimum number of operations required to reach a target. We analyze the relationship between difficulty and solver-derived features. While baseline machine learning models based on bag- and target-level statistics can partially predict solvability, they fail to reliably distinguish easy instances. In contrast, we show that difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses. In particular, the number of input values used in a minimal construction serves as a minimal sufficient statistic for difficulty under this labeling. These results provide a transparent, computationally grounded account of puzzle difficulty that bridges symbolic reasoning and data-driven modeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.

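Summary: Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. The paper investigates the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. It formalizes the problem and develops an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. With this solver the author builds a dataset of over 3.4 million instances and defines difficulty as the minimum number of operations required to reach a target. Baseline machine learning models based on bag- and target-level statistics can partially predict solvability but fail to reliably distinguish easy instances. In contrast, difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses; in particular, the number of input values used in a minimal construction is a minimal sufficient statistic for difficulty under this labeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.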
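
A solver of the kind described can be sketched as a dynamic program over subsets of the input values. The version below is an illustration under assumed rules, not the paper's solver: each input may be used at most once, the four operations are +, -, *, and exact integer division, and intermediate results may be any integer. Under these rules a construction that uses k inputs needs exactly k - 1 operations, which is why the number of inputs used determines the difficulty label.

```python
def reachable_values(values):
    """Map each non-empty subset (bitmask) to the set of values obtainable using all of it."""
    n = len(values)
    reach = {}
    for mask in range(1, 1 << n):
        if mask & (mask - 1) == 0:  # exactly one input selected
            reach[mask] = {values[mask.bit_length() - 1]}
            continue
        results = set()
        sub = (mask - 1) & mask  # enumerate proper, non-empty submasks
        while sub:
            other = mask ^ sub
            if sub < other:  # visit each unordered split once
                for a in reach[sub]:
                    for b in reach[other]:
                        results.update((a + b, a * b, a - b, b - a))
                        if b != 0 and a % b == 0:
                            results.add(a // b)
                        if a != 0 and b % a == 0:
                            results.add(b // a)
            sub = (sub - 1) & mask
        reach[mask] = results
    return reach


def min_operations(values, target):
    """Minimum number of operations needed to reach target, or None if unreachable."""
    best = None
    for mask, results in reachable_values(values).items():
        if target in results:
            ops = bin(mask).count("1") - 1  # k inputs combined with k - 1 operations
            if best is None or ops < best:
                best = ops
    return best


if __name__ == "__main__":
    bag = [2, 3, 7]
    for target in (7, 10, 17, 1000):
        print(target, min_operations(bag, target))
```

The subset table grows exponentially in the number of inputs, so this sketch is only practical for small bags such as those used in number games.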

Evaluating Language Models for Harmful Manipulation

2603.25326v3 by Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

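Summary: Interest in AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. The paper introduces a framework for evaluating harmful AI manipulation through context-specific human-AI interaction studies, and illustrates it by assessing an AI model with 10,101 participants across three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). The tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, can induce belief and behaviour changes in participants. Context matters: manipulation differs between domains, so it needs to be evaluated in the high-stakes contexts in which an AI system is likely to be used. Results also differ significantly across the tested geographies, so findings from one region may not generalise to others. The frequency of manipulative behaviours (propensity) does not consistently predict the likelihood of manipulative success (efficacy), which underscores the importance of studying the two dimensions separately. The authors detail their testing protocols, make relevant materials publicly available, and discuss open challenges in evaluating harmful manipulation by AI models.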

DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers

2603.25293v1 by Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu

Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.

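Summary: Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains, but datasets of real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. The paper studies Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. The problem is hard because a document may admit multiple plausible abstractions, the intended structure is often implicit, and supporting evidence is scattered across prose, equations, captions, and figures. The authors use scientific papers containing explicit DAG figures as a natural source of supervision, where the figure provides the structure and the accompanying text provides context and explanation. DAGverse is a framework for constructing document-grounded semantic DAGs from online scientific papers; its core component, DAGverse-Pipeline, is a semi-automatic system that produces high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study on causal DAGs, the authors release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation.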

Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

2603.25251v1 by Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp

Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.

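Summary: Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. The authors ran a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions from explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness reduced the proportion of participants who learned the decision pattern. Even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. Not all differences in functional correctness translate into differences in human understanding, so functional metrics need to be validated against human outcomes.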

Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

2603.25146v1 by Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

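Summary: Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development, while concerns about the quality, reliability, and security of AI-generated code are increasingly reported in academia and industry. Objective: The study systematically synthesizes empirical evidence on the factors that influence the quality of AI-generated source code and analyzes how they affect software quality outcomes across evaluation contexts. Method: A systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight, selected 24 primary studies through structured search and screening across major digital libraries; data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: Code quality in AI-assisted development is shaped by a combination of human factors, AI system characteristics, and human-AI interaction dynamics, with prompt design, task specification, and developer expertise as key factors. Quality outcomes such as correctness, security, maintainability, and complexity vary across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation is a socio-technical shift in software engineering in which high-quality outcomes depend on both technological and human factors, and AI-generated code requires careful validation and integration into development workflows.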

An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks

2603.25070v1 by Syed Rayhan Masud, SK Muktadir Hossain, Md. Ridoy Sarkar, Mohammad Sakib Mahmood, Md. Kishor Morol, Rakib Hossain Sajib

Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and hence requires advanced data-driven crop classification and recommendation solutions. This work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks to improve crop suitability predictions based on soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). With a dataset comprising 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the paradigm leverages preprocessing methods such as label encoding, outlier removal using IQR, normalization through StandardScaler, and SMOTE for balancing classes. A range of machine learning models such as Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine are compared, with hyperparameter tuning through Grid Search and cross-validation. The suggested "Final Ensemble" meta-ensemble design achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models such as K-Nearest Neighbors (95.56% accuracy). Explainable AI methods, such as SHAP and permutation importance, offer actionable insights, highlighting critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.

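Summary: Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and therefore needs advanced data-driven crop classification and recommendation solutions. The work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks to improve crop suitability predictions from soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). It uses a dataset of 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, with preprocessing that includes label encoding, IQR-based outlier removal, normalization through StandardScaler, and SMOTE for class balancing. Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine are compared, with hyperparameters tuned through Grid Search and cross-validation. The proposed "Final Ensemble" meta-ensemble reaches 98.80% accuracy, precision, recall, and F1-score, compared with individual models such as K-Nearest Neighbors (95.56% accuracy). SHAP and permutation importance highlight critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.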
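
Two of the preprocessing steps named in this abstract, IQR-based outlier removal and standardization, can be sketched with the Python standard library alone. This is an illustration, not the authors' pipeline: the 1.5 multiplier is the conventional fence and is an assumption here, the sample readings are made up, and the paper itself reports using scikit-learn's StandardScaler (which, like the code below, divides by the population standard deviation). Label encoding and SMOTE are omitted.

```python
import statistics


def remove_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the conventional choice."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Rescale to zero mean and unit variance using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    readings = [1, 2, 3, 4, 5, 100]  # made-up feature column with one extreme value
    kept = remove_iqr_outliers(readings)
    print(kept)
    print(standardize(kept))
```

Outliers are removed before scaling so that a single extreme value does not inflate the standard deviation used to rescale the remaining rows.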

Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

2603.24999v2 by Michael Hardy, Joshua Gilbert, Benjamin Domingue

The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $τ$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and show that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic $R^2$ consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.

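Summary: The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. The paper introduces a family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $τ$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. The signed isotonic $R^2$ is extremal among monotone predictors, and this optimality translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, it consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. The method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification.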
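
A pairwise coefficient in the spirit of the signed isotonic $R^2$ can be sketched as follows. The abstract does not give the exact estimator, so every detail here is an assumption: the direction is taken from the sign of Kendall's concordant-minus-discordant count (ties contribute zero, and a zero count is treated as positive), tied predictor values are pooled before fitting, the monotone fit uses the pool-adjacent-violators algorithm, and the coefficient is the ordinary $R^2$ of that fit multiplied by the direction.

```python
def pava(means, weights):
    """Weighted pool-adjacent-violators: best non-decreasing fit to a sequence of means."""
    blocks = []  # each block is [fitted mean, total weight, number of groups pooled]
    for m, w in zip(means, weights):
        blocks.append([m, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for m, _, count in blocks:
        fitted.extend([m] * count)
    return fitted


def kendall_sign(x, y):
    """+1 or -1 according to the sign of concordant minus discordant pairs (0 maps to +1)."""
    s = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 1 if s >= 0 else -1


def signed_isotonic_r2(x, y):
    """R^2 of the best monotone fit of y on x, signed by the direction of association."""
    direction = kendall_sign(x, y)
    xs = [direction * v for v in x]  # flip x so an increasing fit covers both directions
    groups = {}
    for xv, yv in zip(xs, y):
        groups.setdefault(xv, []).append(yv)
    keys = sorted(groups)
    means = [sum(groups[k]) / len(groups[k]) for k in keys]
    weights = [len(groups[k]) for k in keys]
    fitted = dict(zip(keys, pava(means, weights)))
    y_bar = sum(y) / len(y)
    sst = sum((v - y_bar) ** 2 for v in y)
    if sst == 0:
        return 0.0  # a constant item carries no variance to explain
    sse = sum((yv - fitted[xv]) ** 2 for xv, yv in zip(xs, y))
    return direction * (1 - sse / sst)


if __name__ == "__main__":
    print(signed_isotonic_r2([1, 2, 3, 4], [1, 4, 9, 16]))
    print(signed_isotonic_r2([1, 2, 3, 4], [16, 9, 4, 1]))
    print(signed_isotonic_r2([1, 2, 3], [0, 1, 0]))
```

Because the fit is only required to be monotone, a perfectly nonlinear but increasing relationship still scores 1, which a linear correlation would understate. The pairwise loop in `kendall_sign` is quadratic in the number of respondents and is written for clarity rather than speed.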

Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators

2603.24986v1 by Ray-Yuan Chung, Xuhai Xu, Ari Pollack

Large language model based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.

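Summary: Health agents based on large language models are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. Most AI systems in healthcare, however, operate in siloed configurations that support individual users rather than the multi-stakeholder relationships central to healthcare, which can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. The authors reframe AI not as a standalone assistant but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional case study of pediatric chronic kidney disease, they show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. They propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.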

Self-Corrected Image Generation with Explainable Latent Rewards

2603.24965v1 by Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.

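Summary: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. The difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output, whereas evaluating generated images is more tractable. Motivated by this asymmetry, the authors propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, which enables continuous latent-level guidance from non-differentiable image-level evaluations and lets the model understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.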

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

2603.24961v1 by Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

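Summary: Assessing students' handwritten scratchwork is crucial for personalized educational feedback but is difficult because of diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP focuses mainly on textual responses and neglects the complexity and multimodality of authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing correct answers over diagnosing student errors. ScratchMath is a benchmark designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. It comprises 1,720 mathematics samples from Chinese primary and middle school students and supports two tasks, Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset was annotated through human-machine collaboration involving multiple stages of expert labeling, review, and verification. A systematic evaluation of 16 leading MLLMs reveals significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, and large reasoning models show strong potential for error explanation. All evaluation data and frameworks are publicly available.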

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

2603.26798v1 by Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
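
The first step described above (extracting a binary hierarchy by agglomerative clustering of class centroids) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy centroids, the cosine distance, the centroid-style linkage, and the function names are all assumptions made here, since the abstract does not specify them.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_binary_hierarchy(centroids):
    """Greedy agglomerative clustering over class centroids.

    centroids: dict mapping class name -> embedding (list of floats).
    Returns a binary tree as nested 2-tuples of class names.
    Linkage (assumed): distance between the mean vectors of two clusters.
    """
    # Each cluster is (subtree, member vectors).
    clusters = [(name, [vec]) for name, vec in centroids.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_distance(mean_vector(clusters[i][1]),
                                    mean_vector(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical 3-d "class centroids"; real VLM embeddings are much larger.
toy_centroids = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.9],
}
tree = build_binary_hierarchy(toy_centroids)
print(tree)
```

On these toy vectors the two animal classes and the two vehicle classes are merged first, so each internal node could then be named by matching it against a concept bank, as the abstract describes.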

Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

2603.24898v1 by Vasu Srinivasan, Dhriti Vasu

We present a Sovereign AI architecture for clinical triage in which all inference is performed on-device and inbound data is delivered via a physically unidirectional channel, implemented using receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction, rather than attempting to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, demonstrating how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.

More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes

2603.24877v1 by Venkatesh Sivaraman, Patrick Vossler, Adam Perer, Julian Hong, Jean Feng

Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally-designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.

A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

2603.24828v1 by Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross

Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.

PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI

2603.26794v1 by Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi

MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.

Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

2603.24801v1 by Abu Noman Md Sakib, Merjulah Roby, Zijie Zhang, Satish Muluk, Mark K. Eskandari, Ender A. Finol

Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

2603.24549v1 by Dana Serditova, Kevin Tang

Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.

No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

2603.24524v1 by Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo

Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

2603.24481v1 by John Ray B. Martinez

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
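
The calibration metric reported above, expected calibration error (ECE), can be computed as follows. This is a generic sketch of the standard equal-width-bin definition, not the paper's evaluation code; the bin count and the example inputs are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted confidence in [0, 1] for each answer.
    correct: booleans, True where the answer was right.
    ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 95% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# A calibrated model: 50% stated confidence, 50% actual accuracy.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, calibrated)
```

A lower ECE means stated confidence tracks accuracy more closely, which is what makes the confidence usable as a deferral signal.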

Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice

2603.24448v1 by Domenique Zipperling, Lukas Schmidt, Benedikt Hahn, Niklas Kühl, Steven Kimbrough

Current clinical decision support systems (CDSSs) typically base their predictions on correlation, not causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than designing clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to effectively support collaborative clinical decision-making. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results establish guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.

Enes Causal Discovery

2603.24436v3 by Alexis Kafantaris

Enes: the proposed architecture is a mixture of experts, which allows model entities, such as the causal relationships, to be further parameterized. More specifically, an attempt is made to exploit a neural network, as implementing neurons poses a great challenge for this dataset. To explain, a simple and fast linear model based on the Pearson coefficient usually achieves good scores, making it an aggressive baseline that only a very good model can overcome. Moreover, there are major limitations when it comes to causal discovery from observational data. Unlike in the Sachs case, no interventions were used, only prior knowledge; the most prohibitive limitation, that of the data, is addressed. Thereafter, the method and the model are described, and then the results are presented.

From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring

2603.23990v1 by Nizam Kadir

Monolithic Large Language Models (LLMs) used in educational dialogue often behave as "black boxes," where pedagogical decisions are implicit and difficult to audit, frequently violating instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture that separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator coordinating specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as "attempt-before-hint" and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality via human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance, and Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a "Mastery Gain Paradox," where monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by utilizing stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.
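
To make the separation of decision-making from wording concrete, the sketch below pairs the textbook Bayesian Knowledge Tracing update with a tiny rule-based action selector that enforces "attempt-before-hint" and a hint cap. It is an illustration under assumptions, not the ES-LLMs implementation: the parameter values, the action names, the mastery threshold, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # Illustrative values only; real systems fit these per skill.
    p_learn: float = 0.2   # P(unknown -> known) after a practice step
    p_guess: float = 0.2   # P(correct | skill unknown)
    p_slip: float = 0.1    # P(incorrect | skill known)


def bkt_update(p_known, correct, params):
    """Standard BKT step: Bayes posterior on the observation, then learning."""
    if correct:
        num = p_known * (1 - params.p_slip)
        den = num + (1 - p_known) * params.p_guess
    else:
        num = p_known * params.p_slip
        den = num + (1 - p_known) * (1 - params.p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * params.p_learn


def choose_action(attempts, hints_given, hint_requested, p_known,
                  hint_cap=3, mastery=0.95):
    """Deterministic rules; an LLM would only verbalize the returned action."""
    if hint_requested and attempts == 0:
        return "prompt_attempt"      # attempt-before-hint constraint
    if hint_requested and hints_given >= hint_cap:
        return "encourage_retry"     # hint cap reached
    if hint_requested:
        return "give_hint"
    if p_known >= mastery:
        return "advance"
    return "pose_problem"


params = BKTParams()
after_correct = bkt_update(0.5, True, params)
after_wrong = bkt_update(0.5, False, params)
print(after_correct, after_wrong)
print(choose_action(attempts=0, hints_given=0, hint_requested=True, p_known=0.3))
```

Because the action is chosen by explicit rules, every turn can be logged and audited independently of how the language model phrases it.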

Generative AI User Experience: Developing Human-AI Epistemic Partnership

2603.23863v1 by Xiaoming Zhai

Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human-AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human-AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.

Causal AI For AMS Circuit Design: Interpretable Parameter Effects Analysis

2603.24618v1 by Mohyeu Hussain, David Koblah, Reiner Dizon-Paradis, Domenic Forte

Analog-mixed-signal (AMS) circuits are highly non-linear and operate on continuous real-world signals, making them far more difficult to model with data-driven AI than digital blocks. To close the gap between structured design data (device dimensions, bias voltages, etc.) and real-world performance, we propose a causal-inference framework that first discovers a directed-acyclic graph (DAG) from SPICE simulation data and then quantifies parameter impact through Average Treatment Effect (ATE) estimation. The approach yields human-interpretable rankings of design knobs and explicit 'what-if' predictions, enabling designers to understand trade-offs in sizing and topology. We evaluate the pipeline on three operational-amplifier families (OTA, telescopic, and folded-cascode) implemented in TSMC 65nm and benchmark it against a baseline neural-network (NN) regressor. Across all circuits the causal model reproduces simulation-based ATEs with an average absolute error of less than 25%, whereas the neural network deviates by more than 80% and frequently predicts the wrong sign. These results demonstrate that causal AI provides both higher accuracy and explainability, paving the way for more efficient, trustworthy AMS design automation.
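
The Average Treatment Effect used above to rank design knobs can be illustrated with its simplest estimator. The sketch assumes an interventional setting, where a parameter is set to a "high" or "low" value in simulation and everything else is left unchanged, so the ATE is a difference of mean outcomes; the gain values and the model estimate are invented for illustration and are not results from the paper.

```python
from statistics import mean


def average_treatment_effect(outcomes_treated, outcomes_control):
    """Difference-in-means ATE, valid when treatment is assigned by intervention."""
    return mean(outcomes_treated) - mean(outcomes_control)


def relative_error(estimate, reference):
    """Absolute error of an estimate relative to a reference effect."""
    return abs(estimate - reference) / abs(reference)


# Hypothetical amplifier gain (dB) from simulation runs with one transistor
# width forced high ("treated") versus low ("control").
gain_wide = [42.0, 44.0, 43.0]
gain_narrow = [40.0, 41.0, 39.0]

simulated_ate = average_treatment_effect(gain_wide, gain_narrow)

# Hypothetical effect predicted by some learned model for the same knob.
model_ate = 2.4
error = relative_error(model_ate, simulated_ate)
print(simulated_ate, error)
```

An estimate with the wrong sign would recommend turning the knob in the wrong direction, which is why sign agreement matters alongside the size of the error.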

Medical explainable AI